Archive for May, 2012

SPEC RDBMS

2012/05/01

SPEC RDBMS(N, V) =

– – I wrote out this specification many years ago, partially to practice writing formal specifications, and partly to get clear definitions of the basic operations on tables in a relational database management system (RDBMS). Too many times, I see vague definitions and a couple of examples.  I wanted very precise definitions.  Somehow I can’t find those original notes, but since I was looking at Hadoop and Hbase, I decided to redo them.

– – The notation used is a variant of one developed by Butler Lampson, who taught it to me when we were both at Digital Equipment Corporation. It is simple and intuitive, and it is defined using simple set notation. Types have both state and operations (functions and procedures) defined on them to form a class. The parameters N and V in RDBMS(N, V) are classes with implied operators. Think of N as “names” which have an order operation and a concatenation operation. V is a class of “values” which can be thought of as a union of names and numbers. There is an order operation defined on V. where the numbers sort before the names.  Other interpretations of V are possible.

– – The notation X → Y denotes partial functions from X to Y, i.e. a function that is not necessarily defined for all x in X. If f: X → Y is a partial function, then f!x for x in X means that f is defined at x, and the value is then denoted f(x) as usual. We think of f in two ways: First f is a set of pairs (x, y) such that If (x,y1) In f And (x,y2) In f Then y1=y2. The second way is to think of X = {x1, x2, …xn} so that (f(x1), f(x2), … f(xn)) is a point in n-space. This is particularly fruitful when X is a set of names and the n axes are labeled with the names in X. This second way of thinking will be used in this spec for a RDBMS.

– – The notation VAR x: X defines a state variable of type X. This construct can be followed with a “:=” to initialize x, or it can be followed with a vertical bar “|” preceding a condition. One reads the latter as “Choose an x such that condition is true.” No method to construct such an x need be given, and all construction options are open to the implementor. Details are discussed in [1].

Type

ColumnNames = CN = N

Row = Tuple = CN → V

– – think of a row or tuple as an n-tuple whose components are labeled by column names. Geometrically, a tuple is a point in n-space, and it is sometimes helpful to think geometrically about such points. On the other hand, thinking of this n-tuple as a row vector, and soon as a row of a matrix is also fruitful. Note that both of these interpretations needs an order defined on the column names for visualization. This order is not part of the formal definition. When dealing with multiple partial functions r: CN → V , it is useful to think of their domain as large enough to include the domains of all rows of interest, in which case, a particular r may not be defined on a particular column name cn. One often says then that r is “Null” at cn.

Type

Table = Relation = Set(Row) = Set(Tuple)

– – There are two ways to think of a table (= relation). The first is as a set of points in n-space, and the second is as a matrix whose columns have column-names, and whose rows may or may not have names. There is no defined order of columns or of rows. Thus the first of these interpretations needs to order the columns for visualization of points in n-space, and the second also needs to order the rows to obtain a matrix that one can write down and/or print. Entries in this matrix where the row partial function is not defined at a particular column name are sometimes said to have the Null value; although in these notes Null is not otherwise defined and is definitely not an element of V. A Null represents a lack of a defined value. Note that since this definition of Table is a set, a table cannot have identical rows. If this is desired for identical rows r and r’, then an additional column-name cn needs to be added with r(cn) not equal to r'(cn). With the image of a table T as a matrix T with orders on the column names c1, …, cn and rows r1, … rm, one often identifies the jth column name with the jth column whose value in row ri is tij = ri(cj).

Function domain(T: Table) Returns Set(CN) = Return (Union {r.domain| r In T})

Notation, we write T.domain for domain(T)

Function Projection(C: Set(CN), T: Table) Returns Table =
Begin
VAR T’ : Table := { }
For r In T Do
..VAR r’: Row := { }
..For cn In C Do If r!cn Then r'(cn) := r(cn) Done
..If r’ = { } Then Skip – – get a new r
..T’ += r’
..Done
Return(T’)
End – – Projection

– – Here are three interpretations: 1. If one views T as a set of points in n-space, and C is a subset of the axes, then T’ is the projection of T onto the orthogonal space whose axes are named by C. 2. Viewing T as a matrix, then Projection(C,T) is just the matrix whose columns are the subset of those of T that are defined by C and with any all-Null rows omitted. 3. In SQL, if cn1, cn2, … cnn are the column names in C then one writes SELECT cn1, cn2, … cnn FROM T.

Type

Predicate = Function(Row) Returns Boolean

– – A predicate p(r) is usually an equality or inequality expression involving the values of r.

Function Subset(T: Table, p: Predicate) Returns Table =
Begin
VAR T’: Table := { }
For r In T Do If p(r) Then T’ += r Done
Return(T’)
End – – Subset

– – As above there are three interpretations: 1. If the r In T are viewed as points in n-space, and p defines the cloud of “true” points, then T’ is the set of points inside this cloud. 2. Viewing T as a matrix, T’ is the subset of rows r of T for which p(r) is true. 3. In SQL one writes SELECT * From T WHERE p(r). The * indicates to select all columns.

– – Most SELECT statements involving only one table are a projection of a subset, written
SELECT cn1, cn2, … cnn FROM T WHERE p(r). In other words, select the rows where p is true and then select some columns from the result.

Function Union (A: Table, B: Table) Returns Table =
Begin
VAR T: Table := {r: Row | r In A Or r In B}
Return(T)
End – – Union

– – If A and B are sets of points in n-space, then Union(A, B) is their set union. If A and B are thought of as matrices with the same column names, then Union(A, B) is the matrix whose rows are the union of the rows in A with the rows in B. If A and B have differences in their column names, then one first takes the union of their domains, extends each row of A and B with Nulls for the column names on which the row is not defined, and then takes the union of the rows. An example of Union is given in the discussion of OuterJoin below.

Function Product(A: Table, B: Table) Returns Table =
Begin
VAR AN: Set(CN) := {a.cn | cn In A.domain}, BN: Set(CN) := {b.cn | cn In B.domain}
VAR T: Table := { }
For (r,s) In (A,B) Do
..VAR t: AN + BN → V := { } – – Note the pair (r,s) defines a new row t.
..For a.cn In AN Do If r!cn Then t(a.cn) := r(cn) Done
..For b.cn In BN Do If s!cn Then t(b.cn) := s(cn) Done
..T += t – – add this newly defined row t to T
..Done
Return(T)
End – – Product

– – The row t of Product(A, B) defined above is often written t = (r, s).

– – First note that AN + BN is the (disjoint) union of the domains of tables A and B. Thus the product table T has the original column names of table A, prefixed with “a.” plus the original column names of table B, prefixed by “b.” Now since (A,B) is the Cartesian product = all pairs (r,s) with r In A and s In B, if table A has m rows and table B has n rows, then table T will have m*n rows.

– – While it blows some people’s minds, there is a geometric interpretation. If we view A as a set of points in m-space and B as a set of points in a (different) n-space, then Product(A, B) is the Cartesian product of these two sets of points.
– – In practice, while conceptually appealing, Product(A,B) is rarely formed. It is just too big. The operation that most frequently is used is called a Join. Most Joins of two tables A and B can be defined as projections of subsets of Product(A, B). Now a Join isn’t computed by first computing the product, but rather it is computed more directly, and the database folks have done a lot of work optimizing the calculation of Joins. Here we will just define them. At this point, I’ll note that Wikipedia has a very nice article on Joins at http://en.wikipedia.org/wiki/Join_(SQL) . In particular, the author of that article has a simple example, used below, that seems to illustrate all the relevant concepts.

– – The simplest, and the most common Join is defined by a single column name cn that is common both to A and to B. The subset predicate is for t = (r, s) in Product(A, B), p(r,s) is true if and only if t(a.cn) = t(b.cn), i.e., r(cn) = s(cn).

Example

A = Employee table

LastName

DepartmentID

Rafferty 31
Jones 33
Steinberg 33
Robinson 34
Smith 34
John NULL

B = Department table

DepartmentID

DepartmentName

31 Sales
33 Engineering
34 Clerical
35 Marketing

Note the John has not been assigned a DepartmentID, and Marketing (DepartmentID = 35) has no one in it. Since A has 6 rows and B has 4 rows, the Product(A, B) has 24 rows. Since A has 2 columns and B has 2 columns, Product(A, B) has 2+2 = 4 columns. It is a modestly interesting exercise to write out this 24 by 4 matrix.

If cn = DepartmentID, then the Join above is

a.LastName

a.DepartmentID

b.DepartmentID

b.DepartmentName

Rafferty 31 31 Sales
Jones 33 33 Engineering
Steinberg 33 33 Engineering
Robinson 34 34 Clerical
Smith 34 34 Clerical

The reason John and Marketing do not appear in this result is that the predicate will not return True when evaluating Nulls.

The SQL for the above “Inner Join” has two forms. The first has an explicit Join operator and the second has an implicit Join:

SELECT * FROM Employee INNER JOIN Department ON a.DepartmentID = b.DepartmentID;

and

SELECT * FROM Employee, Department WHERE a.DepartmentID = b.DepartmentID;

A longer form of both uses “Employee” for “a” and “Department” for “b”.

Sometimes, the redundant column b.DepartmentID is removed by projection, and the remaining columns are renamed to yield:

LastName

DepartmentID

DepartmentName

Rafferty 31 Sales
Jones 33 Engineering
Steinberg 33 Engineering
Robinson 34 Clerical
Smith 34 Clerical

The SQL/92 for this SELECT * FROM Employee, Department USING DepartmentID.

Unfortunately not all RDBMS systems support the USING clause.

Exercise 0: Write out the spec for

Function InnerJoin(A: Table, B: Table, cn0) Returns Table

directly, i.e. without using Subset and Product. This code is very similar to that of Product.

Definition: A subset of columns C of a table is a primary key if for every r In T

  1. r!cn for every cn In C
  2. For every s In T, If r(cn) = s(cn) for every cn in C, Then r = s.

The column DepartmentID is a primary key for the table Department in the above examples.

Left outer join

The result of a left outer join for table A and B always contains all records of the “left” table (A), even if the join-condition does not find any matching record in the “right” table (B). This means that if the ON clause matches 0 (zero) records in B, the join will still return a row in the result—but with NULL in each column from B. This means that a left outer join returns all the values from the left table, plus matched values from the right table (or NULL in case of no matching join predicate).

Given the requirement that even if the Join cannot find a matching row in B for a row r in A, the row r remains part of the result, with Nulls filling all the b.cn columns representing B column names, this result cannot be part of Product(A, B). It follows that we must revisit and modify the code for the Function Product (or InnerJoin, if you did Exercise 0) in order to define LeftOuterJoin.

– – When cn0 is a primary key for B we can easily define/construct efficiently:

Function LeftOuterJoin(A: Table, B: Table, cn0) Returns Table =
Begin
Assert cn0 is a primary key for B
VAR AN: Set(CN) := {a.cn | cn In A.domain}, BN: Set(CN) := {b.cn | cn In B.domain}
VAR T: Table := { }
For r In A Do
..VAR t: AN + BN → V := { } – – Note each row r in A defines a new row t in the Join.
..For a.cn In AN Do If r!cn Then t(a.cn) := r(cn) Done – – this loads up the a.cn side of t.
..If r!cn0 Then
….Begin
….VAR s In B | s(cn0) = r(cn0) – – such s is unique since cn0 is a primary key for B
….For b.cn In BN Do If s!cn Then t(b.cn) := s(cn) Done – – this loads the b.cn side of t.
….End
..T += t – – add this newly defined row t to T, even if no matches occur in B
..Done
Return(T)
End – – LeftOuterJoin

– – Even when cn0 is NOT a primary key for B we can define a RightOuterJoin(A, B, cn0) which will add the rows of B for which there is no matching (on cn0) row in A.

Function RightOuterJoin(A: Table, B: Table, cn0) Returns Table =
Begin
VAR AN: Set(CN) := {a.cn | cn In A.domain}, BN: Set(CN) := {b.cn | cn In B.domain}
VAR T: Table := { }
VAR t: AN + BN → V
VAR foundNoMatchInA: Boolean
For s In B Do
..foundNoMatchInA := True
..For r In A Do
….If r!cn0 Then
….Begin
….t := { }
….For b.cn In BN Do If s!cn Then t(a.cn) := s(cn) Done – – loads b.cn side of t
….For a.cn In AN Do If r!cn Then t(a.cn) := r(cn) Done – – loads a.cn side of t.
….T += t
….foundNoMatchInA := False
….End
….Done – – since cn0 is not a primary key, look for more matching rows r.
..If foundNoMatchInA Then – – Add this row s anyway
….Begin
….t := { }
….For b.cn In BN Do If s!cn Then t(b.cn) := s(cn) Done – – t is Null on all a.cn
….T += t
….End
..Done
Return(T)
End – – RightOuterJoin

There is no standard SQL for Outer Joins, but each vendor modifies their SQL to include it, e.g.

SELECT * FROM Employee LEFT OUTER JOIN Department ON

a.DepartmentID = b.DepartmentID;

Example of a LeftOuterJoin: 

a.LastName

a.DepartmentID

b.DepartmentName

b.DepartmentID

Jones 33 Engineering 33
Rafferty 31 Sales 31
Robinson 34 Clerical 34
Smith 34 Clerical 34
John NULL NULL NULL
Steinberg 33 Engineering 33


Example of RightOuterJoin which drops the information about John having no department, but adds the information that Marketing has no people:

a.LastName

a.DepartmentID

b.DepartmentName

b.DepartmentID

Smith 34 Clerical 34
Jones 33 Engineering 33
Robinson 34 Clerical 34
Steinberg 33 Engineering 33
Rafferty 31 Sales 31
NULL NULL Marketing 35


When cn0 is NOT a primary key, on can still define a LeftOuterJoin'(A, B, cn0) to be equal to RightOuterJoin(B, A, cn0). Instead of getting exactly the number of rows of A for the join, one can get more.

Function OuterJoin(A: Table, B: Table, cn0) Returns Table =

Begin

Return(Union(LeftOuterJoin(A, B, cn0), RightOuterJoin(A, B, cn0)))

End – – OuterJoin

Example of OuterJoin(A, B) where the information contains both the fact that John has no department as well as Marketing having no people:

a.LastName

a.DepartmentID

b.DepartmentName

b.DepartmentID

Smith 34 Clerical 34
Jones 33 Engineering 33
Robinson 34 Clerical 34
John NULL NULL NULL
Steinberg 33 Engineering 33
Rafferty 31 Sales 31
NULL NULL Marketing 35

Exercises that give clues for implementation:

  1. Write out the code for LeftOuterJoin’ (the version that doesn’t assume cn0 is a primary key) without using RightOuterJoin.
  2. Write out the code for OuterJoin without using Union, LeftOuterJoin, or RightOuterJoin. Do it without the assumption that cn0 is a primary key for B, and then again when it is.

[1] Butler Lampson, et al, “Principles of Computer Science”, MIT Lecture Series, circa 1990.

Advertisements