RelaX - Help

Tutorial - user

The core concepts

The goal of this tutorial is to give a quick introduction how to use the relational algebra calculator and its concepts. It assumes that you already know the relational algebra or are learning it from other sources.

There is no real standard for the relational algebra like there is for SQL. So every book or teacher might have its slightly different interpretation and notation.
The goal of this progam was to support the most commonly used "mathematical" notation used by Database Systems The Complete Book 2nd Edition by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, Datenbanksysteme: Eine Einführung by Alfons Kemper and André Eickler and Wikipedia - Relational Algebra and others.

relations

The core element of the calculator is the relation (or table) which consists of a fixed number of attributes (or columns) in a fixed order (this is called the schema of the relation) and a set of tuples or rows containing the specific values.

Each attribute has three distinct properties: its type, its position and its name.

The type or domain of an attribute is either string, number, date or boolean.
The type is used for example to determine if two values can be compared in a boolean expression or if two schemas are union compatible. In most cases the type of the attributes are obvious if you look at the values.

The position of each attribute in a schema is fixed and can be used to adress the attributes.
An example would be the projection of the first and third attribute or column of an arbitrary relation R: π [1], [3] ( R )

The full qualified name of the attribute is a unique identifier of the attribute within the schema of its relation. It consists of the name itself and a relation qualifier and are written like in SQL as R.a where a is the name and R is the relation qualifier.
An example would be the projectoin of the attributes a, b from a relation R: π R.a, R.b ( R )
The default relation qualifier of each attribute is the name of its relation.
If the attributes name without the qualifier is unique within the relation's schema, it also can be used to address a specific attribute.
The previous example could also be written as π a, b ( R ).

Each relation has a set of tuples (or rows). This means that there are no duplicate tuples within one relation and the duplicate-elimination is implicitly executed after every single step of the calculation.
The tuples in the calculator have a defined order and unlike a normal database system all operations are implemented to preserve that order. This should help the users to see what has changed from one step to the next.

so far we covered that

  • relations are the core elements,
  • relations have a schema and a set of tuples,
  • each attribute in the schema has
    • data-type
    • a position
    • a fully qualified name (RELATION.attributename)
  • and that each attribute can be addressed within an operation using
    • its position e.g. π [1], [2] ( R ),
    • the attribute name e.g. π a, b ( R )
    • or its full quallified name if the unqualified is not unique e.g. π R.a, S.a ( R x S )

Alternative plain text notation

Before we introduce how to use the operators this should be a quick introduction of a very handy feature of the relational algebra calculator: the alternative plain text notation

The "classic" mathematical notation uses greek letters like π, σ for the unary operations and special symbols like the join symbol or the union symbol for some binary operations.
This symbols can be entered using the toolbar at the top of the editor.

This calculator also supports a alternative syntax for all this symbols that follows two very simple rules: Every greek letter can be substituted with its name spelled out ("pi" for π, "gamma" for γ) and every other symbol has an equivalent name that is borrowed from SQL, programming languages like C and Set theory.

This substitutions should be easy to read and much more important very easy to write because you don't need any toolbar or mouse. The calculator also supports autocomplete: just press [CTRL]+[SPACE] to complete the current keyword.
This feature should help you to write your statements more quickly and fluently.

π R.a, S.a, S.b σ R.a = S.a ∧ ( R.a > 5 ∨ R.a < 0 ) ( R ⨯ S ) is equivalent to: pi R.a, S.a, S.b sigma R.a = S.a and ( R.a > 5 or R.a < 0 ) ( R cross join S ) In the following table you can see a list of all supported substitutions:
classical notation alternative notation
π pi
σ sigma
ρ rho
τ tau
γ gamma
intersect
union
- \
÷ /
  • x
  • cross join
  • join
  • inner join
  • natural join
  • left join
  • left outer join
  • right join
  • right outer join
full outer join
left semi join
right semi join
  • anti semi join
  • anti join
<- pi new_name <- a ( R )
-> pi a -> new_name ( R )
  • schema preserving operations - operations where the resulting relation has the same schema as its first argument-relation:
    • selection
    • union
    • intersection
    • subtraction
    • orderby
    • left outer join
    • left semi join
    • anti join

Relational algebra

For this Part we use the "bank example" Dataset with 3 relations: Customers, Accounts and PremiumCustomers. By convention relations start with a uppercase letter and attributes with a lower case letter.

Open and inspect dataset

Open the dataset used in this tutorial using the following link to the "bank example" Dataset.

You find the relations and their attributes listed on the side and if you hover a relations name a preview of the first view tuples is displayed.

The most basic query

After you have found the Dataset you can formulate the very first and most basic query in relational algebra: a relation without any further manipulation.

Just enter the name of a relation into the code editor or click on the relation/attribute names to insert them into the code editor.
Note that the editor supports auto completing the relation/attribute names of the current dataset and the operators with [CTRL]+[SPACE]

So if you want all tuples of the relation Customer you should have the following statement: Customer. And get all the tuples if you press the execute button or press [CTRL]+[RETURN].

Unary operations

All unary operations have the same basic syntax FUNCTION ARGUMENT ( CHILD_EXPRESSION ).

The braces around the CHILD_EXPRESSION can be omitted. In this case the predefined operator precedence for relational algebra applies.

A complete list of the supported relalg operations can be found here: general syntax, unary operations and binary operations.

The projection is one of the basic operations that allow to choose which of the attributes of the parent relations should be included in the new one and in which order they should be.

Renaming a relation (ρ) changes the qualifiers of all the relations attributes but does not touch the tuples.

Renaming an attribute (ρ) only changes the name of a specific attribute (and leaves his relation-qualifier unchanged).

The statement pi balance ( Accounts ) returns a new relation with only the balance attribute.

The next statement gets the balance with the account-id after renaming the relation to A and renames one of the attributes.

rho account_number <- aid ( pi aid, A.balance ( rho A ( Accounts ) ) )

Like in SQL or most programming languages you can format your statement and use SQL like comments (with -- ... or /* ... */) to increase the readability.

The next statement uses a selection to filter the tuples of a relation based on a boolean expression. The calculator supports complex boolean expression with functions and built in operator precedence.
The attributes in the boolean expression can be given as name or numeric position like with the projection.

The next statement selects all customers-ids of customers who have the same value for their firstname and lastname.

-- this should return an relation with no tuples: pi cid ( sigma firstname=lastname ( Customers ) )

The next example uses a more complex expression to get all accounts with a balance over 100 or under -100.

sigma balance > 100 or (balance*-1 > 100) -- (balance < -100) would also be correct Accounts

As a shorter alternative you can use a function in your expression to get the same result:

sigma abs(balance) > 100 Accounts

Tutorial - maintainer

Everybody can provide datasets that can be used in the relational algebra calculator and share them with others.
We assume the scenario of a teacher wanting to provide a dataset for his/her students for this short tutorial.

The datasets are specified in a simple text format and can be shared with others via GitHub Gists (a simple and free platform to share snippets).
The shared gist gets an unique ID and the relational algebra calculator can load the dataset directly using this ID.

Creating a dataset

The fist step is the creation of the dataset which is actually only a group of relation definitions with some additional information and is therefore refered as group in the specified format.
The relations can then be used by the students to formulate the there statements.
Lets assume we want to create a dataset of employees of a company.

Every group definition starts with a simple header which (at least) contains the name of the group: group: bank example every header field starts with the name of the field and is followed by a colon for single line values. The next (optional) header field we should specify is the description. It should contain information like who is maintaining this dataset or where to find additional information.
In the description Markdown can be used to format the text or set links to external resources.

In our example we add a description that takes more than a single line and therefore we enclose the value in double brackets instead of using the colon.

group: bank example description[[ the data for this dataset was generated using <http://www.generatedata.com/> * the relation _Customers_ contains basic information about the customers of the bank. The relation _Accounts_ contains the basic information of a single account. Note that a customer can have any number of accounts. * the relation _PremiumCustomers_ contains the customer-ids of all customers with a total balance over 1000 ]]

The next step is to actually add the relations the students can use for their queries.
The relation definitions are use the relational algebra syntax that can be used in this tool.
Every relation is defined by a single variable assignment where the name of the variable is used as the relations name and the result of the expression defines the relation.
The relalg expression can use all features that can be used in the tool and can also use other relations defined within the same dataset.
Note that the name of the relation is used as the qualifier of each attribute/column.

For the relation Customers relation we use the inline relation syntax as the most basic method to define the relation and can also be edited using the relation editor which is a simple spread-sheet like editor.
The inline relations in combination with the editor should be very easy to use if you enter the data directly or if you have them as a csv/spread-sheet file.

group: bank example description[[ the data for this dataset was generated using <http://www.generatedata.com/> * the relation _Customers_ contains basic information about the customers of the bank. * the relation _Accounts_ contains the basic information of a single account. Note that a customer can have any number of accounts. * the relation _PremiumCustomers_ contains the customer-ids of all customers with a total balance over 1000 ]] Customers = { cid firstname lastname 1 Dexter Simpson 2 Kaseem Gallagher 3 Kuame Hamilton 4 Robert Thompson 5 Rhiannon Valentine 6 Calvin Mays } Accounts = { aid, cid, balance:number 1, 1, 66 2, 1, -304 3, 2, 272 4, 3, 3472 5, 4, 975.7 6, 4, 93 7, 5, 534 8, 5, -75.5 }

As we can see the name of the relations are defined by the assignments.
The inline-relations are enclosed by curly brackets and contain the names of the attributes/columns in the first line and then a tuple/row per line where the values are simply separated by whitespace. You can also other seperators and can define the types explicitly as we can see at the Accounts relation. For further information can be found at inline relation description.

The last relation we need to add in this example is the relation containing the banks premium Customers.
They are specified by using the other two relations in a simple relational algebra statement that selects all customers with a total balance over 1000.

group: bank example description[[ the data for this dataset was generated using <http://www.generatedata.com/> * the relation _Customers_ contains basic information about the customers of the bank. * the relation _Accounts_ contains the basic information of a single account. Note that a customer can have any number of accounts. * the relation _PremiumCustomers_ contains the customer-ids of all customers with a total balance over 1000 ]] Customers = { cid firstname lastname 1 Dexter Simpson 2 Kaseem Gallagher 3 Kuame Hamilton 4 Robert Thompson 5 Rhiannon Valentine 6 Calvin Mays } Accounts = { aid, cid, balance:number 1, 1, 66 2, 1, -304 3, 2, 272 4, 3, 3472 5, 4, 975.7 6, 4, 93 7, 5, 534 8, 5, -75.5 } PremiumCustomers = pi cid ( sigma sum > 1000 ( gamma cid; sum(balance)->sum ( Accounts join Customers ) ) )

We have now seen how to define a dataset with its header containing the name and the description followed by relational algebra assignments defining the relations of the dataset.

We can paste this definition directly in the Group Editor to load and use it.
In the next section we want to look at how we can publish this definition so that we give our students a single url or id to directly load the dataset.

Share a dataset

If we want to share the definition of a dataset with other people we could simply give them the definition and they load it using the group editor, but that would not be that practical for most cases.

The simpler solution (for the users) is to publish the definition as a GitHub Gist and share the ID of the gist with others.

Just create a gist with the definition as its content (the filename does not matter) and publish it. The ID of the Gist can now be found at the top of the page as gist:xxxxxxxxxxxx or in the url after the last slash.

This ID can then be shared and loaded in the interface or the calculator can be called directly with a specific ID by using using the parameter ?data=gist:xxxxxxxxxxxxxx.

For example the simple bank definition of this tutorial has been published as a Gist with the ID 2cfb981fbc5676182d64 and can therefore be loaded directly by appending ?data=gist:2cfb981fbc5676182d64 to the url of the calculator.

Reference - relational algebra

General syntax

assignment

syntax NAME = EXPRESSION

Defines a new local variable with the name NAME; its content is defined by EXPRESSION

The name of the new relation must be unique.

The definition has to be executed with the statement.

TestA = π a,b R TestB = σ d > 0 S -- statement using the variable TestA join TestB

An assignment (= definition of a variable) is no valid relational algebra expression on its own. If you miss the acutal query a error is thrown (Error: only assignments found; query is missing).

-- this is the definition of the variable Test = π Customer.firstname, surname ( Customer ) -- this is the actual query/statement using the variable Test

There is a known problem when the last assignment ends with a natural join and the query consists of a single relation:

A = S join R A -- this is the query

The statement is ambiguous and the parser interprets it as A = (S join R A) where R is interpreted as a column argument for the theta-join and therefore detects a cyclic usage of the variable A.

Solution: To get the expected behaviour you have to set braces around the assigned expression like A = (S join R) A

single-line comment

syntax -- COMMENT_TEXTEXPRESSION

the '--' must be followed by at least one whitespace charater! inserts a comment; its text goes until the end of the line

comments are recognized as whitespace

Test = -- This is the expression that is assigned to Test π Customer.firstname, surname ( Customer )

pre defined relation

syntax RELATION_NAME

Uses a pre defined relation with the name RELATION_NAME

The code completion only works for this relations.

( Customers ) cross join ( Accounts )

multi-line comment

syntax /* COMMENT_TEXT */ EXPRESSION

inserts a comment that can span multiple lines

comments are recognized as whitespace

/* This is a very very long comment */ Test = π Customer.firstname, surname ( Customer )

inline-relation

syntax {COLUMN_NAME_1:COLUMN_TYPE_1 ... ROW_1 ROW_2 ... }

The inline-relation is a temporary relation that can be defined directly in the statement. It is only valid in the defining statement

Every inline-relation is a valid expression and thus can be used at any position a EXPRESSION is expected.

The inline-relation is defined by a header, that specifies the schema of the relation and the rows containing the values and is surrounded by curly brackets.

The header is defined by a sequence of QUALIFIER.COLUMN_NAME:COLUMN_TYPE separated by any whitespace, comma or semicolon. The QUALIFIER is optional. Also the COLUMN_TYPE can be omitted if the type is well defined by the values of that column. The first non null value of a column defines its type.
True and false (case insensitive without quotes) are reserved for a boolean type. They can be used as a simple string but they do not define the type of the column as string.
The COLUMN_TYPE can be one of the following
  • string
  • number
  • date
  • boolean

The rows of the relation are defined by a list of values per row with the type of the coresponding column. The values are separated by whitespace comma or semicolon.
Simple strings only containing letters, numbers, hyphens, underlines, dots or periods ([0-9a-zA-Z\-_\.]+) can be written without single quotes. NULL and null are constant values. If null, true or false should be used as string they have te be quoted.
More complex strings must be surrounded by single quotes: 'content' or content but '' or 'long content containing spaces and special characters like !' or 'null'.
Dates are written in ISO-format: YYYY-MM-DD without single quotes
A null-value can be written as null or NULL (without single quotes).
Numbers could be integers in the form (-?[0-9]+) or floats in the form (-?[0-9]+\.[0-9]+).
Numbers in single quotes are recognized as numbers if the column type is defined as number or has been detected to be number from a previous value; otherwise it will be a string value..
A boolean value is denoted as either true or false (case insensitive).

The header and rows can be indented if needed.

-- type for column b is defined by the first value rho A {a:number, b 1, 2 3, 4 } cross join { a:string X.b:date c:number Alpha 1970-01-01 1 'Beta 2' 1970-01-02 3 '' 1970-01-03 4 }

relational algebra expression

A valid relational algebra expression is built by connecting relation-name or inline-relation as atoms with the defined unary and binary operators.

So a relational algebra expression is recursively defined as follows:

Unary operations

Each unary operation follows the following syntax:
FUNCTION ARGUMENT ( CHILD_EXPRESSION )
The parentheses are Optional.

projection

symbol π
alternaive syntax pi
example pi a, b ( R )

The argument is a subset of columns of the schema of the CHILD_EXPRESSION or a value expression

π Customer.firstname, surname ( Customer )
pi c.id, [1] ( ρ c ( Customer ) )
Expressions can be used to create more complex statements using one or more columns of a single row.
pi c.id, lower(username)->user, concat(firstname, concat(' ', lastname))->fullname ( ρ c ( Customer ) )
The virtual column ROWNUM used in previous versions is not supported any more but the rownum() expression can be used to get the same information. And it can also be used directly in the boolean condition of a selection or join.
In this example the top 5 customers with the most orders are selected, where countOrders could be the result of a previous aggregation. pi firstname, lastname sigma rownum() <= 5 tau countOrders desc Customer

selection

symbol σ
alternaive syntax pi
example sigma a > 2 ( R )

The argument is a boolean expression that each row of CHILD_EXPRESSION is checked on

σ firstname = 'Bob' or firstname = 'Alice' ( Customer )
σ (id > 10 and id < 100) or id = 42 ( Customer )
Selecting all customers with a firstname that has an even length. σ mod(length(firstname),2) = 0 ( Customer )

rename relation

symbol ρ
alternaive syntax rho
example ( R ) join R.a = X.b (rho X ( R ))
The argument is the new name for the Relation returned by CHILD_EXPRESSION
rename the Relation from "Customer" to "a": π a.id, a.firstname ( ρ a ( Customer ) )

rename column

symbol ρ
alternaive syntax rho
example the old and the new column names in a list (see example)
"←" can be substituted with "<-" pi x, b rho a->x {a, b 1, 2 3, 4 }
The argument is the old and the new column names in a list (see example)
"←" can be substituted with "<-"
rename the columns id and firstname to myId and foobar: ρ myId←id, foobar←firstname (π id, firstname ( Customer ) )

order by

symbol τ
alternaive syntax tau
example tau a asc, b desc ( R )
The argument is a list of columns by which the relation should be ordered (see examples)
order the result by the first column (default is ascending) and the second column descending: τ [1], firstname desc (π id, firstname ( Customer ) )

group by

symbol γ
alternaive syntax gamma
example gamma a, count(*) ( R )
The argument is a list of columns to group by, separated by commas followed by a semicolon
and a list of aggregate functions to apply with their new name in form AGG( COLUMN ) -> NEW_NAME
order the result by the first column (default is ascending) and the second column descending: γ a, b ; sum(c)->x ( Customer )

If no grouping columns are provided the entire relation is the group.

supported aggregate functions by type:
  number string date
COUNT( * ) yes yes yes
COUNT( column ) yes yes yes
MIN( column ) yes yes yes
MAX( column ) yes yes yes
SUM( column ) yes no no
AVG( column ) yes no no

Binary operations

Each binary operation follows the following syntax:
( CHILD_EXPRESSION ) FUNCTION ARGUMENT ( CHILD_EXPRESSION )
The parentheses are Optional.

intersection - ∩

symbol
alternaive syntax intersect
no argument
( Customer ) ∩ ( Customer )
the schemas must be unifiable

union - ∪

symbol
alternaive syntax union
no argument
( Customer ) ∪ ( Customer )
the schemas must be unifiable

division

symbol ÷
alternaive syntax /
no argument
( Customer ) ÷ ( Customer )
the schemas must be unifiable

subtraction / set-difference

-
alternaive syntax \
except
no argument
( pi firstname ( Customer ) ) - ( rho test<-lastname ( pi lastname ( Customer ) ) )
the schemas must be unifiable

cross product

symbol
alternaive syntax cross join
no argument

Theta-join / θ-join

symbol
alternaive syntax join
inner join
join condition

Special case:
The name of a single boolean column (like R join a S) can not be used directly as a join condition due to ambiguities in the relational algebra syntax.
The column must either be specified with its qualifier (R join R.a S) or wrapped in parentheses (R join (a) S).
This is not necessary for more complex boolean expressions. The problem is only that the single column name can not be distinguished from a relation name: the expression X=R join S X could be interpreted as A=(R join S A) instead of A=(R join S) A.

natural join

symbol
alternaive syntax join
natural join
no argument
( ρ a ( Customer ) ) a.name < b.name ( ρ b ( Customer ) )

left outer join

symbol
alternaive syntax left outer join
left join
optional join condition; if no join condition is given it acts as a natural left outer join

right outer join

symbol
alternaive syntax right outer join
right join
optional join condition; if no join condition is given it acts as a natural right outer join

full outer join

symbol
alternaive syntax full outer join
optional join condition; if no join condition is given it acts as a natural full outer join

left semi join

symbol
alternaive syntax left semi join
no argument

right semi join

symbol
alternaive syntax right semi join
no argument

anti semi join

symbol
alternaive syntax anti semi join
anti join
no argument

Operator precedence

The operator precedence allows to obmit most of braces.
The used precedence is shown in the table below.
All operators are left associative.

order of precedence Operator
0 relation-name, inline-relation
1 projection, selection, rename (columns), rename (relations), group, order by
2 cross product, theta join, natural join, left outer join, right outer join, full outer join, left semi-join, right semi-join, anti semi-join, division
3 intersection
4 union, subtraction
A join B x C
is equal to
((A join B) x C)
because the cross product and the natural join are in the same precedence class.
sigma true A join sigma true B
is equal to
(sigma true (A)) join (sigma true (B))
because the unary operators have a higher precedence than the binary operators.

Misc

Column

column is either
  • the name of a column: "columnName"
  • the number of the column (starting with 1): "[column-number]"
  • a full qualified column: "qualifier.columnName"
  • a value expression (if allowed for the specific operation)
the qualifier is optional if the column is unique in its schema.

Value expressions

With most operators you can use a value-expression which connects one or more columns of a single row to calculate a new value. This is possible for:
  • the projection creating a new column (make sure to give the column a name)
  • the selection any expression evaluating to boolean can be used
  • for the joins any expression evaluating to boolean can be used; note that the rownum() expression always represents the index of the lefthand relation
The expressions always operate on a single row/tuple of a table/relation.
If you want to calculate values vertically over all values of a specific column/attribute you need to use group by with an aggregate function. The following table lists all functions and operators that can be used in an expression. They can be combined and nested in any arbitrary order but note that they do evaluate to a single type defined by the outer most expression. The following operators can be used:
syntax returns type description
a AND b boolean logical and
a OR b boolean logical or
a XOR b boolean logical exclusive or
NOT b boolean logical not
a = b a != b a < b a > b a <= b a >= b a != b boolean compares two values of the same type
a:string LIKE 'PATTERN' boolean returns true if expression evaluating to a string a matches the pattern given as the second operand.
The pattern has to be given as a string literal;

An underscore (_) in the pattern stands for any single character and any percent sign (%) stands for any sequence of zero or more characters.

a:string ILIKE 'PATTERN' boolean same as LIKE but matches case-insensitive.
This is not in the SQL standard but is a PostgreSQL extension.
a + b a - b a * b a / b a % b number arithmetic addition, subtraction, multiplication, division, modulo
rand() number returns a random number in the interval [0, 1]
rownum() number returns the index of the current row (starting with 0).
If the function is used in a binary relational algebra expression (e.g. a join) it always represents the index of the left operand.
length(a:string) number returns the length of the string
date(a:string) date parses the given string to a date object. The string must be in the form YYYY-MM-DD
adddate(a:date, b:number) date adds the given number of days to date a
subdate(a:date, b:number) date subtracts the given number of days from date a
now() transaction_timestamp() statement_timestamp() date returns a timestamp representing the start of the query execution
transaction- and statement start are the very same value due to the lack of a transaction concept
clock_timestamp() date returns the current timestamp while executing the query
year(a:date) number returns the year component of a given date
month(a:date) number returns the month component of a given date as a number. (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
day(a:date) dayofmonth(a:date) number returns the day component of a given date as a number in the range 1 to 31
hour(a:date) number returns the hour component of a given date as a number in the range 0 to 23
minute(a:date) number returns the minute component of a given date as a number in the range 0 to 59
second(a:date) number returns the second component of a given date as a number in the range 0 to 59
concat(a:string [, ...]) string concatenates the given strings
upper(a:string) ucase(a:string) string converts the given string to upper-case
lower(a:string) lcase(a:string) string converts the given string to lower-case
strlen(a:string) number number of characters of the string
abs(a:number) number the absolute value of the given number
add(a, b) sub(a, b) mul(a, b) div(a, b) mod(a, b) number arithmetic addition,
subtraction,
multiplication,
division or
modulo of the given numbers
round(a) floor(a) ceil(a) number round to nearest integer,
largest integer not greater than the argument or
smallest integer not less than the argument
coalesce(value [, ...]) type of value returns the first of its arguments that is not null or null if all arguments are null. Note that all arguments must have the same datatype.
CASE WHEN condition THEN result [WHEN ...] [ELSE result] END type of result returns the first result where the condition evaluates to true. If all conditions are false the else part is executed or null is returnt if the else part is missing. Note that all results must have the same datatype.
The operator precedence is the same as used in MySQL (from strong to weak):
order of precedence Operators
0 functions, constants, columns
1 ! (boolean not)
2 - (unary minus)
3 *, /, %
4 -, +
5 = (comparison), >=, >, <=, <, <>, !=, LIKE, ILIKE
6 CASE, WHEN, THEN, ELSE
7 AND
8 XOR
9 OR, ||

Reference - SQL

The goal of the SQL mode of the relational algebra calculator is to provide a translation from SQL to relational algebra to show how they are related. It does not support all features a real database system like PostgreSQL or MySQL does because the goal is to provide a translation into relational algebra. This means that many features like correlated-substatements are not supported because the translation into relational algebra is not trivial and modern database systems use an extended set of operators internally that do not require a one-to-one translation into "classical" relational algebra. Therefore the learning effect for users of this tool would not be that big.

General syntax

All keywords are case insensitv.

The following Synopsis is a adapted version of PostgreSQL and shows the general syntax of the supported SQL. Brackets indicate optional parts. Braces and vertical lines indicate that one of the alternatives has to be chosen. Dots mean that the preceding element can be repeated.

[ WITH with_query [, ...] ] SELECT [ DISTINCT ] * | expression [ [ AS ] output_name ] [, ...] FROM from_item [, ...] [ WHERE condition ] [ GROUP BY column [, ...] ] [ HAVING condition ] [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ] [ ORDER BY column [ ASC | DESC ] [, ...] ] [ LIMIT { count | ALL } ] [ OFFSET start [ ROW | ROWS ] ] [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] where from_item can be one of: table_name [ AS alias ] with_query_name [ AS alias ] ( select ) AS alias from_item CROSS JOIN from_item from_item NATURAL JOIN from_item from_item [ INNER ] JOIN from_item ON join_condition from_item [ INNER ] JOIN from_item NATURAL from_item [ INNER ] JOIN from_item USING ( join_column [, ...] ) from_item from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN ON join_condition from_item from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN NATURAL from_item from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN USING ( join_column [, ...] ) from_item and with_query is: with_query_name AS ( select )

Semantic and Translation to relational algebra

Sequence of execution

The SQL statement is translated directly into relational algebra. To understand some of the effects of the tool it might be helpful to understand the steps of the translation process. As mentioned before, real database systems might take a different (more complex) aproach but this should help to get an idea of how SQL could be translated into "classical" relational algebra.

The following list shows the translation from SQL into relational algebra starting with the inner most relational algebra expression at the top.

  1. replace all usages of the temporary-tables defined in the WITH-clause with their definitions
  2. FROM-clause is translated left associative with all joins having the same precedence (, is a cross join)
  3. selection with condition from where-clause is added
  4. group by
  5. selection with condition from having-clause
  6. union/intersect/except
  7. projection is added to choose the requested columns of the SELECT-clause
  8. the columns are renamed to get the requested output-name specified in the SELECT-clause
  9. order by
  10. limit/offset is mapped to a selection

SELECT

The direct translation into relational algebra with implicit duplication elimination requires the distinct keyword to be equivalent. A warning is shown if you omit it.

The select-clause is translated into up to two relalg operators.

  • The most basic case is select * ... where no changes are made to the schema of the result. Therefore no projection is needed. You can use the optional table-alias-prefix if the columns of a single table/relation should be returned only: select distinct R.* from R inner join S
  • When a subset of the columns are selected and/or reordered ( select a, b from ...) then a single projection is used.
  • Additionally to the previous case a rename-column operator is added after the projection if new output-names are given with as foo. e.g. select a as foo from ...

The allowed expressions are the same as for the projection. So it can be either the name of the column (with optional renaming using as) or a complex value expression. In the latter case a output-name must be given using as because the tool requires every column to have a name.

FROM

In its simplest form the from-clause holds a single table/relation name that is used directly in the relalg statement. If the optional table-alias is specified with table as foo this is reflected by wrapping the relation in a rename-relation operator with the given output-name.

select distinct x.a, x.b from R as x

Joins can be expressed using the ANSI join syntax

select distinct * from A, B inner join C on ... inner join D natural natural join E left outer join F on ... left outer join G natural right outer join H on ... right outer join I natural full outer join J on ... full outer join K natural where ...

The comma is part of the old join syntax and is translated into a cross join.

select distinct * from R, S as s, T where s.a = R.a

Instead of the name of relation a non-correlated substatement can be used. A table alias must be provided with (...) as foo.
A non-correlated substatement can be directly translated into relational algebra by just translate the sub-statement into relational algebra and use the resulting operator tree instead of the relation.

Non-correlated means that it must not reference/use any columns of tables defined in the outer scope.
This limitation is intentionally because the translation into relational algebra is not trivial and modern database systems use an extended set of operators internally that do not require a one-to-one translation into "classical" relational algebra. Therefore the learning effect for users of this tool would not be that big.

select distinct * from R, (select * from S where a > 0) as x where x.a = R.a

WHERE

The boolean condition in the where-clause can be any expression evaluating to boolean.

The where clause is directly translated to an relational algebra selection with the very same condition. This selection is applied after joining relations of the from-clause therefore has to use the original column names.

Subquery Expressions like EXISTS, IN, ANY/SOME or ALL are not supported because their translation into relational algebra is not trivial and modern database systems use an extended set of operators internally that do not require a one-to-one translation into "classical" relational algebra. Therefore the learning effect for users of this tool would not be that big.

GROUP BY

The GROUP-BY-clause takes a list of column names only argument.

The GROUP-BY-clause is directly translated to the relational algebra group-by operation and is executed directly after the selection built from the WHERE-clause and before the projection/renaming build from the SELECT-clause. Therefore the column names that can be used are the ones available after all joining all tables.

The aggregations used in the relational algebra group-by operation are taken from the SELECT-clause and an output-name must be given using as because the tool requires every column to have a name.

If no aggregations are present in the SELECT-clause a projection is used instead of the group-by operation because sigma without aggregation has the very same effect.

Every non-aggregation-column in the SELECT-clause must be present in the group by clause because the would not be available after the grouping.

HAVING

The HAVING-Clause represents an optional relational algebra selection. The boolean condition can be any expression evaluating to boolean.

The resulting selection is executed directly after the relational algebra group-by operation.

Unlike PostgreSQL the HAVING-clause is only allowed when either a aggregation or grouping is present.

ORDER BY

Order by takes a list of column names or indices of columns (starting with 1) as its argument.

It is directly translated to the extended relational algebra operation order by (tau).

LIMIT

The LIMIT-clause can be either specified with the LIMIT-OFFSET syntax used by PostgreSQL and MySQL or the FETCH-FIRST syntax introduced in SQL:2008.

It is translated into a relational algebra selection using the rownum()-function to limit the number of rows returned.

UNION / INTERSECT / EXCEPT

The Set-Operators UNION, INTERSECT and EXCEPT directly map to the relational algebra operators union, intersection and subtraction.

The keyword DISTINCT is optional because it represents the default behavior. The keyword ALL is ignored and a warning is shown because the targeted relational algebra has a implicit elimination duplicate rows.

Parentheses can be used to create more complex statements:
( select distinct * from S union select distinct * from T ) except select distinct * from T order by 1 limit 1

WITH

The WITH-clause (also known as common table expressions) provides a way to define subqueries for single or multiple use in a statement. This can be thought as defining a temporary table for that query in SQL terminology or creating variables in relational algebra. Recursive evaluation is not supported.

Each subquery can be referenced by the name from the WITH-clause. The subquery is automatically renamed to the name used in the WITH-clause.

Value expressions

Value expressions are used for boolean expressions for WHERE- and HAVING-clause, the boolean conditions of joins and calculated values in the SELECT-clause. The type of a expression is either string, number, date or boolean and is determined by the used operations and columns.

The supported functions and operations are the same for SQL and relational algebra: value expression