Last updated: August 28, 2025
In the last note, we briefly touched on the relational data model and
introduced some basic SQL syntax in the SELECT FROM WHERE format. In
this note, we’ll revisit the relational data model more formally with relational algebra,
which provides the foundation for many data systems, such as SQL.
Set Relational Algebra¶
An algebra is a mathematical theory that defines formulaic operations on variables of specific domains, thereby determining mathematical properties in those domains. For example, elementary algebra defines operations on arithmetic variables, linear algebra defines operations on vectors and matrices, and set algebra defines operations on sets. Relational algebra defines operations on relations.
The relational algebra theory was popularized by Edgar F. Codd’s 1970 work [1] in defining the relational model, which formed the theoretical basis for relational databases and informed the concurrent development of relational database management systems (rDBMSes).
We introduce relational algebra in the context of set relational algebra (set RA). In this setting, a relation has a set of attribute names (unordered, no duplicates) and its instances are sets of tuples (unordered, no duplicates). At the conclusion of this chapter, we extend our discussion to bag RA and why operations on bags are generally preferred in rDBMSes.
Primitive RA Operations¶
In classical set relational algebra, there are six primitive operators from which all other operators can be derived:
Projection
Selection
Renaming
Product
Union
Difference
These operations take as operands one or more input relations and produce an output relation.
Unary Operators: Projection, Selection, Renaming¶
We first discuss the three primitive unary operators, where each operator takes as input one operand relation with schema and outputs an output relation . The output relation can be unnamed.
Projection, .
The projection operator outputs a relation that contains the tuples of the input relation , but restricted to set of attributes . In other words, the output relation’s attributes is a subset of the input relation’s attributes , where .
Selection, .
The selection operator outputs a relation that contains the tuples of input relation that satisfy a predicate . The predicate is defined to contain attributes with boolean expressions: , , , , AND (), OR (), NOT (), etc. Because selection is a row filter, the input and output schema share the same schema, with set of attributes .
Renaming, , equivalently .
The renaming operator outputs a relation with the same data as the input relation, but with renamed attributes and/or relation name. The number of attributes, therefore stays consistent across the input and output relation, and the data in tuples does not change. We provide two notations above. In the former, we denote the names of the full set of attributes in the output relation, including the unchanged attribute names. In the latter, we specify the attributes to rename, where input relation attribute is renamed to attribute , for , and . In both notations above, the output relation name is optional.
In terms of a SFW SQL query, the selection operator maps to WHERE,
the projection operator maps to SELECT, and the renaming operator maps
to AS.
Examples¶
Suppose that we have two relations with the following schema:
outputs a relation with tuples of
titlesthat are restricted to the attributestitle_idandprimary_title. In other words, we “drop” the other attributestypeandruntime_minutes.outputs a relation with tuples of
peoplethat satisfy the predicateborn > 1980.outputs a relation named
personsthat has the tuples ofpeoplebut renames attributesbornanddiedtobirthanddeath, respectively, and keepsperson_idandnameunchanged.does the same as the previous example but is more concise.
Binary Operators: Cartesian Product, Union, and Difference¶
We next discuss the remaining three primitive operators, which are all binary operators. Each binary operator takes as input two operand relations and with schemas and , respectively, and outputs an unnamed output relation .
Product (or Cross Product, Cartesian Product), .
The product operator outputs a relation with tuples that associates each tuple in the left operand with each tuple in the right operand. The output schema is therefore . Because a relation’s attributes are defined on a set, if for attributes and , then the output relation renames attributes and to and , respectively. If and have and rows, respectively, then the output relation has rows.
Union, .
The union operator outputs a relation that has the set union of rows in and . In other words, contains one of every tuple that is in either and/or . For the union to be defined, the two input relations must have the same schema, i.e., and there exists some distinct mapping for . The output relation therefore also shares the schema .
Difference, .
The difference operator outputs a relation that has the set difference of rows in and . In other words, contains one of every tuple that is in but not in . If no tuples are unique to , the result is the empty relation. As with unions, the input relations and output relation must share the same schema.
Derived Operations¶
Other operations can be expressed as derivations of the operations we’ve listed above.
Intersection, : Defined as the set intersection of rows in and . This is a composition of union and difference; we leave the derivation as an exercise for you.
Joins¶
Several joins can be expressed as a special case of Cartesian product and the three unary operators. We assume the same definition of input relations as above.
Theta join, .
The theta join outputs a relation that joins two relations such that each row satisfies the condition . It is defined as . Because it is defined with a cross product, the output schema distinguishes common attributes in and in as and , respectively.
Equi join.
The equi join is a special case of the theta join, where is an “equality condition,” i.e., contains only boolean expressions involving equality () and logical AND ().
Natural join, .
The natural join outputs a relation that joins the two input relations and such that rows are matched on common attributes and removes duplicate attributes in the output relation schema:
Cross product of and .
Selection with equality condition , where filters out rows for which shared attributes do not have matching values. In other words, for each attribute pair where in and in have the same attribute name (“common attributes”), only keep the tuple if its values for and match.
Rename one set of common attributes back to their original name, e.g., is renamed to for each .
Drop the other set of common attributes, e.g., drop .
Here, it is easier to illustrate the natural join with an example. Suppose we have the two relations:
The natural join of crew and people would then satisfy:
.
Special cases¶
If and share identical schemas,
If and have no attributes in common,
Bag Relational Algebra¶
So far, we have defined relational algebra using set semantics: each relation is a set of tuples (no duplicates). In practice, however, many data systems (including SQL and pandas) use bag semantics, where both attributes and tuples are unordered collections that could contain duplicates.
Working with bags is often easier and faster than working with sets, since the system does not need to check for and remove duplicate tuples. Further more, bags are meaningful depending on the data domain; in other words, it may be meaningful for a relation to contain multiple copies of the same tuple.
The below table is a brief comparison of bag RA and set RA operators. We define “# of occurrences” as the sum total of duplicate tuples in the output relation.
| Operator | Bag operation | Performance compared to Set RA | Comments |
|---|---|---|---|
| Selection | Preserve # of occurrences | Roughly as fast | |
| Projection | Preserve # of occurrences | Faster than set | SQL SELECT is a common operation |
| Product | Preserve # of occurrences | Roughly as fast | |
| Union | Add # of occurrences | Faster than sets | |
| Difference | Subtract # of occurrences | Roughly as fast | |
| Renaming | - | - | Unchanged |
Helpful Resource: Visualizing Relational Algebra¶
Understanding relational algebra and SQL becomes much easier when you can visualize how each operator transforms its input relations. Here’s a helpful resource.