1. Joins
1.1 Operator Output
Multiple approaches to the contents of the join operator output:
- Data → Early Materialization
- Advantage: future operators in the query plan never need to go back to the base tables to get more data
- Disadvantage: requires more memory to materialize the entire tuple
- Record Ids → Late Materialization
- Ideal for column stores
- Problematic when data are store in distributed system
1.2 Cost Analysis
Only I/Os from computing the join are considered. Because the output for any algorithm will be the same, so the output cost will not change among different algorithms.
Variables used in this lecture:
- M pages in table R (Outer Table), m tuples total
- N pages in table S (Inner Table), n tuples total
At a high-level, this type of join algorithm is comprised of two nested for loops that iterate over the tuples in both tables and compares each unique of them. If the tuples match the join predicate, then output them.
- Outer and Inner
- Outer Table → the table in the outer for loop
- Inner Table → the table in the inner for loop
- The DBMS will always want to use the “smaller” table as the outer table.
- Smaller can be in terms of the number of tuples or number of pages.
- The DBMS will also want to buffer as much of the outer table in memory as possible.
- It can also try to leverage an index to find matches in inner table.
2.1 Simple Nested Loop Join
For each tuple in the outer table, compare it with each tuple in the inner table.
This is the worst case scenario.