Just like it cannot assume that a table fits entirely in memory, a disk-oriented DBMS cannot assume that query results fit in memory.

1. External Merge Sort

The standard algorithm for sorting data which is too large to ﬁt in memory is external merge sort. It is a divide-and-conquer sorting algorithm that splits the data set into separate runs and then sorts them individually. It can spill runs to disk as needed then read them back in one at a time. The algorithm is comprised of two phases:

<aside> 💡 Phase #1 – Sorting: First, the algorithm sorts small chunks of data that ﬁt in main memory, and then writes the sorted pages back to disk.

Phase #2 – Merge: Then, the algorithm combines the sorted sub-ﬁles into a larger single ﬁle.

</aside>

<aside> 💡 Sorted Run

Run → list of key/value pairs.

Key: The attribute(s) to compare to compute the sort order.
Value: Two choices
- Tuple (early materialization).
- Record ID (late materialization). </aside>

1.1 2-Way External Merge Sort

Data is broken up into N pages.

The DBMS has a finite number of B buffer pool pages to hold input and output data.

Number of passes: $1 + \lceil \log_2{N} \rceil$
Total I/O cost: $2N \cdot (\text{\# of passes})$

Only requires three buffer pool pages to perform the sorting (B=3).

→ Two input pages, one output page

Untitled

<aside> 💡 Double Buffering Optimization

Prefetch the next run in the background and store it in a second buffer while the system is processing the current run.

Reduces the wait time for I/O requests at each step by continuously utilizing the disk.

</aside>

1.2 General External Merge Sort

Pass #0
- Use B buffer pages
- Produce ⌈ N / B ⌉ sorted runs of size B
Pase #1,2,3…
- Merge B-1 runs (i.e., K-way merge)
Number of passes: $1 + \lceil \log_{B-1}{\lceil N / B \rceil} \rceil$
Total I/O cost: $2N \cdot (\text{\# of passes})$

1.3 Using B+Trees for Sorting

Clustered

Unclustered

2. Aggregations

An aggregation operator in a query plan collapses the values of one or more tuples into a single scalar value