1. Plan Cost Estimation


<aside> 💡 Statistics

The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog.

Different systems update them at different times.

</aside>

For each relation R, the DBMS maintains the following information:

Selection cardinality $SC(A,R)$ → the average number of records with a value for an attribute $A$ given $N_R / V(A,R)$

The selectivity (sel) of a predicate P is the fraction of tuples that qualify.

<aside> 💡 Assumptions are often not satisfied by real data.

Assumption #1: Uniform Data

Assumption #2: Independent Predicates

Assumption #3: Inclusion Principle

1.1 Histogram

Real data is often skewed and is tricky to make assumptions about. However, storing every single value of a data set is expensive.

One way to reduce the amount of memory used by storing data in a histogram to group together values.

Untitled

Approach #1: Equi-Width Histogram

Combines together the counts for adjacent keys to reduce the storage overhead.

Untitled

Approach #2: Equi-Depth Histogram (Quantiles)

To ensure that each bucket has roughly the same number of counts, the histogram varies the range of each bucket

Untitled

1.2 Sampling

DBMS’s can use sampling to apply predicates to a smaller copy of the table with a similar distribution.

The DBMS updates the sample whenever the amount of changes to the underlying table exceeds some threshold

Untitled

2. Plan Enumeration