1. Plan Cost Estimation

<aside> 💡 Statistics

The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog.

Different systems update them at different times.

</aside>

For each relation R, the DBMS maintains the following information:

Selection cardinality $SC(A,R)$ → the average number of records with a value for an attribute $A$ given $N_R / V(A,R)$

assumes data uniformity where every value has the same frequency as all other values.

The selectivity (sel) of a predicate P is the fraction of tuples that qualify.

<aside> 💡 Assumptions are often not satisﬁed by real data.

Assumption #1: Uniform Data

Assumption #2: Independent Predicates

Assumption #3: Inclusion Principle

The domain of join keys overlap such that each key in the inner relation will also exist in the outer table. </aside>

1.1 Histogram

Real data is often skewed and is tricky to make assumptions about. However, storing every single value of a data set is expensive.

One way to reduce the amount of memory used by storing data in a histogram to group together values.

Untitled

Approach #1: Equi-Width Histogram

Combines together the counts for adjacent keys to reduce the storage overhead.

Untitled

Approach #2: Equi-Depth Histogram (Quantiles)

To ensure that each bucket has roughly the same number of counts, the histogram varies the range of each bucket

Untitled

DBMS’s can use sampling to apply predicates to a smaller copy of the table with a similar distribution.

The DBMS updates the sample whenever the amount of changes to the underlying table exceeds some threshold

Untitled