<aside>
💡 Design Decisions
#1: Hash Function
- How to map a large key space into a smaller domain.
- Trade-off between being fast vs. collision rate.
#2: Hashing Scheme
- How to handle key collisions after hashing.
- Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.
</aside>
1. Hash Functions
We do not want to use a cryptographic hash function for DBMS hash tables. We want something that is fast and ****has a low collision rate.
- CRC-64 (1975)
- Used in networking for error detection.
- MurmurHash (2008)
- Designed as a fast, general-purpose hash function. Google
- CityHash (2011)
- Designed to be faster for short keys (<64 bytes).
- Facebook XXHash (2012)
- From the creator of zstd compression.
- Google FarmHash (2014)
- Newer version of CityHash with better collision rates.
2. Static Hashing Schemes
2.1 Linear Probe Hashing
Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
- To determine whether an element is present, hash to a location in the index and scan for it.
- Must store the key in the index to know when to stop scanning.
- Insertions and deletions are generalizations of lookups.
<aside>
💡 Deletion
- Tombstone → Put a tombstone at the place where an entry was deleted → a waste of space
- Movement → Move other entries to fill up the space → quite complicated, need to be careful
</aside>
<aside>
💡 Non-Unique Keys
- Seperate Linked List → Store values in separate storage area for each key
- Redundant Keys → Store duplicate keys entries together in the hash table
</aside>