Some people drink a lot of coffee (like me), some people drink tea, and some people don't drink any caffeinated beverages at all.
If you've never thought about _why_ this is, you are in for a treat!
Changes in the DNA of your genome can make you taller, give
you a rosier disposition, or make you sneeze in bright sunlight. And
changes in some parts of your genome influence your propensity for coffee consumption.
There are dozens of coffee
consumption-associated genes and you might have some, none, or
all of them!
```python
# Python-flavored pseudocode
my_coffee_consumption = 0
for snp in my_genome:
    if snp in coffee_related_snps:
        my_coffee_consumption += 1
    else:
        pass  # on coffee
```
There are 100,000+ genetic variants (single-nucleotide polymorphisms; SNPs) in the SNPedia database, each of which has a SNP ID and genomic coordinates (where it lives in the genome). If we want to process these key/value pairs (i.e. SNPs & coordinates), how can we store them efficiently?
Linked Lists: Good old linked lists. We could of course use these, but they are slow: search and arbitrary insertion/deletion are O(n).
Binary Search Trees: Good old BSTs are better, with insert/delete/search in O(log n).
We know that array indexing is O(1), so we could store values in an array, if only we could figure out how to transform arbitrary keys (like SNP IDs) into integer values appropriately scaled to the size of our array.
It turns out this is _exactly_ what Hash Tables are designed to do! They combine the magic of hash functions with the efficiency of array access to create a practical and efficient data structure.
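In fact, Python's built-in `dict` is a hash table under the hood, so the end goal looks something like this (the SNP IDs and coordinates below are made up for illustration):

```python
# Python's dict is a hash table under the hood.
# Hypothetical SNP ID -> (chromosome, position); values are illustrative.
snp_coordinates = {
    "rs0000001": ("chr7", 17284577),
    "rs0000002": ("chr15", 75041917),
}

# Average O(1) lookup by key: no scanning through all 100,000+ SNPs
print(snp_coordinates["rs0000001"])  # ('chr7', 17284577)
```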
The primary components of a hash table are a hash function and a bucket array. In the simplest terms, a hash function maps arbitrary input data to an integer value (a 'hash code').
Determinism: A good hash function must be deterministic, meaning it consistently produces the same hash code for a given input. This ensures that you get the same result every time you hash a particular key, which is essential for accurate data retrieval.
Efficiency: The hash function should be computationally efficient, meaning it quickly computes the hash code for any given key. An efficient hash function is important for overall hash table performance.
Uniform distribution: Uniform distribution refers to the even spread of hash codes across the entire range of possible values. A good hash function should distribute keys uniformly to minimize clustering and reduce the likelihood of collisions.
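To make these properties concrete, here is a minimal sketch of a string hash function (a polynomial-style hash; the multiplier 31 is a conventional but illustrative choice, not a canonical implementation):

```python
def simple_hash(key: str, num_buckets: int) -> int:
    """Map an arbitrary string key to a bucket index.

    Deterministic: the same key always yields the same index.
    Efficient: a single pass over the characters, O(len(key)).
    The multiplier 31 helps spread hash codes across the range
    (an illustrative choice, not an optimal one).
    """
    h = 0
    for char in key:
        h = h * 31 + ord(char)
    return h % num_buckets  # scale the hash code to the size of our array

# Deterministic: these two calls always agree
assert simple_hash("rs0000001", 64) == simple_hash("rs0000001", 64)
```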
What happens when hashes do (inevitably) collide? We want to keep our O(1) average insert/search/delete without throwing away data.
A simple way to deal with collisions, known as separate chaining, is to give each bucket a secondary container that stores every key which maps to that bucket's index. Search then becomes a two-step process: first hash the key to find its bucket, then search within the bucket's container.
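Here's a minimal separate-chaining sketch, using a plain Python list as each bucket's secondary container (the class and method names are my own, not from any particular library):

```python
class ChainedHashTable:
    """Toy hash table using separate chaining: one list per bucket."""

    def __init__(self, num_buckets: int = 64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Step 1: hash the key to pick a bucket (scaled by modulo)
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        # Step 2: scan the (ideally short) chain inside that bucket
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```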
An alternative is to store all items directly in the bucket array, with no need for a secondary data structure. This is referred to as an 'open addressing' scheme, because the location of a given item is not necessarily determined by its hash value alone.
Linear probing is an open addressing method that performs insertions (for
example) by hashing the key and then 'probing' in a linear fashion
for an available bucket if the target bucket is occupied. Search and delete
operate the same way.
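Here's a minimal linear-probing sketch along those lines (insert and search only; delete needs 'tombstone' markers to avoid breaking probe chains, which I'm leaving out for brevity; all names are my own):

```python
class LinearProbingTable:
    """Toy open-addressing table: items live directly in the bucket array."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, num_buckets: int = 64):
        self.slots = [self._EMPTY] * num_buckets

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        # Probe linearly (wrapping around) until we find the key or an
        # open slot. Assumes the table is never completely full.
        while self.slots[i] is not self._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def search(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self._EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)
```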
Performance of hash table operations scales with how 'full' the table is. We can tell how full it is by calculating the _load factor_ (λ):
λ = n/N
Where n is the total number of items, and N is the size of the bucket array.
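For example, a table holding n = 50 items in a bucket array of size N = 100 has λ = 50/100 = 0.5. How much load we can tolerate depends on the collision-handling strategy: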
Separate Chaining: The average number of keys per bucket (λ) must be small, or else we revert to O(n) list-based search within each bucket. In practice λ <= 1 provides acceptable performance.
Linear Probing: In this case λ must always be < 1. When λ grows beyond ~0.5, clustering of keys in the bucket array begins to reduce the efficiency of the probing strategy.
What to do when λ exceeds a tolerable threshold?
We...
make...
it...
bigger!
If we double the size of the bucket array every time we cross the chosen λ threshold, we can 'spread out' the costs associated with periodic rehashing (i.e. we maintain O(1) amortized run time).
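As a rough sketch of what that could look like, building on the `ChainedHashTable` toy from earlier (the 0.75 threshold is an illustrative choice, not canonical):

```python
class ResizingHashTable(ChainedHashTable):
    """Extends the separate-chaining sketch with doubling resizes."""

    LOAD_FACTOR_LIMIT = 0.75  # illustrative threshold

    def insert(self, key, value):
        super().insert(key, value)
        # Recomputed here for clarity; a real table would track the
        # item count incrementally to keep this O(1).
        n = sum(len(bucket) for bucket in self.buckets)
        if n / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._resize()

    def _resize(self):
        old_buckets = self.buckets
        # Double the bucket array and re-insert every item: each key's
        # home index depends on len(self.buckets), so positions change.
        self.buckets = [[] for _ in range(2 * len(old_buckets))]
        for bucket in old_buckets:
            for key, value in bucket:
                super().insert(key, value)
```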
Here is your obligatory xkcd comic:
NB: The joke here is that hash functions are
lossy but deterministic.