Some people drink a lot of coffee (like me), some people drink tea, and some people don't drink any caffeinated beverages at all.
If you've never thought about _why_ this is, you are in for a treat!
Changes in the DNA of your genome can make you taller, give
you a rosier disposition, or make you sneeze in bright sunlight. And
changes in some parts of your genome influence your propensity for coffee consumption.
There are dozens of coffee
consumption-associated genes and you might have some, none, or
all of them!
```python
# Python-flavored pseudocode
my_coffee_consumption = 0
for snp in my_genome:
    if snp in coffee_related_snps:
        my_coffee_consumption += 1
    else:
        pass  # on coffee
```
There are 100,000+ genetic variants (single-nucleotide polymorphisms; SNPs) in the SNPedia database, each of which has a SNP ID and genomic coordinates (where it lives in the genome). If we want to process these key/value pairs (i.e. SNPs & coordinates), how can we store them efficiently?
Linked Lists: Good old linked lists. We could of course use these, but they are slow: search and arbitrary insertion/deletion are O(n).
Binary Search Trees: Good old BSTs are better, with insert/delete/search in O(log n).
We know that array indexing is O(1), so we could store values in an array, if only we could figure out how to transform arbitrary keys (like SNP IDs) into integer values appropriately scaled to the size of our array.
It turns out this is _exactly_ what Hash Tables are designed to do! They combine the magic of hash functions with the efficiency of array access to create a practical and efficient data structure.
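In fact, Python's built-in `dict` is a hash table under the hood, so the end goal looks something like this (the SNP IDs and coordinates below are made up for illustration):

```python
# Python's dict is a hash table under the hood.
# Hypothetical SNP ID -> (chromosome, position); values are illustrative.
snp_coordinates = {
    "rs0000001": ("chr7", 17284577),
    "rs0000002": ("chr15", 75041917),
}

# Average O(1) lookup by key: no scanning through all 100,000+ SNPs
print(snp_coordinates["rs0000001"])  # ('chr7', 17284577)
```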
The primary components of a hash table are a hash function and a bucket array. In the simplest terms, a hash function maps arbitrary input data to an integer value (a 'hash code').
Determinism: A good hash function must be deterministic, meaning it consistently produces the same hash code for a given input. This ensures that you get the same result every time you hash a particular key, which is essential for accurate data retrieval.
Efficiency: The hash function should be computationally efficient, meaning it quickly computes the hash code for any given key. An efficient hash function is important for overall hash table performance.
Uniform distribution: Uniform distribution refers to the even spread of hash codes across the entire range of possible values. A good hash function should distribute keys uniformly to minimize clustering and reduce the likelihood of collisions.
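To make these properties concrete, here is a minimal sketch of a string hash function (a polynomial-style hash; the multiplier 31 is a conventional but illustrative choice, not a canonical implementation):

```python
def simple_hash(key: str, num_buckets: int) -> int:
    """Map an arbitrary string key to a bucket index.

    Deterministic: the same key always yields the same index.
    Efficient: a single pass over the characters, O(len(key)).
    The multiplier 31 helps spread hash codes across the range
    (an illustrative choice, not an optimal one).
    """
    h = 0
    for char in key:
        h = h * 31 + ord(char)
    return h % num_buckets  # scale the hash code to the size of our array

# Deterministic: these two calls always agree
assert simple_hash("rs0000001", 64) == simple_hash("rs0000001", 64)
```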
What happens when hashes do (inevitably) collide? We want to keep our O(1) average insert/search/delete without throwing away data.
A simple way to deal with collisions, known as separate chaining, is to give each bucket a secondary container that stores every key which maps to that bucket's index. Search then becomes a two-step process: first hash the key to find its bucket, then search within the bucket's container.
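Here's a minimal separate-chaining sketch, using a plain Python list as each bucket's secondary container (the class and method names are my own, not from any particular library):

```python
class ChainedHashTable:
    """Toy hash table using separate chaining: one list per bucket."""

    def __init__(self, num_buckets: int = 64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Step 1: hash the key to pick a bucket (scaled by modulo)
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        # Step 2: scan the (ideally short) chain inside that bucket
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```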
An alternative is to store all items directly in the bucket array, with no need for a secondary data structure. This is referred to as an 'open addressing' scheme, because the location of a given item is not necessarily determined by its hash value alone.
Linear probing is an open addressing method that performs insertions (for
example) by hashing the key and then 'probing' in a linear fashion
for an available bucket if the target bucket is occupied. Search and delete
operate the same way.
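Here's a minimal linear-probing sketch along those lines (insert and search only; delete needs 'tombstone' markers to avoid breaking probe chains, which I'm leaving out for brevity; all names are my own):

```python
class LinearProbingTable:
    """Toy open-addressing table: items live directly in the bucket array."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, num_buckets: int = 64):
        self.slots = [self._EMPTY] * num_buckets

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        # Probe linearly (wrapping around) until we find the key or an
        # open slot. Assumes the table is never completely full.
        while self.slots[i] is not self._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def search(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self._EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)
```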
Performance of hash table operations scales with how 'full' the table is. We can tell how full it is by calculating the _load factor_ (λ):
λ = n/N
Where n is the total number of items, and N is the size of the bucket array.
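For example, a table holding n = 50 items in a bucket array of size N = 100 has λ = 50/100 = 0.5. How much load we can tolerate depends on the collision-handling strategy: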
Separate Chaining: The average number of keys per bucket (λ) must be small, or else we revert to O(n) list-based search within each bucket. In practice λ <= 1 provides acceptable performance.
Linear Probing: In this case λ must always be < 1. When λ grows beyond ~0.5, clustering of keys in the bucket array begins to reduce the efficiency of the probing strategy.
What to do when λ exceeds a tolerable threshold?
We...
make...
it...
bigger!
If we double the size of the bucket array every time we cross the chosen λ threshold, we can 'spread out' the costs associated with periodic rehashing (i.e. we maintain O(1) amortized run time).
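As a rough sketch of what that could look like, building on the `ChainedHashTable` toy from earlier (the 0.75 threshold is an illustrative choice, not canonical):

```python
class ResizingHashTable(ChainedHashTable):
    """Extends the separate-chaining sketch with doubling resizes."""

    LOAD_FACTOR_LIMIT = 0.75  # illustrative threshold

    def insert(self, key, value):
        super().insert(key, value)
        # Recomputed here for clarity; a real table would track the
        # item count incrementally to keep this O(1).
        n = sum(len(bucket) for bucket in self.buckets)
        if n / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._resize()

    def _resize(self):
        old_buckets = self.buckets
        # Double the bucket array and re-insert every item: each key's
        # home index depends on len(self.buckets), so positions change.
        self.buckets = [[] for _ in range(2 * len(old_buckets))]
        for bucket in old_buckets:
            for key, value in bucket:
                super().insert(key, value)
```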
Here is your obligatory xkcd comic:
NB: The joke here is that hash functions are
lossy but deterministic.