[Introduction to Algorithm] Hash Table

Two steps in Hash Table

Compute the hash function
Fix the collision resolution (algorithms and data to handle two keys that hash to the same index)

Hash functions

What is a good hash function?

We scramble the keys uniformly.

Efficiently computable
Each table position equally likely for each key

We expect to be independent of any patterns that might exist in the data

The division method

The hash function can be

\[ h(k)={k}\mod{m} \]

A prime not too close to an exact power of 2 is a good choice for $m$.

The multiplication method

The hash function can be

\[ h( k) =\lfloor m( kA\bmod 1)\rfloor \]

while

\[ kA\bmod 1=kA-\lfloor kA\rfloor \]

$m$ is any value of m, typically to be a power of 2:

\[ m = 2^p \]

According to Knuth,

\[ A\approx \frac{\sqrt{5} -1}{2} \]

Java hashCode()

public int hashCode(String s) {
    int hash = 0;
    for (int i = 0; i < s.length; i++) 
        hash = s[i] + (31 * hash);
    return hash;
}

Matematically, the hash function is

\[ h(x)=\sum_{i=0}^{L=x.length}31^{L-1-i}\cdot x[i] \]

Collision Resolution

Chaining solution

Analysis

Given a hash table $T$ with $m$ slots that stores $n$ elements.

load factor: $\alpha=n/m$ (average number of elements stored in a chain)

For $j=0,1,...,m-1$, denote the length of the list $T[j]$ by $n_j$, so that the expected time to find $n_j$ is $E[n_j]=\alpha=n/m$.

Theorems

In a hash table (chaining), an unsuccessful search takes average-case time $\Theta(1+\alpha)$.
In a hash table (chaining), a successful search takes average-case time $\Theta(1+\alpha)$

Proof:

Let $x_i$ denote the $i$th element inserted into the table, for $i=1,2,...n$, and let $k_i=x_i.key$.

For $k_i$ and $k_j$, we define the indicator random variable $X_{ij}=I\{ h(k_i)=h(k_j)\}$.

Then, we have $Pr\{ h(k_i)=h(k_j)\}=1/m$, and $E[X_{ij}=1/m]$.

Thus, the expected number of elements examined in a successful search is:

\[ \begin{equation} \begin{aligned} &E\left[\frac{1}{n}\sum ^{n}_{i=1}\left( 1+\sum ^{n}_{j=i+1} E[ X_{ij}]\right)\right] \newline &=\frac{1}{n}\sum ^{n}_{i=1}\left( 1+\sum ^{n}_{j=i+1}\frac{1}{m}\right) \newline &=1+\frac{1}{nm}\sum ^{n}_{i=1}( n-i)\newline &=1+\frac{1}{nm}\left(\sum ^{n}_{i=1} n-\sum ^{n}_{i=1} i\right)\newline &=1+\frac{1}{nm}\left( n^{2} -\frac{n( n+1)}{2}\right)\newline &=1+\frac{n-1}{2m}\newline &=1+\frac{\alpha }{2} -\frac{\alpha }{2n} \end{aligned} \end{equation} \]

\[ \therefore \Theta(2+\alpha/2-\alpha/2n)=\Theta(1+\alpha) \]

In average: $O(1)$

Open Addressing

Use an array of size $M >> N$ (typically $M=2N$)

Hashing: map key to integer i between 0 to M-1

Probe sequence

for every key k, the probe sequence: $<h(k,0),h(k,1),...,h(k,m-1)>$

Linear probing

\[ h(k, i) = (h^\prime (k)+i)\mod{m} \]

Insert: put in slot $i$ if free; otherwise try $i+1$, $i+2,$ etc.
Search: search slot $i$; if occupied but not match, try $ i+1$, $i+2$, etc.

primary clustering

This means that long runs of occupied slots build up, increasing the average search time. Clusters arise and make the occupied slots get longer and longer.

Quadratic probing

\[ h(k,i)=(h^\prime (k)+c_1i+c_2i^2)\mod m \]

Drawback

The property leads to a milder form of clustering, called secondary clustering.

Double hashing

\[ h(k,i)=(h_1(k)+ih_2(k))\mod m \]

while

\[ h_1(k)=k\mod m \]

\[ h_2(k)=p-(k\mod p) \]

($p$ is a prime similar to $m$)

Analysis

Given an open-address hash table with load factor $\alpha=n/m < 1$, the expected number of probes in an unsuccessful search is at most $1/(1-\alpha)$.
Inserting an element into an open-address hash table with load factor $\alpha$ requires at most $1/(1-\alpha)$ probes on average.

Proof:

probability first probe successful: $\frac{m-n}{m}$
if first probe fails, probability of second probe successful: $\frac{m-n-1}{m-1}$
if 1st &2nd probe fails, probability of second probe successful: $\frac{m-n-2}{m-2}$
...

Therefore, the total probability of unsuccessful search for $i$ is: \[ \begin{equation} \begin{aligned} Pr\{X\} &=\frac{n}{m} \cdotp \frac{n-1}{m-1} ...\cdotp \frac{n-i+2}{m-i+2}\\ &\leqslant \left(\frac{n}{m}\right)^{i}\\ &=\alpha ^{i} \end{aligned} \end{equation} \]

\[ \begin{equation} \begin{aligned} \therefore E[ X] &=\sum ^{m-1}_{i=0} Pr\{X\}\\ &\leqslant \sum ^{m-1}_{i=0} \alpha ^{i}\\ &=\frac{1}{1-\alpha } \end{aligned} \end{equation} \]

Given an open-address hash table with load factor $\alpha < 1$, the expected number of probes in a successful search is at most $\frac{1}{\alpha}\ln{\frac{1}{1-\alpha}}$.

Proof: the expected number of probes in a search for k is at most $1/(1-i/m) = m/(m-i)$.

Therefore,

\[ \begin{equation} \begin{aligned} \frac{1}{n}\sum ^{n-1}_{i=0}\frac{m}{m-i} \ &=\frac{m}{n}\sum ^{n-1}_{i=0}\frac{1}{m-i}\\ &=\frac{1}{\alpha }\sum ^{m}_{k=m-n+1}\frac{1}{k}\\ &\leqslant \frac{1}{\alpha }\int ^{m}_{m-n}( 1-x) dx\\ &=\frac{1}{\alpha }\ln\frac{m}{m-n}\\ &=\frac{1}{\alpha }\ln\frac{1}{1-\alpha } \end{aligned} \end{equation} \]

algorithm