Notes

Dr. Deborah Whitfield

Chapter 11: Hash Tables
Data Structures and Algorithms in Java
CpSc 374

Ideal Hash Table

Stores unique data in an array (no duplicates)
where the key is associated with the index
by a hashing function

The mapping is unique, that is, a particular key value yields an index (i) and no other key will produce the same index (i)

A one-to-one function

Really ideal would allow no wasted space in the array
A one-to-one, onto function (bijective)

Really Ideal Example

Employee numbers

Assigned from 1 to numEmployees
No deletions - just flag the record
New employee added at end (numEmployees++)

Employee number is the index

h(x) = x

Insert, delete, search are all O(1)
Practical?

Wal-Mart has 2.2 million employees world-wide
If each record contains only an integer reference (index),
the array size is over 8 Mb

Ideal Example - Words
English Dictionary (a to zyzzyva)

Consider words to be numbers

Each letter represents a digit...

Convert to a base 27 number

Blank --> 0

a --> 1

z --> 26

Convert to decimal to get index
cats => 3*27³+ 1*27²+ 20*27¹ + 19*27⁰
--> 3 1 20 19₂₇= 60,337₁₀
But we have empty cells for neighbors:

...catp, catq, catr, catt, catu...

Four letter words implies

zzzz => 26*27³ + 26*27² + 26*27¹ + 26*27⁰ = 531,440 entries

32 Gb RAM => 235 = 3.5E10

The log base 27 of 3.4E10 is 7.4 => 7 letter words (max)

The longest word we can represent in 32GB is 7 letters long

Pseudopseudohypoparathyroidism (30 chars)

Longest non-coined word in a major dictionary (wikipedia)

Word Problems

Words/strings are commonly "hashed"
Compiler symbol tables
reduce the range of the mapping function

Allow mapping function to be many:1

apply an alternate strategy if a collision occurs

Goal: ensure ~O(1) for insert, delete & search

Dictionary Example

Assume 50,000 words
Make array twice this size: 100,000
Hashing function:

Calculate the unique location of a word: bigIndex
Return smallIndex = bigIndex % size_of_array
Multiple words now hash to same location
Need to handle collisions

extra space (50,000)
zyzzyva => 26*27⁶ + 25*27⁵ + 26*27⁴ + 26*27³ + 25*27² + 22*27¹ + 1*27⁰
bigIndex == 10,446,003,433
A.length == 100,000
smallIndex = 3,433

Collision Resolution Strategies

Separate chaining (https://www.cs.usfca.edu/~galles/visualization/OpenHash.html)

Create a linked list of elements for each location in the array

Java uses this closed addressing, "bucket" approach

Open addressing (https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html)

When a collision occurs a cell is occupied

Search for an empty cell starting at the hash location

Insert, delete and find operations all must use same strategy for searching

Open Adressing - Linear Probing

Insert Operation

class DataItem, class HashTable: Hash Code
Visualization

index = h(key)
If A[index] is occupied already

While (! available(A[++index])) /do nothing/ ;
Place item in location index
Note that if table is full => infinite loop, so check!
index = (index + 1) % A.length(); // circular array

Find Operation

index = h(key)
If A[index] is occupied & key does not match
While (! empty(A[++index])
1. If key matches return index
2. Else keep going
3. Circular increment
Didn't find it

Delete Operation (often not allowed)

index = h(key)
While (! empty(A[++index])
1. If key matches
  mark as deleted
  return item
2. Circular increment
return error (null)

Note that if table is full => infinite loop, so check!

Also note, deleted => now available for an insert

Clustering can ocur

A filled sequence of cells is called a cluster
Probe length =

number of steps taken from the hashed index
i.e., a count of the ++index operations needed
Ranges from 1 to N (is N if every key hashed to same index )

Clustering becomes worse as array fills up
More likely to hit a cluster, probes are longer
Severe degradation when load factor is more than 2/3
load factor = ratio of number of items to array size
When load factor get high enough...
Grow & rehash everything

Open Addressing - Quadratic Probing

Linear probing uses a step size of 1

h(key)+1, h(key)+2, h(key)+3, h(key)+4,...

Quadratic Probing uses square of the step

h(key)+1, h(key)+4, h(key)+9, h(key)+16,...

Visualization
Avoids primary clustering by trying to jump over
Quadratic Probing Issues:
1. may overflow the int variable (after ~46K steps in Java)
2. quadratic pattern may cause secondary clustering
3. Cells that hash to same location will all try the same sequence of steps
4. May go into infinite loop if size of array is not a prime number

Open Addressing - Double Hashing

Clustering occurs because

For primary, we use the same stepSize for all keys
For secondary, a set of keys that hash to the same index will then follow the same sequence of probes

Double hashing uses a constant step size > 0
Determined by a second (different) hash function
A well-behaved secondary hash function:

K == prime number < size of array (also a prime)
stepSize = K - (key % K) // use key, not h(key)

Visualization
Hash Double Code

Hash Table Problems

Based on arrays - not dynamic
Vectors don't solve the problem

Can't simply increase size and copy elements
Each element has to be re-hashed into the larger array (or vector)

Can't easily "traverse" the hash table in order
Severe performance degradation caused by collisions
high load factor

Select size of array so that...

It never becomes more than half-full

It doesn't have to be resized

Re-hashing everything takes a long time

Want...

Every cell is eventually visited by collision algorithm
Collision algorithm stops... no infinite loops
Smallest prime number >= 2 * size of data space

Hash Table w/ Separate Chaining

The hash table itself is an array of linked lists
Initially all lists are empty
When a key maps to a particular index

Insert the item into the linked list at that index
typically InsertInOrder()

Hash chain Code Visualization
Another View - add new collision record at either end of list

Chaining Considerations

Load factor may be > 1 (graceful degradation)

N or more items in an array of size N
Still want a prime number for array size
Let M be the average number of items in a list

Insert, find & delete are O(M) for a sorted list

Still go directly to cell that holds this list

Duplicates not a problem
Buckets: use a static array at each cell
Buckets may waste space or overflow

Hash Function Strategies

Use only "real" data, get rid of excess bits
Use all of the "real" data
tableSize is prime -- Otherwise keys that share a divisor with tableSize will tend to cluster
Lafore's hash function for words meets strategies. But will overflow for long strings
What is real data?

Hashing Strings

public static hashFunc3(String key) { 
   int hashVal = 0;
   for (int j=0; j<key.length(); j++) {
       int letter = key.charAt(j) - 96;  // get char code 
       hashVal = (hashVal * 27 + letter) % arraySize; 
       } // end for 
   return hashVal; 
 } // end hashFunc3()

WorkSheet