Chapter 11: Hash Tables
Data Structures and Algorithms in Java
CpSc 374
Ideal Hash Table
- Stores unique data in an array (no duplicates)
- where the key is associated with the index
- by a hashing function
- The mapping is unique, that is, a particular key value yields an index (i) and no other key will produce the same index (i)
- A one-to-one function
- Really ideal would allow no wasted space in the array
- A one-to-one, onto function (bijective)
Really Ideal Example
- Employee numbers
- Assigned from 1 to numEmployees
- No deletions - just flag the record
- New employee added at end (numEmployees++)
- Employee number is the index
- Insert, delete, search are all O(1)
- Practical?
- Wal-Mart has 2.2 million employees world-wide
- If each record contains only an integer reference (index),
- the array size is over 8 Mb
Ideal Example - Words
English Dictionary (a to zyzzyva)
- Consider words to be numbers
- Each letter represents a digit...
- Convert to a base 27 number
- Blank --> 0
- a --> 1
- z --> 26
- Convert to decimal to get index
- cats => 3*273 + 1*272 + 20*271 + 19*270
- --> 3 1 20 1927 = 60,33710
- But we have empty cells for neighbors:
- ...catp, catq, catr, catt, catu...
- Four letter words implies
- zzzz => 26*273 + 26*272 + 26*271 + 26*270 = 531,440 entries
- 32 Gb RAM => 235 = 3.5E10
- The log base 27 of 3.4E10 is 7.4 => 7 letter words (max)
- The longest word we can represent in 32GB is 7 letters long
- Pseudopseudohypoparathyroidism (30 chars)
- Longest non-coined word in a major dictionary (wikipedia)
Word Problems
- Words/strings are commonly "hashed"
- Compiler symbol tables
- reduce the range of the mapping function
- Allow mapping function to be many:1
- apply an alternate strategy if a collision occurs
- Goal: ensure ~O(1) for insert, delete & search
Dictionary Example
- Assume 50,000 words
- Make array twice this size: 100,000
- Hashing function:
- Calculate the unique location of a word: bigIndex
- Return smallIndex = bigIndex % size_of_array
- Multiple words now hash to same location
- Need to handle collisions
- extra space (50,000)
- zyzzyva => 26*276 + 25*275 + 26*274 + 26*273 + 25*272 + 22*271 + 1*270
- bigIndex == 10,446,003,433
- A.length == 100,000
- smallIndex = 3,433
Collision Resolution Strategies
- Separate chaining (https://www.cs.usfca.edu/~galles/visualization/OpenHash.html)
- Create a linked list of elements for each location in the array
- Java uses this closed addressing, "bucket" approach
- Open addressing (https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html)
- When a collision occurs a cell is occupied
- Search for an empty cell starting at the hash location
- Insert, delete and find operations all must use same strategy for searching
Open Adressing - Linear Probing
Insert Operation
- index = h(key)
- If A[index] is occupied already
// find the next availabe spot
- While (! available(A[++index])) /do nothing/ ;
- Place item in location index
- Note that if table is full => infinite loop, so check!
- index = (index + 1) % A.length(); // circular array
Find Operation
- index = h(key)
- If A[index] is occupied & key does not match
- While (! empty(A[++index])
- If key matches return index
- Else keep going
- Circular increment
- Didn't find it
Delete Operation (often not allowed)
- index = h(key)
- While (! empty(A[++index])
- If key matches
- mark as deleted
- return item
- Circular increment
- return error (null)
Note that if table is full => infinite loop, so check!
Also note, deleted => now available for an insert
Clustering can ocur
- A filled sequence of cells is called a cluster
- Probe length =
- number of steps taken from the hashed index
- i.e., a count of the ++index operations needed
- Ranges from 1 to N (is N if every key hashed to same index )
- Clustering becomes worse as array fills up
- More likely to hit a cluster, probes are longer
- Severe degradation when load factor is more than 2/3
- load factor = ratio of number of items to array size
- When load factor get high enough...
- Grow & rehash everything
Open Addressing - Quadratic Probing
- Linear probing uses a step size of 1
- h(key)+1, h(key)+2, h(key)+3, h(key)+4,...
- Quadratic Probing uses square of the step
- h(key)+1, h(key)+4, h(key)+9, h(key)+16,...
- Visualization
- Avoids primary clustering by trying to jump over
- Quadratic Probing Issues:
- may overflow the int variable (after ~46K steps in Java)
- quadratic pattern may cause secondary clustering
- Cells that hash to same location will all try the same sequence of steps
- May go into infinite loop if size of array is not a prime number
Open Addressing - Double Hashing
- Clustering occurs because
- For primary, we use the same stepSize for all keys
- For secondary, a set of keys that hash to the same index will then follow the same sequence of probes
- Double hashing uses a constant step size > 0
- Determined by a second (different) hash function
- A well-behaved secondary hash function:
- K == prime number < size of array (also a prime)
- stepSize = K - (key % K) // use key, not h(key)
- Visualization
-
Hash Double Code
Hash Table Problems
- Based on arrays - not dynamic
- Vectors don't solve the problem
- Can't simply increase size and copy elements
- Each element has to be re-hashed into the larger array (or vector)
- Can't easily "traverse" the hash table in order
- Severe performance degradation caused by collisions
- high load factor
- Select size of array so that...
- It never becomes more than half-full
- It doesn't have to be resized
- Re-hashing everything takes a long time
- Want...
- Every cell is eventually visited by collision algorithm
- Collision algorithm stops... no infinite loops
- Smallest prime number >= 2 * size of data space
Hash Table w/ Separate Chaining
- The hash table itself is an array of linked lists
- Initially all lists are empty
- When a key maps to a particular index
- Insert the item into the linked list at that index
- typically InsertInOrder()
-
- Hash chain Code
Visualization
- Another View - add new collision record at either end of list
-
Chaining Considerations
- Load factor may be > 1 (graceful degradation)
- N or more items in an array of size N
- Still want a prime number for array size
- Let M be the average number of items in a list
- Insert, find & delete are O(M) for a sorted list
- Still go directly to cell that holds this list
- Duplicates not a problem
- Buckets: use a static array at each cell
- Buckets may waste space or overflow
Hash Function Strategies
- Use only "real" data, get rid of excess bits
- Use all of the "real" data
- tableSize is prime --
Otherwise keys that share a divisor with tableSize will tend to cluster
- Lafore's hash function for words meets strategies.
But will overflow for long strings
- What is real data?
Hashing Strings
public static hashFunc3(String key) {
int hashVal = 0;
for (int j=0; j<key.length(); j++) {
int letter = key.charAt(j) - 96; // get char code
hashVal = (hashVal * 27 + letter) % arraySize;
} // end for
return hashVal;
} // end hashFunc3()
WorkSheet