Lab 8

In today's lab, we will look at creating good hash functions. This very nice exercise is courtesy of Ashwin Thangali, another TF who has taught this class before.

A hash function takes a data object as input and returns an integer index to a hash table bin, which is computed as a function of the input object. The number of hash buckets is typically fewer than the number of data objects causing collisions, most bins will contain more than one data object. A good hash function is essential to distribute data uniformly among the bins to gaurantee efficient retrieval. A badly designed hash function tends to clump data into a few bins. Since a linear search among objects in a bin is used to find a match, the search performance is drastically affected.

Today we will look at hashing names of U.S. counties (e.g., "Washington County, Alabama"). Our first attempt at a hash function simply returns the ascii value of the first character in the county name, modulo the number of hash bins. If you download and, and run, it will launch a window displaying the distribution of county names among the bins. You can vary the number of bins by clicking at the top of the window. We can see that our hash function is not very good because the objects are not evenly distributed: some bins are not getting used at all, and among the bins that are getting used some bins have a lot more objects than others.

The default hash function is implemented in the hashFunction() method of Your task is to modify this method to yield a uniform (nearly flat) distribution of counties among the hash bins. Ensure that your function yields good results by varying the number of bins and looking at the resulting distributions.

When you are finished gsubmit your modification of in a directory called lab8. Also, please explain why the original hash function is not good. Hint: explain why we cannot hash to some bins and why the distribution is uneven among the other bins. Put your answer as comments in the hashFunction() method.