Thursday, January 03, 2008

Notes On Hashing

  • Technique that promises near-ϑ(1) data lookup, compared with ϑ(n) for linked lists or ϑ(log n) for binary search trees.
  • Uses a hash function - a means of interpreting some object numerically (with an integer usually) to determine its position in some table (hashtable).
  • There's no order (e.g. sorting) to items stored in the hashtable - it all seems random and is the reason this process is sometimes called randomizing.
  • Implementations prone to inefficient storage usage and high rates of collisions when used slots are far-flung and many entries end up with the same hash code, respectively.
  • There's no such thing as a perfect hashing algorithm - there will always be some collisions and space will never always be utilized efficiently enough - a good trade-off is needed.
  • Prime numbers are king when calculating hashes.
  • The JDK's String class has the simplest and yet effective hashing algorithm I have ever seen!
  • When a hashing table fills up, you need to resize it dynamically.
  • Two techniques exist that help reduce the need to resize hashtables: (1) probing, and (2) bucketing.
  • Linear probing = if a collision occurs, look for the next available slot (including wrapping to the head of the table if necessary) to place the object in. At some point with more and more objects being added, performance degrades to linear/brute-force ϑ(n).
  • Bucketing = maintain a collection at each hashtable slot so that each hashcode can have more than one object.
  • One way to determine when to resize a hashtable is to keep track of a load factor = the ratio of stored values to available buckets (for bucketing).
  • A load factor of about 75% is a fairly good trade-off between space and time, and works well for most implementations.
  • Bucketing is the most efficient of the two techniques. Nor surprisingly, the JDK version implements bucketing with a default load factor if 0.75.
  • Modern hashing functions are based on serious mathematical analysis (beyond me). There are several websites I came across with some good hashing algorithms for everything from cryptography to applications in geometry: Arash Partow's site and Hash Algorithm Directory are just two of the many out there.
  • HowStuffWorks.com discusses hashing algorithms in their article about encryption.
  • Most everyday programmers don't deal with hashing (designing and implementing algorithms) beyond using data structures available in most APIs. I've never had to do anything like that beyond using the Hashtable and HashMap classes provided in the JDK. Always preferable to use any of the good GUID generators currently available (which probably use some sort of hashing internally). The advantage is that they guarantee you will have a unique identifier for every object in the proper context.