11 Sets

Version 6.31

Handouts:

A: BinarySearchTree.hs, TreeSet.hs, TreeSetTest.hs

B: None

Part A: Sets, Lists, and Binary Search Trees

The Concept of Sets

11.1 The Set Interface

The Game of Anagrams

11.2 Ordered Lists as Sets

11.3 Binary Search Trees

Binary Search Trees: A Functional Implementation

Haskell code:

Part B: Hash Tables, Java Collections Framework

11.4 Hash Tables

Collisions

Linear Probing

Suppose we want to insert k, and the hash function h gives h(k) = i. If slot i is unoccupied, we store k right there. Otherwise, we try slots i + 1, i + 2, etc., stop when we find an unoccupied slot, and store k there. If we reach the end of the array we wrap around to the beginning. If we come back to our starting point, the array is full.

We can deal with a full array3 by rehashing: "stretching" the array, as we did for ArrayList, and applying the hash function again (or a new one) to insert all the elements into the larger array.

The major problem with linear probing is clustering: groups of adjacent slots tend to fill up. This causes long searches just to find if an element is in the set, long searches for insertion (to find an empty slot), and long searches to find an element to be deleted.

A secondary complication of linear probing (in fact of any kind of open addressing) is that it makes deleting an element more complicated. Assuming "null" is the contents of an empty slot, when we delete an element, we cannot simply store "null" in the slot it occupied — because it might not be the last of the elements with the same hash value.

Example:

Assume h(A) = h(B) = h(C) = 3. Insert A, B, C:

If we now remove B by storing null into slot 4, what happens if we want to find C? ....

The solution is to store a special "deleted" value in the slot where items have been deleted, which is different from null and also different from every possible element value. Then when we are searching for C and come to a "deleted" slot, we know we have to keep looking.

Double Hashing

Not the same as rehashing! Avoid clustering by using a second hash function to determine the "increment" for address probing (in linear probing, the increment is always 1).

Let h1 and h2 be the two hash functions. If h1(k) = i and h2(k) = j, then (as usual) we first try slot i. If it is occupied, then instead of trying slots i + 1, i + 2, ..., we try slots i + j, i + 2j, .... We still wrap around from end to beginning of array if needed.

We need to choose our hash functions carefully to avoid early failures (premature cycles). For example, if there are 100 buckets and h2(k) = 50, then we will try slots h1(k), (h1(k) + 50) % 100, (h1(k) + 100) % 100 = h1(k). We are quickly back where we started!

Drake's implementation avoids this problem by (1) making the array size always be a power of two, and (2) making the second hash function always return an odd number (how is that done?).

How would this work with 1024 slots and h2(k) = 511? .... This looks good, but can we be sure it's always good? .... Relatively prime ....

Rehashing (when we run out of space in the array) works hand in hand with this strategy: we start with array size 1, and then every stretch doubles its size, so the sizes are 1, 2, 4, ... — always powers of two.

My weird perspective: chaining and open addressing are two varieties of the same thing! In both there is a "chain" or sequence of locations in which to search for (or insert, or delete) a given element. With chaining, these are off to the side and occupy additional space, including space for pointers to link the nodes. With open addressing, the sequence is embedded in the array and the pointers are not stored. See the diagrams:

11.5 The Java Collections Framework Again

Footnotes


  1. Revision history:
    • Version 6.3, 2010 Nov 3. Fixed subscripts (h1, h2) and a few other markdown syntax problems; minor changes of punctuation and phrasing.
    • Version 6.2, 2010 Nov 1. Converted to markdown, replaced two ASCII diagrams with structview diagrams, replaced functional Java binary search tree with Haskell implementation. This could be improved by showing how to handle exceptions gracefully, instead of with the error function.
    • Version 6.1, 2009 Nov 5. Correction of inorder successor in the binary tree S, clarification of the term map.
    • Version 6, 2009 Nov. 4. Added hash table diagrams.
    • Version 5, 2009 Nov. 2. Converted to reStructured Text. Still missing some diagrams.
    • Version 4, 2008 Oct 25. Still missing some diagrams.
    • Version 3, 2007 Oct 29. 99% complete, still missing some diagrams.
    • Version 2, 2006 Nov 1.
    • Version 1, 2006 Oct 25.
  2. We could relax either the requirement that all elements in the left subtree are < root, changing < to ≤, or the requirement that all elements in the right subtree are > root, changing the > to ≥ — but not both! — allowing the tree to contain elements equal to the root. But then it would not be a set.

  3. Actually we should deal with it long before it becomes completely full.

  4. It is not quite right to say maps are sets of (key, value) pairs. A set of (key, value) pairs could have no two elements (k1, v1), (k2, v2), such that k1 = k2 and v1 = v2. The condition for being a map is a little stronger: we can have no two elements (k1, v1), (k2, v2) such that k1 = k2. That is, there can be no duplicate keys, regardless of whether they have the same value. If ("George", 12) is in the name-age map, then not only can there not be another ("George", 12) in there, but there cannot be ("George", 10) either.