A: BinarySearchTree.hs, TreeSet.hs, TreeSetTest.hs
How big is it?
$ wc /usr/share/dict/words
What does it look like?
$ less /usr/share/dict/words
How do these operations work?
S = [1, 3, 7, 21, 24, 27, 32]
S.contains(7), S.contains(23), S.add(18), S.remove(3), S.remove(23), S.size()
Let S be the following binary tree:
How do these operations work, in the given S (always restarting with the original S)?
S.contains(7), S.contains(23), S.add(18), S.remove(7), S.remove(24), S.remove(27), S.remove(21)
Removing 21 could be done two ways: we can replace 21 with either (a) its inorder predecessor, 7, or (b) its inorder successor, 24. Any other element would violate the order property of the binary search tree. The predecessor is the rightmost descendant in the left subtree. The successor is the leftmost descendant in the right subtree.
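To make the two-child case concrete, here is a minimal Java sketch of removal (the Node and remove names are just for this example, not the course's BinarySearchTree.hs). It uses option (b), copying up the inorder successor; the predecessor version is symmetric.

```java
// Minimal BST removal sketch (illustrative names, not the course implementation).
class Node {
    int value;
    Node left, right;
    Node(int value) { this.value = value; }
}

class Bst {
    // Remove value from the subtree rooted at root; return the new subtree root.
    static Node remove(Node root, int value) {
        if (root == null) { return null; }                    // value not present
        if (value < root.value) {
            root.left = remove(root.left, value);
        } else if (value > root.value) {
            root.right = remove(root.right, value);
        } else if (root.left == null) {                        // zero or one child
            return root.right;
        } else if (root.right == null) {
            return root.left;
        } else {                                               // two children: use the inorder successor
            Node successor = root.right;
            while (successor.left != null) {                   // leftmost descendant of the right subtree
                successor = successor.left;
            }
            root.value = successor.value;                      // copy the successor up
            root.right = remove(root.right, successor.value);  // then delete it from the right subtree
        }
        return root;
    }
}
```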
If the binary search tree is poorly balanced (a linear binary tree or something similar), these operations take O(n) time rather than O(log n).
Ord: similar to Java's Comparable; also allows /= (not equal to).
_ for matching anything (in general, any variable name beginning with _ plays this role).
An instance declaration, instead of an automatically derived Show instance, further supports information hiding.
subset (i.e., is A a subset of B?) and equality (an instance for Eq with a definition of ==; sets A and B are equal if each is a subset of the other).
To use hashing, each element must be converted to an integer; in Java, every object has a hashCode() method that does this. If a.equals(b), then we should have a.hashCode() == b.hashCode(). (Of course the converse is not true: two elements can have the same hash code but be different.)
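A small, made-up Java class illustrating that contract: equals() and hashCode() are computed from the same fields, so equal objects necessarily get equal hash codes.

```java
import java.util.Objects;

// Hypothetical class illustrating the equals/hashCode contract.
final class Person {
    private final String name;
    private final int age;

    Person(String name, int age) { this.name = name; this.age = age; }

    @Override public boolean equals(Object other) {
        if (!(other instanceof Person)) { return false; }
        Person p = (Person) other;
        return age == p.age && name.equals(p.name);
    }

    // Derived from the same fields equals() compares, so
    // a.equals(b) implies a.hashCode() == b.hashCode().
    @Override public int hashCode() {
        return Objects.hash(name, age);
    }
}
```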
Ideally, the hash code and hash function assign each element of the set to a different array index. But in reality, some elements are assigned to the same "slot". These are called "collisions."
Two principal methods of dealing with (resolving) collisions: chaining and open addressing.
Linear probing (a form of open addressing): suppose we want to insert k, and the hash function h gives h(k) = i. If slot i is unoccupied, we store k right there. Otherwise, we try slots i + 1, i + 2, etc., stopping when we find an unoccupied slot, and store k there. If we reach the end of the array we wrap around to the beginning. If we come back to our starting point, the array is full.
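A rough sketch of that loop, assuming a plain Object array in which null marks an unoccupied slot (illustrative names, not Drake's implementation):

```java
// Linear-probing insertion sketch (illustrative, not Drake's code).
class LinearProbingSet {
    private Object[] slots = new Object[8];

    private int indexFor(Object k) {
        return Math.abs(k.hashCode() % slots.length);     // h(k) = i
    }

    void add(Object k) {
        int i = indexFor(k);
        for (int tries = 0; tries < slots.length; tries++) {
            if (slots[i] == null || slots[i].equals(k)) {
                slots[i] = k;                              // unoccupied (or already present): store k here
                return;
            }
            i = (i + 1) % slots.length;                    // try i + 1, i + 2, ..., wrapping around
        }
        throw new IllegalStateException("full");           // this is where rehashing (below) would kick in
    }
}
```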
We can deal with a full array3 by rehashing: "stretching" the array, as we did for ArrayList, and applying the hash function again (or a new one) to insert all the elements into the larger array.
The major problem with linear probing is clustering: groups of adjacent slots tend to fill up. This causes long searches just to find if an element is in the set, long searches for insertion (to find an empty slot), and long searches to find an element to be deleted.
A secondary complication of linear probing (in fact of any kind of open addressing) is that it makes deleting an element more complicated. Assuming "null" is the contents of an empty slot, when we delete an element, we cannot simply store "null" in the slot it occupied — because it might not be the last of the elements with the same hash value.
Assume h(A) = h(B) = h(C) = 3. Insert A, B, C: with linear probing, A goes into slot 3, B into slot 4, and C into slot 5.
If we now remove B by storing null into slot 4, what happens if we want to find C? ....
The solution is to store a special "deleted" value in the slot where items have been deleted, which is different from null and also different from every possible element value. Then when we are searching for C and come to a "deleted" slot, we know we have to keep looking.
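Continuing the linear-probing sketch above, search and removal might look like this (DELETED is just a private marker object; note that add would also have to treat a DELETED slot as available for reuse):

```java
// Methods to add to the LinearProbingSet sketch above.
private static final Object DELETED = new Object();    // distinct from null and from every element

boolean contains(Object k) {
    int i = indexFor(k);
    for (int tries = 0; tries < slots.length; tries++) {
        if (slots[i] == null) { return false; }        // truly empty: k cannot be further along
        if (slots[i] != DELETED && slots[i].equals(k)) { return true; }
        i = (i + 1) % slots.length;                    // occupied or DELETED: keep looking
    }
    return false;
}

void remove(Object k) {
    int i = indexFor(k);
    for (int tries = 0; tries < slots.length; tries++) {
        if (slots[i] == null) { return; }              // not present
        if (slots[i] != DELETED && slots[i].equals(k)) {
            slots[i] = DELETED;                        // tombstone, not null
            return;
        }
        i = (i + 1) % slots.length;
    }
}
```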
Double hashing: not the same as rehashing! Avoid clustering by using a second hash function to determine the "increment" for address probing (in linear probing, the increment is always 1).
Let h1 and h2 be the two hash functions. If h1(k) = i and h2(k) = j, then (as usual) we first try slot i. If it is occupied, then instead of trying slots i + 1, i + 2, ..., we try slots i + j, i + 2j, .... We still wrap around from end to beginning of array if needed.
We need to choose our hash functions carefully to avoid early failures (premature cycles). For example, if there are 100 buckets and h2(k) = 50, then we will try slots h1(k), (h1(k) + 50) % 100, (h1(k) + 100) % 100 = h1(k). We are quickly back where we started!
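A tiny demo of the probe sequence i, i + j, i + 2j, ... (mod m), showing the premature cycle with 100 buckets and h2(k) = 50:

```java
// Print the first few slots probed by double hashing: i, i + j, i + 2j, ... (mod m).
class ProbeDemo {
    static void printProbes(int i, int j, int m, int count) {
        for (int t = 0; t < count; t++) {
            System.out.print((i + t * j) % m + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        printProbes(7, 50, 100, 6);   // 7 57 7 57 7 57   -- premature cycle: only 2 distinct slots
        printProbes(7, 3, 100, 6);    // 7 10 13 16 19 22 -- no early repeat
    }
}
```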
Drake's implementation avoids this problem by (1) making the array size always be a power of two, and (2) making the second hash function always return an odd number (how is that done?).
How would this work with 1024 slots and h2(k) = 511? .... This looks good, but can we be sure it's always good? .... Relatively prime ....
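A hedged sketch of how those two choices fit together (not Drake's actual code): one common way to force an odd result is to set the low bit with | 1, and because an odd increment is relatively prime to a power-of-two table size, the probe sequence visits every slot before repeating.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// With table size m a power of two and an odd increment j, gcd(j, m) = 1,
// so the probe sequence i, i + j, i + 2j, ... (mod m) hits every slot.
class OddStepDemo {
    static int oddHash(Object k) {
        return k.hashCode() | 1;              // setting the low bit makes the result odd
    }

    public static void main(String[] args) {
        int m = 16;                           // power of two
        int i = 5;                            // h1(k), arbitrary for the demo
        int j = oddHash("example") & (m - 1); // reduce mod m; the low bit is kept, so j stays odd
        Set<Integer> visited = new LinkedHashSet<>();
        for (int t = 0; t < m; t++) {
            visited.add((i + t * j) % m);
        }
        System.out.println(visited.size() + " distinct slots out of " + m);  // 16 distinct slots out of 16
    }
}
```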
Rehashing (when we run out of space in the array) works hand in hand with this strategy: we start with array size 1, and then every stretch doubles its size, so the sizes are 1, 2, 4, ... — always powers of two.
Implementation details: Figures 11-36 ... 11-39, pp. 312-315:
My weird perspective: chaining and open addressing are two varieties of the same thing! In both there is a "chain" or sequence of locations in which to search for (or insert, or delete) a given element. With chaining, these are off to the side and occupy additional space, including space for pointers to link the nodes. With open addressing, the sequence is embedded in the array and the pointers are not stored. See the diagrams:
Maps are sets of keys with associated values (also called dictionaries or associative arrays).4 For example, a name-age map would enable us to find a person's age, given the person's name. We tell it the key (the name) and ask it to look up the value (age).
The java.util package also provides the interfaces Map and SortedMap, and the classes HashMap and TreeMap. (Note: HashSet implements Set; TreeSet implements SortedSet. Similarly, HashMap implements Map, and TreeMap implements SortedMap.)
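A quick usage example of the name-age idea with java.util.HashMap (the entries are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

// Name-age map: the key is the name, the value is the age.
class NameAgeDemo {
    public static void main(String[] args) {
        Map<String, Integer> age = new HashMap<>();
        age.put("George", 12);
        age.put("Ada", 36);
        System.out.println(age.get("George"));   // 12
        age.put("George", 10);                   // no duplicate keys: this replaces the old value
        System.out.println(age.get("George"));   // 10
        System.out.println(age.size());          // 2
    }
}
```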
Figure 11-42, p. 319, summarizes the running times of data structures covered in this chapter, as well as a few to be covered later.
We could relax either the requirement that all elements in the left subtree are < root, changing < to ≤, or the requirement that all elements in the right subtree are > root, changing the > to ≥ — but not both! — allowing the tree to contain elements equal to the root. But then it would not be a set. ↩
Actually we should deal with it long before it becomes completely full. ↩
It is not quite right to say maps are sets of (key, value) pairs. A set of (key, value) pairs could have no two elements (k1, v1), (k2, v2), such that k1 = k2 and v1 = v2. The condition for being a map is a little stronger: we can have no two elements (k1, v1), (k2, v2) such that k1 = k2. That is, there can be no duplicate keys, regardless of whether they have the same value. If ("George", 12) is in the name-age map, then not only can there not be another ("George", 12) in there, but there cannot be ("George", 10) either. ↩