14A Heaps and Tries

Version 6.11

14.1 Heaps

Definition: a min heap is a binary tree (ordinarily not a binary search tree!) satisfying:

  1. The order property: each node is less than or equal to its children.
  2. The shape property: the tree is complete ("perfect"), except for a possible chunk of rightmost nodes taken out of the bottom level.

    Shape property

    Shape property

    (Recall: perfect means all leaves are on the same level, and all non-leaves have two children.)

We will use the word heap by itself to mean a min heap.

A max heap is like a min heap, except that each node is greater than or equal to its children.

The order property implies that the root is the minimum element, in a min heap; or that it is the maximum element, in a max heap.

The shape property implies that the height of a heap of N nodes is Θ(log N).

(Defer the representation issues, pp. 370-371).

Heap Applications

A. Priority Queues

A priority queue is a sequence of objects ordered by "priority". (Drake takes the priority to be just the object itself.)

Examples:

  1. Preferential treatment of VIPs. King of Denmark.
  2. Musical note event sequences (MIDI, with events arriving on multiple channels).
  3. Simulation: pending events ordered by time.

The operations on a priority queue are:

Heaps can be implemented so that both of these operations run in time Θ(log N).

B. Heap Sort

Heap sort is a comparison sort with worst-case running time Θ(N log N)

Given a list of data to sort, create an empty heap. Insert the list elements one by one into the heap. Then, begin removing and outputting the minimum elements (roots) of the heap; continue doing so until the heap becomes empty.

Question: why is heap sort's running time Θ(N log N)?

Implementing Heaps

Array Representation of Heaps

Because of its shape, a heap can efficiently be represented as an array (or array list). Number the nodes in the order of level order traversal:

Heap with nodes numbered in level order

Heap with nodes numbered in level order

The numbers in parentheses are the node numbers. Then, store each node in an array, using the node number as the array index:

Heap nodes in an array

Heap nodes in an array

We also need to keep track of the heap's size, as in an ArrayList. In fact, we can just use an ArrayList.

To operate on this data structure as a heap, we need to be able to find the parent and children of any node. The following formulas relate the node number (array index) of a parent and its children: the node number is I, its parent is numbered P, the left and right children are nodes numbered L and R.

In both L(I) and R(I), the condition is that the result is within the heap's size; otherwise, the node has no such child.

Heap Sort Revisited

By clever coding, we can implement heap of an array (array list) using no extra space by embedding the heap in the array. The heap is the sorted part of the array, which is initially empty:

Heap sort with a single array

Heap sort with a single array

Removing the Minimum Element

To removeMin(), we can almost just remove the root. The time for that would be Θ(1). But, then we no longer have a heap—nor even a tree!

To rebuild the heap, we move the rightmost element of the bottom level to root position. This restores the shape, but may violate the order property.

To restore order, we (call in the National Guard and), let node = root, compare node with its children, and if node is greater than either child, swap node with its smaller child. Repeat, with the node in its new position, until node finds its proper place (i.e., node <= each child).

Example:

Before removeMin

Before removeMin

removeMin step 1

removeMin step 1

removeMin step 2

removeMin step 2

removeMin step 3

removeMin step 3

We now have the right shape and the right order.

Questions:

  1. How long does removeMin() take, including the operations to restore the heap?
  2. What would happen if, when node is larger than both of its children, we swapped the node with its larger (instead of smaller) child?

Inserting a Prioritized Item

Example:

Before insertPriority

Before insertPriority

insertPriority step 1

insertPriority step 1

The new node initially goes in at the leftmost unoccupied position on the bottom level, as required by the shape property (creating a new level, if needed).

But this may violate the order property. To restore order, we compare the new node with its parent, and, if node < parent, swap them. Repeat with the node in its new position until it finds its place (node ≥ parent).

Example, continued:

insertPriority step 2

insertPriority step 2

insertPriority step 3

insertPriority step 3

Question: How long does insert take, including the time to restore the heap?

Exercises on page 377

14.2 Disjoint Set Clusters

Two sets are disjoint if they have no elements in common. For example, every regular student in an undergraduate college must be either a freshman, sophomore, junior, or senior. So the four classes are disjoint sets.

We will defer consideration of this topic until chapter 15 on graphs.

14.3 Tries ("Digital Search Trees")

Drake calls these "digital search trees",2 but the more common term is trie. Edward Fredkin, inventor of tries, took this name from the word "retrieve" and pronounces it like "tree", but most people say it like "try". Tries are also called prefix trees, which is to me a more descriptive name.

Tries are useful when the keys are sequences, such as strings. Note that comparing strings of length M, N has worst-case time Θ(min(M, N)), whereas comparing two numbers (ints) is Θ(1)). In the Ghost game, used because we need to find words beginning with a prefix (i.e., complete a partially spelled word) and recognize when a word is complete. Can also use in Scrabble for similar reasons.

Main idea: keys are represented as paths through the tree instead of being stored within a node.

A dictionary of words is a good example:

Dictionary of words in a trie

Dictionary of words in a trie

A black node marks the end of a word, but there can be another word that continues from that node.

Algorithms

Insert

Start at the root. For each item in the key sequence, take the branch corresponding to that item, creating a new branch if needed. After the last item, mark the node (X).

Lookup a key

Start at the root. For each item in the key sequence, take the branch corresponding to that item, but if the branch does not exist, immediately fail. On reaching the end of the key sequence, if there's a mark there, then succeed; otherwise fail.

Lookup all completions of a key

To find all completions of a prefix, follow the lookup procedure to the end of the prefix, and then return the keys formed from all paths continuing through that point.

From Trie Sets to Trie Maps

Note that these ideas could easily be extended to a map by associating the end-of-key marker with more data (instead of just true or false).

Implementing Tries

Drake uses a map at each node to associate key items (such as letters) with children (branches). Alternatively, We could use an array, if the number of key items is small, but this might require more space.

Performance

Assuming a good hash table for the branch map in each node, lookup and insert are both Θ(L), where L is the maximum length of a search key. Why?

Suffix trees are like prefix trees, but the keys are stored in reverse order, making it easy to answer questions like what words end in "-lay"?

Thoughts

What if the keys are not sequences, but can be serialized with a maximum length? For example, 32-bit integers can be serialized into 32 bits. Wouldn't this allow us to use a trie with Θ(L) = Θ(1) time for insert and lookup time?

Footnotes


  1. Revision log:
    • Version 6.1, 2010 Nov 16. Typographical corrections (Θ); added figure showing a trie.
    • Version 6, 2010 Nov 15. Converted to markdown, added graphics, added thoughts on tries.
    • Version 5, 2009 Nov 16. Converted to RST.
    • Version 4, 2008 Nov 4. Need to insert and re-number the figures.
    • Version 3, 2008 Nov 3. Split file into 14a heaps and tries and 14b red-black trees. Added text formatting. Need to add figures to the web page.
    • Version 2, 2006 Nov 14. Complete except for deletion from red-black trees.
  2. Why "digital"? Isn't every computer search tree digital, since computers are digital machines? Does he mean some analogy to the human hand, with the palm branching out to the fingers? But isn't every tree like that? Or is it because of Fredkin's interest in "digital physics" and "digital philosophy"?