12 Advanced Linear Structures
Version 3
Handouts:
12.1 Bit Vectors
Another representation of sets. The goal is to save space. In the example, each bit represents a game feature (true if the game has the feature, false otherwise), and each game is represented by a vector of bits.
In addition to saving space, there are computational advantages:
- Set intersection is bitwise AND
- Set union is bitwise OR
- Set complement is bitwise NOT
- These are very fast operations, a single machine instruction on word-size operands (i.e., 32-bit operands on a 32-bit CPU, or 64 on 64).
Bitwise Operators
To keep the examples visually intelligible, we are using small numbers of bits. In Java, the operations can be applied to 32-bit integers (int type).
Bitwise Logical Operators
AND (&) 1100 OR (|): 1100 XOR (^): 1100 NEG (~): 10
1010 1010 1010 --
---- ---- ---- 01
1000 1110 0110
Shift Operators
The left shift (<<) shifts bits to the left, pushing some bits out of the word and bringing in zeroes from the right to fill the gap. The logical right shift (>>>) does exactly the opposite: it shifts bits to the right, pushing some bits out of the word and bringing in zeroes from the left to fill the gap.
Left shift: 11010111
<< 3
--------
(110 out ←) 10111000 (← 000 in)
Logical right shift: 11010111
>>> 3
--------
(000 in →) 00011010 (→ 111 out)
The arithmetic right shift (>>) shifts bits to the right, pushing some bits out of the word and bringing in zeroes or ones to fill the gap, depending on the leftmost bit in the number to be shifted. The leftmost bit is called the sign bit: if it is 1, the number is negative; if 0, the number is zero or positive. The arithmetic right shift brings in copies of the sign bit from the left. Since this preserves the sign of the number, it is called "arithmetic" right shift.
Arithmetic right shift, sign bit = 1: 11010101
>> 3
--------
(111 in →) 11111010 (→ 101 out)
Arithmetic right shift, sign bit = 0: 01010101
>> 3
--------
(000 in →) 00011010 (→ 101 out)
Observe that:
- If the number to be shifted is non-negative, then
>> and >>> give the same result. - If overflow does not occur, left shifting can be used to multiply by powers of two, and arithmetic right shifting to divide by powers of two. On some hardware, these are faster than the corresponding arithmetic operations, especially for division.
- Why? Discuss binary number representation if needed. … (including?) two's complement; also hexadecimal?
Notational Infelicity
Since the logical right shift is the counterpart of the left shift (<<), wouldn't it make better sense if the operator symbol for it were >>, and use >>> for the arithmetic right shift?
Applications
- Figures 12-5 through 12-9, pp. 328–333: game collection program.
Note the use of bit vector constants, e.g.,
DEXTERITY = 1 << 16
Note the use of OR (|) to combine these constants.
Example
- BitVector.java
- "Real programmers" tend to use hexadecimal notation for their bit vectors.
Bitsets
The class java.util.BitSet provides an extension of bit vectors to allow sizes larger than 64 bits. The methods correspond to the bitwise operators, but they have different names
- See Figure 12-10, p. 334.
- See Java API doc: class java.util.BitSet
- Example: BitSetDemo.java
- A BitSet merely records whether an integer key is in the set or not.
- To associate other information with this integer, we use an index (key → object); the object also contains its key so a reverse index is not needed.
Exercises: 12.1–12.7, page 334.
12.2 Sparse Arrays
The problem: wasted space for arrays with one value that is overwhelmingly frequent (typically zero, but could be anything).
See Figures 12.14 and 12.15, p. 337.
12.3 Contiguous Representation of Multidimensional Arrays
- How does Java normally store two-dimensional arrays? …
- The contiguous representation: project into one dimensional array by putting all elements of row 0 first, then all elements of row 1, and so on.
Example: the 3×4 array
{{A, B, C, D},
{E, F, G, H},
{I, J, K, L}}
is represented as
{A, B, C, D, E, F, G, H, I, J, K, L}
- Finding the element in row i, column j:
- There are i rows before it, with 4 elements each, so 4 elements from previous rows.
- There are j elements before it in its own row.
- So, the one-dimensional index is [4 i + j].
- Generalize to an arbitrary number of columns N.
- Can be generalized to higher than two dimensions, but we omit details.
- This layout is called row-major order, because it stores all elements of a row contiguously, then all elements of the next row, etc. It is used by C, Java, and many other programming languages.
An alternative, column-major order, is used by FORTRAN and a few other languages. It stores all elements of a column contiguously:
{A, E, I, B, F, J, C, G, K, D, H, L}
Evaluation of the contiguous array representation
Pro: Data are tightly packed, yielding better locality of reference and consequently better memory performance (cache, or virtual memory).
Con: It's a lot of work, and doesn't really save very much memory, so the cache benefits are probably quite small.
12.4 Searching and Sorting
a. Interpolation Search
- We're looking for a number y.
- The numbers y[x1], y[x1 + 1], …, y[x2] are stored in ascending order in an array.
- We think the y values are approximately uniformly distributed.
- I.e., for any four indices s, t, u, v, if t − s = v − u, i.e., the index pairs (s, t) and (u, v) delimit regions of equal size, then y[t] − y[s] = approx. y[v] − y[u], i.e., the corresponding ranges of y values in (s, t) and (u, v) are approximately equal.
We make an educated guess about where y is located, using linear interpolation search:
|
y2 + o
| o
| o
y + o
| o
y1 + o
|
|
+--------------+----+-----+----
x1 x? x2
For example, with x1 = 40, y1 = 2000, x2 = 100, y2 = 5000, and searching for y = 2500:
(y − y1) / (y2 − y1) = (2500 − 2000) / (5000 − 2000) = 1/6.
Since y is 1/6 of the way from y1 to y2, we guess that the corresponding x is about 1/6 of the way from x1 to x2:
xguess = 40 + (1/6) (100 − 40) = 40 + (1/6) 60 = 50
We therefore examine y[xguess] = y[50] and compare it to the target y. If they are equal, we have succeeded; if not, we reduce the x range as in binary search and try again.
- (Conceivably, other than linear interpolation could be used in some situations.)
Performance (search time):
- Best case: Θ(1). When does this occur?
- Average case: Θ(log(log N)). Just believe it. How good is this?
- Worst case: Θ(N). When does this occur?
Discussion: how is interpolation search similar to binary search? How is it different?
Implementation of Interpolation Search
See textbook code listings. File: Search.java
b. Lower Bound on Comparison Sorting
- What is comparison sorting? Sorting that only examines the data by comparing them to each other.
- All of these are comparison sorts: insertion, merge, and quick sort.
- For N elements there are N! possible orderings. If we compare two elements …. [need to fill in the details of the argument]
- Conclusion: the best we can do with comparison sorting is Θ(N log N).
- We already know that quicksort is Θ(N log N) in the average case, and merge sort is Θ(N log N) in the worst case; hence, these two algorithms are in the same order as the fastest possible comparison sorting algorithm.
c. Bucket Sort
- But then is comparison sorting the only way to sort?
- Bucket sort is not a comparison sort.
- As with interpolation search, this works best when the data to be sorted are uniformly distributed, i.e., any range contains approximately the same number of data values.
- In addition, we "must know" the lower and upper bounds of the data, datamin and datamax.
- How to do it:
- Create an array of k buckets representing equal-size intervals between datamin and datamax.
- Each bucket is a collection which can store multiple data values, and has a lower bound bucketmin and an upper bound bucketmax (different for each bucket).
- Choose k large enough so that the average number of values stored in a bucket, n/k, will be less than or equal to a small constant b.
- For each data value, store the value in the bucket to which it belongs (i.e., so that bucketmin ≤ value < bucketmax).
- Now all the data values have been stored in buckets. Traverse the array of buckets: for each bucket, sort the data in the bucket and output the result.
- Since the buckets on average will have only small collections of values, we can use any sorting technique.
- Insertion sort works well with small collections.
- Example: sort [0, 100, 50, 40, 75, 80, 21, 33, 67, 99] using 5 buckets.
- The minimum and maximum values are 0 and 100. The range of each bucket is therefore (100 − 0 ) / 5 = 20. So, the buckets have the intervals (bucketmin, bucketmax) = (0, 20), (20, 40), (40, 60), (60, 80), (80, 100).
- This means, for example, the first bucket (0, 20) will get all values ≤ 20, and in the second bucket (20, 40), all values > 20 and ≤ 40.
- Note that we will use a direct computation to find the bucket for any value, bucket(x) = (x - 0) / 20 restricted to the range [0, 4]. (In this way we can avoid a long
if statement with many branches.) - Do it: …
- Analysis:
- Average case, for a collection of N elements, assuming equally distributed data:
- There are N steps to store the N elements in their buckets.
- There are k buckets, each with about b = N/k values.
- Sorting time for each bucket is Θ(b2) with insertion sort. There are k buckets, with k < N (why?), so the time to sort all the buckets is Θ(b2 N).
- The overall time for bucket sort, then, is Θ(N + b2 N), and since b is a constant, this is Θ(N).
- Worst case: if all the values go into the same bucket, then the time to sort this bucket with insertion sort is Θ(N2). The time to insert into the buckets and sort the other buckets is insignificant, so the overall time is Θ(N2).
- Extensions: If we don't know the lower and upper bounds of the data, we can either:
- estimate them, and allow the first and last buckets to contain all values below and above the estimated bounds, respectively; or
- traverse the collection (time Θ(N)) to actually find the minimum and maximum values. Using (a) will distort the assumption that all buckets get about the same number of values, if the estimates are very bad, and so may lead to worst case performance.
- Discussion:
- How is bucket sort similar to interpolation search?
- How is it similar to a hashtable?
- How is it like quicksort?
Implementation of Bucket Sort
See textbook code listings.
Files: Search.java, SortableLinkedList.java.