On Thu, 2 Dec 2010, Patricia Shanahan wrote:
> On 12/2/2010 1:09 AM, bugbear wrote:
>> Tom Anderson wrote:
>>
>>> Firstly, you may be interested in this paper:
>>>
>>> http://www.siam.org/proceedings/alen...07sandersp.pdf
>>>
>>> Which says that although Baeza-Yates is very clever, it actually isn't
>>> that fast in practice. The algorithm they give which handily beats it is
>>> very badly explained (they fall into horrible computer science notation
>>> to explain what is very much a programmer's algorithm)
>>
>> I'm not I understand your notion of a conflict between "computer science"
>> and "programmer".
>
> Maybe the distinction Tom is making is between algorithms that are
> primarily of interest to computer scientists studying issues such as the
> minimum computational complexity for a specific problem, and those
> algorithms that are useful in practical programs.
Something along those lines. It's more that the algorithm they describe is
not very intellectually interesting - it boils down to iterating over a
list of numbers stored in a two-level array using the obvious pair of
nested loops. There's no deep insight, clever tricks, exploitation of
mathematical truths, relation to a more general space of problems, etc.
Just a couple of loops that actually run pretty fast. It's the kind of
algorithm an experienced programmer who knew a bit about the memory
hierarchy could write without any background in computer science (i
consider memory hierarchy to be electronic engineering!).
FWIW, the data structure and algorithm in the paper is basically:
// assume all values in the lists are are positive or zero
// i will not use any bit compression, just store full-size ints
// this misses some of the point of the algorithm, but it's fiddly, and
// doesn't add anything to the explanation
// utility functions
final int LOW_BITS = 14; // or whatever
int lowBits(int i) {
return i & ((1 << LOW_BITS) - 1); // extract the bottom bits
}
int highBits(int i) {
return i >> LOW_BITS;
}
// i am using array chopping and changing for simplicity
// you probably wouldn't want to implement it like this
// both these functions are horrific
int[] subarray(int[] array, int start, int end) {
int[] subarray = new int[end - start];
System.arraycopy(array, start, subarray, 0, subarray.length);
return subarray;
}
int[] ensureCapacity(int[] array, int size) {
if (array.length >= size) return array;
int[] newArray = new int[size];
System.arraycopy(array, 0, newArray, 0, array.length);
return newArray;
}
// code to build the structure
int[] buildDeltaChain(int[] sortedNumbers) {
int[] chain = new int[sortedNumbers.length];
int prev = 0;
for (int i = 0; i < sortedNumbers.length; ++i) {
int cur = lowBits(sortedNumbers[i]);
n[i] = cur - prev;
prev = cur;
}
}
int[][] buildTwoLevelTable(int[] sortedNumbers) {
int start = 0;
int[][] deltaChains = new int[0][]; // 50% chance this is the wrong syntax, sorry
while (start < sortedNumbers.length) {
int highBits = highBits(sortedNumbers[start]);
int end = start + 1;
while (highBits(sortedNumbers[end]) == highBits) ++end;
int[] deltaChain = buildDeltaChain(subarray(sortedNumbers, start, end));
deltaChains = ensureCapacity(deltaChains, (highBits + 1));
deltaChains[highBits] = deltaChain;
start = end;
}
}
// code to intersect another sorted list with the structure
final int REMOVED = -1; // we will rub out anything not in the structure by overwriting it with this
public void intersect(int[] sortedNumbers) {
int currentChainHighBits = -1;
int[] currentChain = null;
int currentChainIndex = -1; // always points to the next element to look at
int currentChainValue = -1;
for (int i = 0; i < sortedNumbers.length; ++i) {
int n = sortedNumbers[i];
int highBits = highBits(n);
if (highBits != currentChainHighBits) {
// move to the head of a new chain
currentChainHighBits = highBits;
currentChain = deltaChains[highBits];
currentChainIndex = 0;
currentChainValue = currentChain[0];
}
int lowBits = lowBits(n);
while (currentChainValue < lowBits) {
++currentChainIndex;
if (currentChainIndex >= currentChain.length) {
currentChainValue = Integer.MAX_VALUE; // this will make following queries on this chain finish immediately
break;
}
currentChainValue = currentChainValue + currentChain[currentChainIndex];
}
if (currentChainValue != n) sortedNumbers[i] = REMOVED;
}
}
This is off the top of my head without looking at the paper again, so
there could well be syntax errors, bugs, and other deviances.
Thinking about it, you probably want to store all the delta chains in one
big array, as one big delta chain, and then have the upper level of the
structure be an index on it: for each value of highBits, it would store
the lowBits value and index in the chain of the first such element.
Anyway, my point is that if you look through the above code, there's
nothing terribly clever - just some fairly workmanlike looping over
arrays, with some delta encoding thrown in. It's a good and effective
algorithm, but it doesn't really *teach* us anything.
tom
--
I don't know what a ceilidh is and I dread to ask. I'm picturing some kind
of mechanical contraption with dildos attached to pistons. -- Jonathan M