Velocity Reviews > Java > Baeza-Yates intersection Java implementation

# Baeza-Yates intersection Java implementation

Sebastian
Guest
Posts: n/a

 12-01-2010
I am trying to implement the Baeza-Yates intersection algorithm, but
need some help.

This algorithm is among the fastest known algorithms for intersecting
sorted sets. For small m, Baeza-Yates is O(m * log(n)), while for n =
alpha * m it is O(n), i. e. it performs particularly well when the
one set is sufficiently larger than the other.

E. g., see "Experimental Analysis of a Fast Intersection Algorithm for
Sorted Sequences" by Ricardo Baeza-Yates and Alejandro Salinger

http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

There is a C implementation by Erik Frey, see
http://fawx.com/
http://github.com/erikfrey/themas/tr..._intersection/

Unfortunately, I cannot read C very well. Below is my attempt to come
up with a Java version, which does not work (it generates stack
overflows, and my attempts to repair that led to other incorrect
behavior. Also, lowerBound should probably use a binary search).

Perhaps it would be worthwhile to put together a working version
and post it somewhere for everyone's benefit.

-- Sebastian

import java.util.ArrayList;
import java.util.List;
import java.util.Comparator;

public class Intersection

/**
* Finds the intersection of two sorted random access lists.
* @param <T> type of the list elements
* @param list1 a random access list of T sorted by cmp
* @param list2 a random access list of T sorted by cmp
* @param cmp a Comparator for T's
* @return a random access list of T sorted by cmp containing the
intersection of list1 and list2
*/
public static <T> List<T> intersect( List<T> list1, List<T> list2,
Comparator<T> cmp )
{
int size1 = list1.size();
int size2 = list2.size();
int maxSize = Math.max( size1, size2 );
return baeza_intersect( list1, 0, (size1 - 1), list2, 0, (size2 -
1), cmp, new ArrayList<T>( maxSize ) );
}

private static <T> List<T> baeza_intersect( List<T> list1, int
begin1, int end1, List<T> list2, int begin2, int end2,
Comparator<T> cmp,
List<T> out )
{
if( (end1 - begin1) <= (end2 - begin2) ) {
if( begin1 > end1 ) {
return out;
}
// binary probe (could use interpolation probe if there were a
// distance measure on T
int probe1 = begin1 + ((end1 - begin1) / 2);
int probe2 = lowerBound( list2, begin2, end2, list1.get( probe1
), cmp );
if( probe1 < end1 || probe1 < end2 ) {
baeza_intersect( list1, begin1, probe1, list2, begin2, probe2,
cmp, out );
}
if( (probe2 != end2) &&(cmp.compare( list1.get( probe1 ),
list2.get( probe2 ) ) == 0) ) {
}
probe1++;
probe2++;
return baeza_intersect( list1, probe1, end1, list2, probe2, end2,
cmp, out );
}
else {
return baeza_intersect( list2, begin2, end2, list1, begin1, end1,
cmp, out );
}
}

private static <T> int lowerBound( List<T> list, int begin, int end,
T value, Comparator<T> cmp )
{
// should use binary search?
int i = begin;

while( (i < end) && (cmp.compare( list.get( i ), value ) < 0) ) {
i++;
}

return i;
}
}

Sebastian
Guest
Posts: n/a

 12-01-2010
sorry, there's typo in the OP. The condition enclosing the

if( probe1 < end1 || probe2 < end2 ) {

That gets rid of the stack overflow, but it's still incorrect.

Consider what happens when the only element in the intersection
should be the last element of list2, and probe2 == end2. That
case is not covered, but removing the check probe2 != end2 before
adding an element to the output may lead to the same element being
included twice.

-- Sebastian

Tom Anderson
Guest
Posts: n/a

 12-01-2010
On Wed, 1 Dec 2010, Sebastian wrote:

> I am trying to implement the Baeza-Yates intersection algorithm, but
> need some help.
>
> This algorithm is among the fastest known algorithms for intersecting
> sorted sets. For small m, Baeza-Yates is O(m * log(n)), while for n = alpha *
> m it is O(n), i. e. it performs particularly well when the
> one set is sufficiently larger than the other.
>
> E. g., see "Experimental Analysis of a Fast Intersection Algorithm for Sorted
> Sequences" by Ricardo Baeza-Yates and Alejandro Salinger
>
> http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
>
> There is a C implementation by Erik Frey, see
> http://fawx.com/
> http://github.com/erikfrey/themas/tr..._intersection/
>
> Unfortunately, I cannot read C very well. Below is my attempt to come
> up with a Java version, which does not work (it generates stack
> overflows, and my attempts to repair that led to other incorrect
> behavior. Also, lowerBound should probably use a binary search).

I hadn't come across this algorithm before, so thanks for bringing it to
my attention.

Firstly, you may be interested in this paper:

http://www.siam.org/proceedings/alen...07sandersp.pdf

Which says that although Baeza-Yates is very clever, it actually isn't
that fast in practice. The algorithm they give which handily beats it is
very badly explained (they fall into horrible computer science notation to
explain what is very much a programmer's algorithm), but quite simple. The
key point, i think, is that if you stick to a fairly simple
iterate-over-the-lists approach, you can use a modestly clever compressed
representation which lets you get the result faster than using an
ingenious but complex algorithm on an uncompressed representation.

That algorithm is written in terms of intersecting sets of numbers, but i
think most of what they do can be applied to sets of comparable objects -
you can't use the bucketed structure they describe, but you can do
something pretty similar without compression. Or you could use hashcodes,
and use exactly their algorithm.

Which is not to say BY isn't an interesting thing to implement anyway, of
course.

Secondly, i think you might have better luck by starting with the theory
and writing your own implementation than by trying to read Mr Frey's code.
I'm sure his code is very good, but writing is often easier than
translating. It's certainly easier to know where your head is at while
doing it.

Thirdly, i think you might be interested in the subList method of List.
Specifically, because it will let you write the recursive method in terms
of whole lists, rather than having to juggle bounds and offsets and so on.
I imagine some simple but hard to spot mistake in those bounds and offsets
is behind the infinite loop which is causing your stack overflow.

Fourthly, i don't have anything specific to say on your implementation -
i'm afraid to say i don't have time to get into the details of debugging
someone else's code right now. I urge you to have a go with subList,
though, and see if it makes things clearer.

tom

--
I have often thought that the questions which we were unable to ask and
the speeches that we never had a chance to deliver were probably the
best ones. -- John Wilkinson MP, on Parliament

Tom Anderson
Guest
Posts: n/a

 12-02-2010
On Thu, 2 Dec 2010, bugbear wrote:

> Tom Anderson wrote:
>
>> Firstly, you may be interested in this paper:
>>
>> http://www.siam.org/proceedings/alen...07sandersp.pdf
>>
>> Which says that although Baeza-Yates is very clever, it actually isn't
>> that fast in practice. The algorithm they give which handily beats it is
>> very badly explained (they fall into horrible computer science notation
>> to explain what is very much a programmer's algorithm)

>
> I'm not I understand your notion of a conflict between "computer science"
> and "programmer".

There are more greek letters and subscripts in computer science.

tom

--
Remember when we said there was no future? Well, this is it.

Leonardo Azpurua
Guest
Posts: n/a

 12-03-2010

"bugbear" <bugbear@trim_papermule.co.uk_trim> escribió en el mensaje
news:(E-Mail Removed) o.uk...
> Tom Anderson wrote:
>
>>
>> Firstly, you may be interested in this paper:
>>
>> http://www.siam.org/proceedings/alen...07sandersp.pdf
>>
>> Which says that although Baeza-Yates is very clever, it actually isn't
>> that fast in practice. The algorithm they give which handily beats it is
>> very badly explained (they fall into horrible computer science notation
>> to explain what is very much a programmer's algorithm)

>
> I'm not I understand your notion of a conflict between "computer science"
> and "programmer".

It is not much of a conflict.

In fact, many programmers began their careers as computer scientists.

Tom Anderson
Guest
Posts: n/a

 12-04-2010
On Thu, 2 Dec 2010, Patricia Shanahan wrote:

> On 12/2/2010 1:09 AM, bugbear wrote:
>> Tom Anderson wrote:
>>
>>> Firstly, you may be interested in this paper:
>>>
>>> http://www.siam.org/proceedings/alen...07sandersp.pdf
>>>
>>> Which says that although Baeza-Yates is very clever, it actually isn't
>>> that fast in practice. The algorithm they give which handily beats it is
>>> very badly explained (they fall into horrible computer science notation
>>> to explain what is very much a programmer's algorithm)

>>
>> I'm not I understand your notion of a conflict between "computer science"
>> and "programmer".

>
> Maybe the distinction Tom is making is between algorithms that are
> primarily of interest to computer scientists studying issues such as the
> minimum computational complexity for a specific problem, and those
> algorithms that are useful in practical programs.

Something along those lines. It's more that the algorithm they describe is
not very intellectually interesting - it boils down to iterating over a
list of numbers stored in a two-level array using the obvious pair of
nested loops. There's no deep insight, clever tricks, exploitation of
mathematical truths, relation to a more general space of problems, etc.
Just a couple of loops that actually run pretty fast. It's the kind of
algorithm an experienced programmer who knew a bit about the memory
hierarchy could write without any background in computer science (i
consider memory hierarchy to be electronic engineering!).

FWIW, the data structure and algorithm in the paper is basically:

// assume all values in the lists are are positive or zero

// i will not use any bit compression, just store full-size ints
// this misses some of the point of the algorithm, but it's fiddly, and
// doesn't add anything to the explanation

// utility functions

final int LOW_BITS = 14; // or whatever

int lowBits(int i) {
return i & ((1 << LOW_BITS) - 1); // extract the bottom bits
}

int highBits(int i) {
return i >> LOW_BITS;
}

// i am using array chopping and changing for simplicity
// you probably wouldn't want to implement it like this
// both these functions are horrific

int[] subarray(int[] array, int start, int end) {
int[] subarray = new int[end - start];
System.arraycopy(array, start, subarray, 0, subarray.length);
return subarray;
}

int[] ensureCapacity(int[] array, int size) {
if (array.length >= size) return array;
int[] newArray = new int[size];
System.arraycopy(array, 0, newArray, 0, array.length);
return newArray;
}

// code to build the structure

int[] buildDeltaChain(int[] sortedNumbers) {
int[] chain = new int[sortedNumbers.length];
int prev = 0;
for (int i = 0; i < sortedNumbers.length; ++i) {
int cur = lowBits(sortedNumbers[i]);
n[i] = cur - prev;
prev = cur;
}
}

int[][] buildTwoLevelTable(int[] sortedNumbers) {
int start = 0;
int[][] deltaChains = new int[0][]; // 50% chance this is the wrong syntax, sorry
while (start < sortedNumbers.length) {
int highBits = highBits(sortedNumbers[start]);
int end = start + 1;
while (highBits(sortedNumbers[end]) == highBits) ++end;
int[] deltaChain = buildDeltaChain(subarray(sortedNumbers, start, end));
deltaChains = ensureCapacity(deltaChains, (highBits + 1));
deltaChains[highBits] = deltaChain;
start = end;
}
}

// code to intersect another sorted list with the structure

final int REMOVED = -1; // we will rub out anything not in the structure by overwriting it with this

public void intersect(int[] sortedNumbers) {
int currentChainHighBits = -1;
int[] currentChain = null;
int currentChainIndex = -1; // always points to the next element to look at
int currentChainValue = -1;
for (int i = 0; i < sortedNumbers.length; ++i) {
int n = sortedNumbers[i];
int highBits = highBits(n);
if (highBits != currentChainHighBits) {
// move to the head of a new chain
currentChainHighBits = highBits;
currentChain = deltaChains[highBits];
currentChainIndex = 0;
currentChainValue = currentChain[0];
}
int lowBits = lowBits(n);
while (currentChainValue < lowBits) {
++currentChainIndex;
if (currentChainIndex >= currentChain.length) {
currentChainValue = Integer.MAX_VALUE; // this will make following queries on this chain finish immediately
break;
}
currentChainValue = currentChainValue + currentChain[currentChainIndex];
}
if (currentChainValue != n) sortedNumbers[i] = REMOVED;
}
}

This is off the top of my head without looking at the paper again, so
there could well be syntax errors, bugs, and other deviances.

Thinking about it, you probably want to store all the delta chains in one
big array, as one big delta chain, and then have the upper level of the
structure be an index on it: for each value of highBits, it would store
the lowBits value and index in the chain of the first such element.

Anyway, my point is that if you look through the above code, there's
nothing terribly clever - just some fairly workmanlike looping over
arrays, with some delta encoding thrown in. It's a good and effective
algorithm, but it doesn't really *teach* us anything.

tom

--
I don't know what a ceilidh is and I dread to ask. I'm picturing some kind
of mechanical contraption with dildos attached to pistons. -- Jonathan M

BGB
Guest
Posts: n/a

 12-04-2010
On 12/2/2010 2:09 AM, bugbear wrote:
> Tom Anderson wrote:
>
>>
>> Firstly, you may be interested in this paper:
>>
>> http://www.siam.org/proceedings/alen...07sandersp.pdf
>>
>> Which says that although Baeza-Yates is very clever, it actually isn't
>> that fast in practice. The algorithm they give which handily beats it is
>> very badly explained (they fall into horrible computer science notation
>> to explain what is very much a programmer's algorithm)

>
> I'm not I understand your notion of a conflict between "computer science"
> and "programmer".
>

many "computer science" people are more math-heads than programmers.

they may know their way around their derivatives and integrals really
well, but fall on their face when it comes to writing any code in an
actual programming language.

many programmers can write piles of code, but fail to have any
understanding of all this theory, and despite beating together a
generally working rigid body physics engine, and later bomb hard core in
Physics and Calculus classes because of an inability to really get their
heads around set-theory and continuous systems (or much of anything in
math-land much beyond algebra or pre-algebra...).

a computer science person will describe type-systems with elaborate
notations (with so many greek letters), and promote concepts such as the
Hindley-Milner type system.

the actual programmer may write a compiler, generally representing types
as objects containing the relevant pieces of information (base-type
number, number of array levels, ...), and figure "well, types come from
the declarations and flow from inwards to outwards".

a computer scientist may want to eliminate all type declarations since
they feel they are ugly and obscure the logic of the program or hurt
generality or similar.

the programmer may like the type declarations since it makes it more
obvious which things go in which arguments and what one gets back, and
because it makes the compiler start more readily spewing error messages
if they mess up (vs silently accepting the program and doing something
strange as a result), ...

actually, the computer scientist may actually end up thinking using
Haskell to implement software is a good idea, and a programmer may
conclude (after looking at or trying to use the language) that in fact
it would be a very bad idea...

"them be strange folk who hate the nulls and the assignments... them
rather use them their unions and infinite recursions instead.".

....

or such...