Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Ruby (http://www.velocityreviews.com/forums/f66-ruby.html)
-   -   String Hashing Algorithms (http://www.velocityreviews.com/forums/t821694-string-hashing-algorithms.html)

Phrogz 05-10-2005 07:58 PM

String Hashing Algorithms
 
Summary
================================================== ===========
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.


Background
================================================== ===========
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.


Methodology
================================================== ===========
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.

I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.


The Results
================================================== ===========
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).

Here is the output of my test:
djb :: 99.8601 percent coverage (60 collisions out of 42884)
elf :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js :: 99.9044 percent coverage (41 collisions out of 42884)
ap :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs :: 100.0000 percent coverage (0 collisions out of 42884)


The Algorithms
================================================== ===========
module HashEm
SIGNEDSHORT = 0x7FFFFFFF

def self.rs( str, len=str.length )
a,b = 63689,378551
hash = 0
len.times{ |i|
hash = hash*a + str[i]
a *= b
}
hash & SIGNEDSHORT
end

def self.js( str, len=str.length )
hash = 1315423911
len.times{ |i|
hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
}
hash & SIGNEDSHORT
end

def self.elf( str, len=str.length )
hash = 0
x = 0
len.times{ |i|
hash = (hash << 4) + str[i]
if (x = hash & 0xF0000000) != 0
hash ^= (x >> 24)
hash &= ~x
end
}
hash & SIGNEDSHORT
end

def self.bkdr( str, len=str.length )
seed = 131 # 31 131 1313 13131 131313 etc..
hash = 0
len.times{ |i|
hash = ( hash*seed )+str[i]
}
hash & SIGNEDSHORT
end

def self.sdbm( str, len=str.length )
hash = 0
len.times{ |i|
hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
}
hash & SIGNEDSHORT
end

def self.djb( str, len=str.length )
hash = 5381
len.times{ |i|
hash = ((hash << 5) + hash) + str[i]
}
hash & SIGNEDSHORT
end

def self.ap( str, len=str.length )
hash = 0
len.times{ |i|
if (i & 1) == 0
hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
else
hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
end
}
hash & SIGNEDSHORT
end
end



[1] http://www.code-source.org/algorithm.php?id=62


Joel VanderWerf 05-10-2005 08:17 PM

Re: String Hashing Algorithms
 
Phrogz wrote:
> Summary
> ================================================== ===========
> A brief analysis of Ruby versions of a few string hashing algorithms,
> and my conclusions about their efficacy at preventing collisions.


Is ruby's own String#hash one of the ones you tested? If not, how does
it compare?



Phrogz 05-10-2005 09:24 PM

Re: String Hashing Algorithms
 
Joel VanderWerf wrote:
> Phrogz wrote:
> > A brief analysis of Ruby versions of a few string hashing

algorithms,
> > and my conclusions about their efficacy at preventing collisions.

>
> Is ruby's own String#hash one of the ones you tested? If not, how

does
> it compare?


Because I was interested in C algorithms, I didn't consider it. But, I
just ran it again, and ruby came out with the best of them:

ruby :: 100.0000 percent coverage (0 collisions out of 42884)


Joel VanderWerf 05-10-2005 09:32 PM

Re: String Hashing Algorithms
 
Phrogz wrote:
> Joel VanderWerf wrote:
>
>>Phrogz wrote:
>>
>>>A brief analysis of Ruby versions of a few string hashing

>
> algorithms,
>
>>>and my conclusions about their efficacy at preventing collisions.

>>
>>Is ruby's own String#hash one of the ones you tested? If not, how

>
> does
>
>>it compare?

>
>
> Because I was interested in C algorithms, I didn't consider it. But, I
> just ran it again, and ruby came out with the best of them:
>
> ruby :: 100.0000 percent coverage (0 collisions out of 42884)
>


It is coded in C, and looking at rb_str_hash() in string.h, I see that
it is one of HASH_ELFHASH, HASH_PERL, or a third (very simple) one.



Phrogz 05-10-2005 10:17 PM

Re: String Hashing Algorithms
 
How can I figure out which hash it's using? It's not the ELF hash,
since that was the worst performer of the lot. Searching the source for
HASH_PERL I don't see it #defined anywhere, but my C is really, really
rusty. Does this mean that it's using the 'fallback' hash algorithm?
Does it vary by build per platform?


Zed A. Shaw 05-10-2005 10:38 PM

Re: String Hashing Algorithms
 
On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:
>
> Background
> ================================================== ===========
> At work, we have some (young) code which is currently doing a lot of
> string compares, which needs to be really fast. We're about to try
> using hashes of the strings to represent each string instead, but
> instead of a traditional hash table (which 'chains' collisions for the
> same key) we're going to require that no two (different) strings may
> share the same hash key.


Just out of curiosity, but have you considered some other
structures/algorithms which might be alternatives depending on your
usage? Off the top of my head I can think of:

* Trie -- Should find other strings really fast, but gets pretty big the
more strings you need to store. There's a C library for this at
http://www.octavian.org/cs/software.html
* PATRICIA -- Basically a compacted Trie which takes less space.
Couldn't find a library for this one.
* Suffix Array -- I have a Ruby binding for one of the faster C
libraries which I use in FastCST. The big advantage of a Suffix Array
is that you can store it so that you only need to calculate the suffix
array once. A suffix array is really more useful for matching and
exclusion.
* Suffix Tree -- There's really not much of a reason to use suffix trees
these days since newer suffix array construction algorithms are faster.
The main advantage of a suffix tree is that searching for a result can
be faster. The main disadvntages are that they are fat memory pigs.
* Bloom Filter -- These are not as acurate, but they can be fast as all
hell if you just want to match strings with some probability.

Anyway, just thought I'd throw out some alternatives to hash tables for
certain situations. I'd say if you need to match C++ keywords to a
stored index then take a look at the libtst Trie implementation. It's
quite fast for that application. If you just need to see if a certain
word is "included" or "excluded" than a suffix array could do that
really fast. Trick there is to build the suffix array by joining the
strings with a | character (or something else) between them. Nice thing
about a suffix array is that you can build it offline and then just load
it directly for the searching.

Zed A. Shaw
http://www.zedshaw.com/





Clifford Heath 05-10-2005 11:25 PM

Re: String Hashing Algorithms
 
Phrogz wrote:
> Summary


Good summary and test, Phrogz.

Don't forget that in many hashing algorithms, the cost of computing
the hash function far outweighs the occasional cost of traversing a
chain. If you need guaranteed lookup time, you need a "perfect hash"
(try Googling on that) but otherwise you just need a fast hash function
with an excellent spread and a low load factor. Neither of those is
hard to achieve.

We use linear hashing (note, this is not the same as linear open
addressing, a much older technique), which achieves excellent
performance with any size hash table. Unfortunately I have no
implementation I can publish.

Clifford Heath.

Yukihiro Matsumoto 05-10-2005 11:58 PM

Re: String Hashing Algorithms
 
Hi,

In message "Re: String Hashing Algorithms"
on Wed, 11 May 2005 07:20:26 +0900, "Phrogz" <gavin@refinery.com> writes:

|How can I figure out which hash it's using?

We are using the last one, unless specifically defined.

matz.



Charles Mills 05-11-2005 05:16 AM

Re: String Hashing Algorithms
 

Zed A. Shaw wrote:
> On Wed, 2005-05-11 at 05:00 +0900, Phrogz wrote:
> >
> > Background
> > ================================================== ===========
> > At work, we have some (young) code which is currently doing a lot

of
> > string compares, which needs to be really fast. We're about to try
> > using hashes of the strings to represent each string instead, but
> > instead of a traditional hash table (which 'chains' collisions for

the
> > same key) we're going to require that no two (different) strings

may
> > share the same hash key.

>
> Just out of curiosity, but have you considered some other
> structures/algorithms which might be alternatives depending on your
> usage? Off the top of my head I can think of:
>
> * Trie -- Should find other strings really fast, but gets pretty big

the
> more strings you need to store. There's a C library for this at
> http://www.octavian.org/cs/software.html
> * PATRICIA -- Basically a compacted Trie which takes less space.
> Couldn't find a library for this one.
> * Suffix Array -- I have a Ruby binding for one of the faster C
> libraries which I use in FastCST. The big advantage of a Suffix

Array
> is that you can store it so that you only need to calculate the

suffix
> array once. A suffix array is really more useful for matching and
> exclusion.
> * Suffix Tree -- There's really not much of a reason to use suffix

trees
> these days since newer suffix array construction algorithms are

faster.
> The main advantage of a suffix tree is that searching for a result

can
> be faster. The main disadvntages are that they are fat memory pigs.
> * Bloom Filter -- These are not as acurate, but they can be fast as

all
> hell if you just want to match strings with some probability.
>
> Anyway, just thought I'd throw out some alternatives to hash tables

for
> certain situations. I'd say if you need to match C++ keywords to a
> stored index then take a look at the libtst Trie implementation.

It's
> quite fast for that application. If you just need to see if a

certain
> word is "included" or "excluded" than a suffix array could do that
> really fast. Trick there is to build the suffix array by joining the
> strings with a | character (or something else) between them. Nice

thing
> about a suffix array is that you can build it offline and then just

load
> it directly for the searching.
>
> Zed A. Shaw
> http://www.zedshaw.com/


You may be interested in this project:
http://www.rubyforge.org/projects/judy

Has an implementation of something like what you describe as a PATRICIA
above and a sparse hash.

-Charlie


Martin Ankerl 05-11-2005 06:04 AM

Re: String Hashing Algorithms
 
> At work, we have some (young) code which is currently doing a lot of
> string compares, which needs to be really fast. We're about to try
> using hashes of the strings to represent each string instead, but
> instead of a traditional hash table (which 'chains' collisions for the
> same key) we're going to require that no two (different) strings may
> share the same hash key.


If you know all possible strings in advance, you might want to generate
a perfect hash:
http://burtleburtle.net/bob/hash/perfect.html

martinus


All times are GMT. The time now is 06:16 AM.

Powered by vBulletin®. Copyright ©2000 - 2014, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.