Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > hash() yields different results for different platforms

Reply
Thread Tools

hash() yields different results for different platforms

 
 
Qiangning Hong
Guest
Posts: n/a
 
      07-11-2006
I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
hash function depend on machine's word length?

If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique
key?

I will do some benchmarking to find it out. But while making my hands
dirty, I would like to hear some advice from experts here.

 
Reply With Quote
 
 
 
 
Paul Rubin
Guest
Posts: n/a
 
      07-12-2006
"Qiangning Hong" <(E-Mail Removed)> writes:
> However, when I come to Python's builtin hash() function, I found it
> produces different values in my two computers! In a pentium4,
> hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
> hash function depend on machine's word length?


The hash function is unspecified and can depend on anything the
implementers feel like. It may(?) even be permitted to differ from
one run of the interpreter to another (I haven't checked the spec for
this). Don't count on it being consistent from machine to machine.

> If it does, I must consider another hash algorithm because the spider
> will run concurrently in several computers, some are 32-bit, some are
> 64-bit. Is md5 a good choice? Will it be too slow that I have no
> performance gain than using the "url" column directly as the unique key?


If you're going to accept the overhead of an SQL database you might as
well enjoy the use of the abstraction it gives you, instead of trying
to implement what amounts to your own form of indexing instead of
letting the db take care of it. But md5(url) is certainly very fast
compared with processing the outgoing http connection that you
presumably plan to open for each url.

> I will do some benchmarking to find it out.


That's the right way to answer questions like this.
 
Reply With Quote
 
 
 
 
Grant Edwards
Guest
Posts: n/a
 
      07-12-2006
On 2006-07-11, Qiangning Hong <(E-Mail Removed)> wrote:

> I'm writing a spider. I have millions of urls in a table (mysql) to
> check if a url has already been fetched. To check fast, I am
> considering to add a "hash" column in the table, make it a unique key,
> and use the following sql statement:
> insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> to add new url.
>
> I believe this will be faster than making the "url" column unique key
> and doing string comparation. Right?


I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).

> However, when I come to Python's builtin hash() function, I
> found it produces different values in my two computers! In a
> pentium4, hash('a') -> -468864544; in a amd64, hash('a') ->
> 12416037344. Does hash function depend on machine's word
> length?


Apparently.

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>> hex(12416037344)

'0x2E40DB1E0L'
>>> hex(-468864544 & 0xffffffffffffffff)

'0xFFFFFFFFE40DB1E0L'

>>> hex(12416037344 & 0xffffffff)

'0xE40DB1E0L'
>>> hex(-468864544 & 0xffffffff)

'0xE40DB1E0L'

--
Grant Edwards grante Yow! Uh-oh!! I forgot
at to submit to COMPULSORY
visi.com URINALYSIS!
 
Reply With Quote
 
Carl Banks
Guest
Posts: n/a
 
      07-12-2006
Grant Edwards wrote:
> On 2006-07-11, Qiangning Hong <(E-Mail Removed)> wrote:
>
> > I'm writing a spider. I have millions of urls in a table (mysql) to
> > check if a url has already been fetched. To check fast, I am
> > considering to add a "hash" column in the table, make it a unique key,
> > and use the following sql statement:
> > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> > to add new url.
> >
> > I believe this will be faster than making the "url" column unique key
> > and doing string comparation. Right?

>
> I doubt it will be significantly faster. Comparing two strings
> and hashing a string are both O(N).


Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen every
search. Conceivably, comparing hash strings (which is O(1)) could
result in a big savings compared to comparing regular strings; but I
expect most decent sql implementations already hash data internally, so
rolling your own hash would be useless at best.

If the OP's database is lacking, md5 is probably fine. Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
risk of more collisions. Probably a good trade off unless the DB is
humungous.


Carl Banks

 
Reply With Quote
 
Dennis Lee Bieber
Guest
Posts: n/a
 
      07-12-2006
On 11 Jul 2006 16:33:50 -0700, "Qiangning Hong" <(E-Mail Removed)>
declaimed the following in comp.lang.python:

>
> I believe this will be faster than making the "url" column unique key
> and doing string comparation. Right?
>

Given that the queries pass the data as strings, you're trading an
integer->string, string->integer conversion pair on each search,
followed by an index file look-up, for a straight index file look-up.

Granted, my college days were a quarter century ago, but at the
time, the data structures class required implementing a hash table
look-up program (hashed-head, multiple-linked list). Hashes weren't
guaranteed to be unique -- the hash only gave an entry point from which
one then performed string comparisons. *

You could maybe use MD5 encoding if you don't want the pure URL in
the look-up, and MySQL, at least, has a built-in for MD5 so you wouldn't
even have to worry about "differences".


* The Amiga file-system was a hashed-head, multiple-linked list -- the
only place I'd ever encountered one outside of that 3rd year college
course. File-names were hashed into an index into a directory block; the
directory block had pointers to file-header/subdirectory blocks... those
blocks then had a linked list to other fh/sd blocks that shared the hash
value.
--
Wulfraed Dennis Lee Bieber KD6MOG
http://www.velocityreviews.com/forums/(E-Mail Removed) (E-Mail Removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (E-Mail Removed))
HTTP://www.bestiaria.com/
 
Reply With Quote
 
Dennis Lee Bieber
Guest
Posts: n/a
 
      07-12-2006
On 11 Jul 2006 21:37:46 -0700, "Carl Banks"
<(E-Mail Removed)> declaimed the following in
comp.lang.python:


>
> If the OP's database is lacking, md5 is probably fine. Perhaps using a
> subset of the md5 (the low 32 bits, say) could speed up comparisons at
> risk of more collisions. Probably a good trade off unless the DB is
> humungous.
>

And if there are collisions, a unique index on the hash is not
usable...
--
Wulfraed Dennis Lee Bieber KD6MOG
(E-Mail Removed) (E-Mail Removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (E-Mail Removed))
HTTP://www.bestiaria.com/
 
Reply With Quote
 
Qiangning Hong
Guest
Posts: n/a
 
      07-12-2006
Grant Edwards wrote:
> On 2006-07-11, Qiangning Hong <(E-Mail Removed)> wrote:
> > However, when I come to Python's builtin hash() function, I
> > found it produces different values in my two computers! In a
> > pentium4, hash('a') -> -468864544; in a amd64, hash('a') ->
> > 12416037344. Does hash function depend on machine's word
> > length?

>
> Apparently.
>
> The low 32 bits match, so perhaps you should just use that
> portion of the returned hash?
>
> >>> hex(12416037344)

> '0x2E40DB1E0L'
> >>> hex(-468864544 & 0xffffffffffffffff)

> '0xFFFFFFFFE40DB1E0L'
>
> >>> hex(12416037344 & 0xffffffff)

> '0xE40DB1E0L'
> >>> hex(-468864544 & 0xffffffff)

> '0xE40DB1E0L'


Is this relationship (same low 32 bits) guaranteed? Will it change in
the future version?

 
Reply With Quote
 
Tim Peters
Guest
Posts: n/a
 
      07-12-2006
[Grant Edwards]
>> ...
>> The low 32 bits match, so perhaps you should just use that
>> portion of the returned hash?
>>
>> >>> hex(12416037344)

>> '0x2E40DB1E0L'
>> >>> hex(-468864544 & 0xffffffffffffffff)

>> '0xFFFFFFFFE40DB1E0L'
>>
>> >>> hex(12416037344 & 0xffffffff)

>> '0xE40DB1E0L'
>> >>> hex(-468864544 & 0xffffffff)

>> '0xE40DB1E0L'


[Qiangning Hong]
> Is this relationship (same low 32 bits) guaranteed?


No. Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.

> Will it change in the future version?


That's possible, but not planned. Note that the guts of string
hashing in CPython today is implemented via

while (--len >= 0)
x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
 
Reply With Quote
 
Fredrik Lundh
Guest
Posts: n/a
 
      07-12-2006
Qiangning Hong wrote:

> /.../ add a "hash" column in the table, make it a unique key


at this point, you should have slapped yourself on the forehead, and gone
back to the drawing board.

</F>



 
Reply With Quote
 
Nick Vatamaniuc
Guest
Posts: n/a
 
      07-12-2006
Using Python's hash as column in the table might not be a good idea.
You just found out why. So you could instead just use the base url and
create an index based on that so next time just quickly get all urls
from same base address then do a linear search for a specific one, or
even easier, implement your own hashes without using any of the
Python's built-in hash() functions. For example, transform each
character to an int and multply them all mod 2^32-1 or something like
that. Even better I think someone already posted the Python's way of
generating hashes for string, well, just re-implement it in Python such
that your version will yield the same hash on any platform.

Hope this helps,
Nick V.

Qiangning Hong wrote:
> I'm writing a spider. I have millions of urls in a table (mysql) to
> check if a url has already been fetched. To check fast, I am
> considering to add a "hash" column in the table, make it a unique key,
> and use the following sql statement:
> insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> to add new url.
>
> I believe this will be faster than making the "url" column unique key
> and doing string comparation. Right?
>
> However, when I come to Python's builtin hash() function, I found it
> produces different values in my two computers! In a pentium4,
> hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does
> hash function depend on machine's word length?
>
> If it does, I must consider another hash algorithm because the spider
> will run concurrently in several computers, some are 32-bit, some are
> 64-bit. Is md5 a good choice? Will it be too slow that I have no
> performance gain than using the "url" column directly as the unique
> key?
>
> I will do some benchmarking to find it out. But while making my hands
> dirty, I would like to hear some advice from experts here.


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Converting Floats to Strings yields erratic results Dirk T. Shelley C Programming 29 06-10-2011 10:49 PM
RE: hash() yields different results for different platforms Kerry, Richard Python 2 07-13-2006 01:16 PM
Problem with method that starts process, yields pid then yields return code x1 Ruby 11 12-06-2005 01:30 AM
double to int conversion yields strange results =?ISO-8859-1?Q?Bj=F8rn_Augestad?= C Programming 31 02-16-2005 08:39 PM
File System Search on an asp file yields not results Rafael Nenninger ASP General 2 11-01-2004 10:05 PM



Advertisments