Patricia Shanahan wrote:
> Lew wrote:
>>> On Tue, 11 Nov 2008 00:39:29 -0800, <(E-Mail Removed)> wrote:
>>>
>>
>> Please include a body to your posts.

He did -- he said " 3ks". Not intelligible, but it does seem to
constitute a body.

>> Peter Duniho wrote:
>>> I definitely would not want the hash codes to be 0 and 1.
>>
>> Why not?
>
> The Integer hashCode() is the primitive int the Integer represents.

Bletch!

That will have a pretty poor distribution of hash values, because the
integer values that arise in actual running code don't tend to be
completely randomly distributed. Integers richer in factors are more
common, as are smaller integers.

Multiplying the integer's value by a predetermined prime number near
(Integer.MAX_VALUE times 1.6180339) would have been a much better
choice of hash function, scattering the hash values pretty randomly
about the full range of integer values, varying a lot in each bit
position, whatever the input distribution. (Note that that multiplier
would actually be negative, due to wrap-around; technically, it wouldn't
be prime but its negation would be. 1.6180339 is close to the golden
ratio, the "most irrational" number, producing a much more uniform
distribution that won't "resonate" with Integer.MAX_VALUE and produce
distinct clumps. A multiplier near MAX_VALUE itself would produce
clumps around 0, MAX_VALUE, and -MAX_VALUE-1; one near half of
MAX_VALUE would produce them there and at MAX_VALUE/2 and -MAX_VALUE/2;
and so on, given that the input is biased towards small integers.)
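A sketch of that multiplicative scheme in Java (the constant here,
0x9E3779B9, is the widely used 32-bit golden-ratio multiplier -- and,
as noted, negative when read as a signed int; any similar constant
would serve):

```java
public final class IntHash {
    // 2^32 divided by the golden ratio, truncated: 0x9E3779B9.
    // As a signed 32-bit int this is -1640531527.
    private static final int GOLDEN = 0x9E3779B9;

    /** Multiplicative hash: spreads nearby inputs far apart, and varies
     *  the high bits even when only the low bits of the input change. */
    public static int hash(int x) {
        return x * GOLDEN; // int multiplication wraps modulo 2^32
    }

    public static void main(String[] args) {
        for (int k : new int[] {14, 29, 33, 57, 193}) {
            System.out.println(k + " -> " + hash(k));
        }
    }
}
```

Note that consecutive inputs come out roughly 1.6 billion apart
(mod 2^32), rather than adjacent as with the identity hash.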

Note that to the extent that the input just isn't very diverse, neither
will be the hash values, however distributed; but the farther apart the
distinct hash values tend to be, and the more evenly scattered, the
better hash tables will perform within the constraints on their
performance imposed by that low diversity.

Integer as a key to a hash table isn't a ridiculously unlikely case,
either -- it can be more efficient than an array if most of the array's
cells are going to be nulls or zeros, and it's going to be large, and it
can arise if you have heterogeneous keys some of which turn out to be
integers. A sparse vector is one possible case -- for a hash table to be
more efficient than an array for this job, the vector may need to have
hundreds of dimensions, but even then the integer keys are distributed
towards very small numbers in comparison to Integer.MAX_VALUE. Those
from zero to 999 comprise less than one part in four million of the
total range.
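For concreteness, such a sparse vector might look something like this
(a hypothetical class for illustration, not from any library; it keeps
only the non-zero components in a HashMap keyed by Integer index):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a sparse vector backed by a HashMap<Integer, Double>,
 *  storing only non-zero components instead of a mostly-zero array. */
public class SparseVector {
    private final Map<Integer, Double> entries = new HashMap<>();

    public void set(int index, double value) {
        if (value == 0.0) entries.remove(index); // keep the map sparse
        else entries.put(index, value);
    }

    public double get(int index) {
        return entries.getOrDefault(index, 0.0); // absent means zero
    }

    public int nonZeroCount() {
        return entries.size();
    }
}
```

Every lookup boxes the index into an Integer and hashes it -- so the
quality of Integer's hashCode() directly governs how those small,
clustered indices spread over the table's buckets.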

This is what the present hash function does to a sparse vector with ten
non-zero values out of a thousand, put in a hash table with 16 buckets
(more than the 13 or so needed to hold ten entries at a 75% load
factor). If the occupied positions are 14, 29, 33, 57, 193, 385, 772,
807, 901, and 917, you get these bucket indices (hash mod 16): 14, 13,
1, 9, 1, 1, 4, 7, 5, and 5. Quite a few hash collisions.
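Those numbers are easy to reproduce -- assuming the table takes
hashCode() modulo a power-of-two bucket count with no supplemental
bit-scrambling step (some HashMap implementations do add one):

```java
public class BucketDemo {
    /** Bucket index for a power-of-two table: the low bits of the hash. */
    static int bucket(int hash, int buckets) {
        return hash & (buckets - 1);
    }

    public static void main(String[] args) {
        int[] keys = {14, 29, 33, 57, 193, 385, 772, 807, 901, 917};
        for (int k : keys) {
            // Integer's hashCode() is the identity, so each key is its
            // own hash; with 16 buckets, only the low four bits matter.
            System.out.println(k + " -> bucket " + bucket(k, 16));
        }
        // 33, 193, and 385 all land in bucket 1; 901 and 917 collide in
        // bucket 5.
    }
}
```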

It gets worse if there's a tendency for the integers used as keys to
have biased low-order bits. It can get much worse.
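For example, with an identity hash and a power-of-two bucket count,
keys that are all multiples of 16 -- think offsets or sizes measured in
16-byte units (hypothetical keys, chosen to illustrate the worst case)
-- pile into a single bucket:

```java
public class BiasedKeys {
    /** 16-bucket table, identity hash: bucket = low four bits of the key. */
    static int bucket(int key) {
        return key & 15;
    }

    public static void main(String[] args) {
        // Keys whose low-order bits are all zero:
        int[] keys = {16, 32, 48, 256, 512, 4096};
        for (int k : keys) {
            // Every one of these maps to bucket 0, so the table
            // degenerates into a single linked chain.
            System.out.println(k + " -> bucket " + bucket(k));
        }
    }
}
```

A multiplicative hash avoids this, because the multiplication carries
the varying high-order bits of the key down into every bit position of
the result.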