
# why in class Boolean, hashcode() of "true" is 1231 and of "false" is 1237?

public boolean
Guest
Posts: n/a

 11-11-2008
Patricia Shanahan wrote:
> Lew wrote:
>>> On Tue, 11 Nov 2008 00:39:29 -0800, <(E-Mail Removed)> wrote:
>>>

>>
>> Please include a body to your posts.

He did -- he said " 3ks". Not intelligible, but it does seem to
constitute a body.

>> Peter Duniho wrote:
>>> I definitely would not want the hash codes to be 0 and 1.

>>
>> Why not?

>
> The Integer hashCode() is the primitive int the Integer represents.

Bletch!

That will have a pretty poor distribution of hash values, because the
integer values that arise in actual running code don't tend to be
completely randomly distributed. Integers richer in factors are more
common, as are smaller integers.

Multiplying the integer's value by a predetermined prime number near
Integer.MAX_VALUE times 1.6180339 would have been a much better
choice of hash function, scattering the hash values pretty randomly
about the full range of integer values and varying a lot in each bit
position, whatever the input distribution. (Note that that multiplier
would actually be negative, due to wrap-around; technically, it wouldn't
be prime, but its negation would be. 1.6180339 is close to the golden
ratio, the "most irrational" number, so the multiplier produces a much
more uniform distribution that won't "resonate" with Integer.MAX_VALUE
and produce distinct clumps. A multiplier near MAX_VALUE itself would
produce clumps around 0, MAX_VALUE, and -MAX_VALUE-1; one near half of
MAX_VALUE, clumps there and at MAX_VALUE/2 and -MAX_VALUE/2; and so
on -- given that the input is biased towards small integers.)
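The multiplicative scheme described above can be sketched concretely. The constant used here is the classic Fibonacci-hashing multiplier floor(2^32 / phi) = 0x9E3779B9, a close relative of the Integer.MAX_VALUE-times-1.618 figure in the post; the class name is made up for illustration, and this is not what Integer.hashCode() actually does:

```java
// Sketch of a golden-ratio multiplicative hash, as proposed in the post.
// 0x9E3779B9 is floor(2^32 / phi); as a signed Java int it is negative,
// matching the post's remark about wrap-around.
public class FibHash {
    static final int MULTIPLIER = 0x9E3779B9;

    static int hash(int k) {
        // Multiplication mod 2^32 mixes the key into the product's
        // high-order bits, so multiplicative hashing conventionally
        // takes bucket indices from the top bits, not the bottom ones.
        return k * MULTIPLIER;
    }

    static int bucket(int k, int bits) {
        // Top `bits` bits of the scrambled value, e.g. bits=4 for 16 buckets.
        return hash(k) >>> (32 - bits);
    }

    public static void main(String[] args) {
        for (int k : new int[] {14, 29, 33, 57, 193, 385, 772, 807, 901, 917}) {
            System.out.println(k + " -> bucket " + bucket(k, 4));
        }
    }
}
```

Even the small, clustered keys from the sparse-vector example below land in buckets determined by well-mixed high bits rather than by their own magnitudes.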

Note that to the extent the input just isn't very diverse, the hash
values won't be either, however they're distributed; but the farther
apart the distinct hash values tend to be, and the more evenly they're
scattered, the better hash tables will perform within the constraints
that low diversity imposes on them.

Integer as a key to a hash table isn't a ridiculously unlikely case,
either -- it can be more efficient than an array if most of the array's
cells are going to be nulls or zeros, and it's going to be large, and it
can arise if you have heterogeneous keys some of which turn out to be
integers. A sparse vector is one possible case -- for a hash table to be
more efficient than an array for this job, the vector may need to have
hundreds of dimensions, but even then the integer keys are distributed
towards very small numbers in comparison to Integer.MAX_VALUE. Those
from zero to 999 comprise less than one part in four million of the
total range.

This is what the present hash function does to a sparse vector with ten
non-zero values out of a thousand, put in a hash table with 16 buckets
(more than you get with a 75% load factor, 13).

If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807, 901,
and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and 5.

Quite a few hash collisions.

It gets worse if there's a tendency for the integers used as keys to
have biased low-order bits. It can get much worse.
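The ten bucket indices quoted above are easy to reproduce. A minimal sketch using the naive index h & (size - 1), which is equivalent to h % 16 here (HashMap's real lookup also stirs the bits first, as comes up downthread); the class name is made up:

```java
// Reproducing the bucket indices from the sparse-vector example:
// Integer's identity hash code reduced into 16 buckets.
public class Buckets {
    static int bucket(int key, int tableSize) {
        // tableSize must be a power of two for the mask to equal modulo.
        return Integer.valueOf(key).hashCode() & (tableSize - 1);
    }

    public static void main(String[] args) {
        int[] keys = {14, 29, 33, 57, 193, 385, 772, 807, 901, 917};
        for (int k : keys) {
            System.out.print(bucket(k, 16) + " ");
        }
        System.out.println(); // prints: 14 13 1 9 1 1 4 7 5 5
    }
}
```

Four of the ten keys share bucket 1 or bucket 5, which is the clumping the post describes.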

Lew

 11-12-2008
Patricia Shanahan wrote:
> Peter Duniho wrote:
>> On Tue, 11 Nov 2008 13:48:11 -0800, Lew <(E-Mail Removed)> wrote:
>>
>>> Lew wrote:
>>>>>> How often are Boolean hashes compared to Integer hashes?
>>>>> Answering my own question: as often as Integer and Boolean members
>>>>> appear together in a class for which one must design a hash code that
>>>>> distributes well over arbitrary-size hash lists.
>>>
>>> "Peter Duniho" wrote:
>>>> I'm not sure what you mean there. Ideally, just because a class has
>>>> both Integer and Boolean members, that wouldn't mean you'd actually
>>>> be _comparing_ Integer hashes to Boolean hashes.
>>>
>>> What I mean there is that comparison is not relevant, composition is.

>>
>> Sorry, still confused here.
>>
>> Your original question: "How often are Boolean hashes compared to
>> Integer hashes?"
>>
>> My answer to that question would be, "never, assuming a correct use of
>> hash codes".
>>
>> Yet, you compare that to this predicate: "as often as Integer and
>> Boolean members appear together in a class for which one must design a
>> hash code that distributes well over arbitrary-size hash lists".

I understand your confusion. I focused on comparison early on, and agree with
you that comparison between Booleans and Integers should be fairly rare,
occurring in the relatively uncommon situations that Patricia mentioned.
However, I then realized that comparison is not the only situation where you
need hash codes; composition is also relevant.

In that non-comparative context, where you compose the hash codes of several
class elements to get an object's overall hash, it would be useless for a
common non-null element value (say, FALSE) to hash to zero, and not very
useful to hash to one. Thus some clever person decided not to use zero and
one as the hash codes of Boolean.
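The composition Lew describes is the familiar 31-based recipe; a minimal sketch with hypothetical fields showing why a FALSE that hashed to 0 would contribute nothing to the composed value, while 1237 always perturbs it:

```java
// Sketch of composing field hashes (the standard 31-based recipe).
// The class and its fields are hypothetical. If Boolean hashed FALSE
// to 0, the boolean term would vanish for every false-valued object;
// with 1231/1237 the boolean field always stirs the result.
public class Account {
    private final int id;          // hypothetical fields
    private final boolean active;

    Account(int id, boolean active) {
        this.id = id;
        this.active = active;
    }

    @Override
    public int hashCode() {
        int result = 17;
        result = 31 * result + Integer.valueOf(id).hashCode();
        result = 31 * result + Boolean.valueOf(active).hashCode();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new Account(7, true).hashCode());
        System.out.println(new Account(7, false).hashCode());
    }
}
```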

--
Lew

Mark Space

 11-12-2008
public boolean wrote:

> If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807, 901,
> and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and 5.
>
> Quite a few hash collisions.

Choose 10 random numbers between 1 and 16. How many will be duplicates?

I don't see how this arbitrary list of numbers is somehow significant
in proving that the hash codes from Integer are poorly distributed.

Eric Sosman

 11-12-2008
public boolean wrote:
> Patricia Shanahan wrote:
>> Lew wrote:
>>>> On Tue, 11 Nov 2008 00:39:29 -0800, <(E-Mail Removed)> wrote:
>>>>
>>>
>>> Please include a body to your posts.

>
> He did -- he said " 3ks". Not intelligible, but it does seem to
> constitute a body.
>
>>> Peter Duniho wrote:
>>>> I definitely would not want the hash codes to be 0 and 1.
>>>
>>> Why not?

>>
>> The Integer hashCode() is the primitive int the Integer represents.

>
> Bletch!
>
> That will have a pretty poor distribution of hash values, because the
> integer values that arise in actual running code don't tend to be
> completely randomly distributed. Integers richer in factors are more
> common, as are smaller integers.

So? If you compute a hash code h(k) for each Integer,
then whenever k is a frequently-occurring Integer value, h(k)
will be a frequently-occurring hash value. At best you
can achieve a permutation of the Integer values; if not so lucky
you might choose an h() that maps the entire Integer domain into
a smaller range. In no event will any conceivable h() be an
improvement over h(k)=k.intValue(). h(k1) will collide with h(k2)
exactly as often as k1 and k2 themselves collide.
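Eric's "at best a permutation" point actually holds for the multiplicative hash proposed upthread: any odd 32-bit multiplier has an inverse mod 2^32, so k -> k * C is a bijection on int and can neither merge nor split keys over the full range. A sketch, using Newton's iteration to compute the inverse (class name made up):

```java
// Any odd constant C is invertible mod 2^32, so multiplying by it is a
// bijection on Java ints: distinct keys get distinct hashes and equal
// keys equal hashes -- full-range collision counts cannot improve.
public class OddInverse {
    // Newton's iteration x <- x * (2 - a * x) doubles the number of
    // correct low-order bits per step; x = a is already correct to
    // 3 bits (odd^2 == 1 mod 8), so 5 steps cover all 32 bits.
    static int inverse(int a) {
        int x = a;
        for (int i = 0; i < 5; i++) {
            x *= 2 - a * x;
        }
        return x;
    }

    public static void main(String[] args) {
        int c = 0x9E3779B9;                 // the golden-ratio multiplier
        System.out.println(inverse(c) * c); // prints 1: c is invertible
    }
}
```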

> Multiplying the integer's value by a predetermined prime number near
> (Integer.MAX_VALUE times 1.6180339 would have been a much better
> choice of hash function, scattering the hash values pretty randomly
> about the full range of integer values, varying a lot in each bit
> position, whatever the input distribution.

Nonsense. If the input consists of 100 zeroes, 10 ones,
and one each of 2..9, the hashes will consist of 100 h(0)s,
10 h(1)s, and one each of h(2..9). Nothing has changed,
except possibly for the worse if h(0)==h(6), say.

> [... more pseudo-numerological nonsense ...]

As shown above, an Integer hashCode() other than identity
could not improve the scattering of different Integer values
from each other. There is one conceivable case where another
hashCode() might help: if the keys of the Hash{Table,Map,Set}
are a mixture of Integers and other types[*], it is possible that
a carefully-chosen hashCode() for Integers might help separate
them from the BigInteger and String and ... keys' hashCode()s.
That sort of thing needs far too many assumptions about the key
population to warrant a general solution.
[*] Personally, I've only once found a use for such a mixed-
key table: The keys in that case were Integers, Longs, Doubles,
and Strings. I maintain that it is *not* worthwhile to try to
discover a suite of hashCode() implementations that would in
some sense "optimize" the spread of hash values; that was a corner
case, and hashCode() should cater more to the common case of a
homogeneous key population.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid

public boolean

 11-12-2008
Mark Space wrote:
> public boolean wrote:
>
>> If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807, 901,
>> and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and 5.
>>
>> Quite a few hash collisions.

>
>
> Choose 10 random numbers between 1 and 16. How many will be duplicates?
>
> I don't see how this arbitrary list of number is somehow significant in
> proving that the hashcodes from Integer are poorly distributed.

It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.
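The all-even worst case of this claim is simple to demonstrate with a naive index computation (h % n, without the bit-spreading step that HashMap applies, as discussed downthread); the class name is made up:

```java
// With the identity hash, all-even keys, and an even bucket count,
// the naive index h % n leaves every odd-numbered bucket empty and
// doubles the load on the even-numbered ones.
public class EvenBias {
    public static void main(String[] args) {
        int buckets = 16;
        int[] counts = new int[buckets];
        for (int k = 0; k < 100; k += 2) {  // even keys only
            counts[Integer.valueOf(k).hashCode() % buckets]++;
        }
        for (int b = 0; b < buckets; b++) {
            // Odd buckets print 0; even buckets carry the whole load.
            System.out.println("bucket " + b + ": " + counts[b]);
        }
    }
}
```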

public boolean

 11-12-2008
Eric Sosman wrote:
> Nonsense. If the input consists of 100 zeroes, 10 ones,
> and one each of 2..9, the hashes will consist of 100 h(0)s,
> 10 h(1)s, and one each of h(2..9). Nothing has changed,
> except possibly for the worse if h(0)==h(6), say.
>
>> [... more pseudo-numerological nonsense ...]

Your insulting attitude in response to a perfectly civil post baffles me.

I'll copy and paste my reasoning from my reply to Mark Space, who had a
similar objection but was far more polite.

It's not just the distribution of the hash values over the space of
possible integer values that counts. It's the distribution of the hash
values over the actual hash buckets produced in an actual hash table,
which typically is much smaller. You can have few collisions in a large
space and still end up with many in a small space, especially if the
small space's size is a power of two and the low-order bits of the hash
are biased. Even integers are more common than odd, so in any hash table
with integer keys using the current integer hash code and with an even
number of buckets, it's likely the odd buckets will be underutilized and
the even ones will have an unnecessary collision or several.

Lew

 11-12-2008
public boolean wrote:
> Even integers are more common than odd,

How do you figure? I see an exact 50%-50% distribution.

--
Lew

Lew

 11-12-2008
public boolean wrote:
> Eric Sosman wrote:
>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>> except possibly for the worse if h(0)==h(6), say.
>>
>>> [... more pseudo-numerological nonsense ...]

>
> Your insulting attitude in response to a perfectly civil post baffles me.

Oh, Christ, it *is* Twisted again. I was afraid of that.

Quit hiding who you are, or are you embarrassed to admit it?

Re-plonk.

--
Lew

Eric Sosman

 11-12-2008
public boolean wrote:
> Eric Sosman wrote:
>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>> except possibly for the worse if h(0)==h(6), say.
>>
>>> [... more pseudo-numerological nonsense ...]

>
> Your insulting attitude in response to a perfectly civil post baffles me.

I apologize for the tone of my response, and wish I had
written less antagonistically. However, I stand by the
content.

> I'll copy and paste my reasoning from my reply to Mark Space, who had a
> similar objection but was far more polite.
>
> It's not just the distribution of the hash values over the space of
> possible integer values that counts. It's the distribution of the hash
> values over the actual hash buckets produced in an actual hash table,
> which typically is much smaller. You can have few collisions in a large
> space and still end up with many in a small space, especially if the
> small space's size is a power of two and the low-order bits of the hash
> are biased. Even integers are more common than odd, so in any hash table
> with integer keys using the current integer hash code and with an even
> number of buckets, it's likely the odd buckets will be underutilized and
> the even ones will have an unnecessary collision or several.

"Even integers are more common than odd" is a claim that
fails to convince me. Have you instrumented a suite of programs
to count the abundance of even and odd Integer keys in maps?
Or can you provide a link to such a result?

But even if it were true, I do not see how it could matter.
Have you studied the implementation of java.util.HashMap? You
will find that the first thing it does with a hashCode() value
is to perturb it as follows:

    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

That is, bit 0 of the effective hash value depends on all of
bits 0, 4, 7, 12, 16, 19, 20, 24, and 27 of the hashCode().
The choice of even-numbered or odd-numbered bucket depends
on nine bits of the hashCode(), not on bit 0 alone.

The older java.util.Hashtable does not stir the bits this
way -- but then, Hashtable does not use power-of-two bucket
counts as HashMap does. (You can specify a power of two as
the initial Hashtable capacity, if you like, but after any
expansion the bucket count will be odd, not even.)

Again, I apologize for the tone of my earlier message.
But I still think your arguments are wide of the mark.

--
Eric Sosman
(E-Mail Removed)lid

Mark Space

 11-12-2008
public boolean wrote:

> It's not just the distribution of the hash values over the space of
> possible integer values that counts. It's the distribution of the hash
> values over the actual hash buckets produced in an actual hash table,
> which typically is much smaller. You can have few collisions in a large
> space and still end up with many in a small space, especially if the
> small space's size is a power of two and the low-order bits of the hash
> are biased. Even integers are more common than odd, so in any hash table
> with integer keys using the current integer hash code and with an even
> number of buckets, it's likely the odd buckets will be underutilized and
> the even ones will have an unnecessary collision or several.

I just want to point out that Java's hash map doesn't use a straight
mapping between object.hashCode() and the hash index values. Here's the
correct algorithm, for a hash table size of 16:

    static int[] testHash2 = {0, 2, 4, 6, 8, 10, 12,
                              14, 16, 18, 20, 22, 24};

    ArrayList<Integer> hashResults = new ArrayList<Integer>();
    final int LENGTH = 16;
    for (int i : testHash2) {
        int h = new Integer(i).hashCode();
        System.out.print(Integer.toHexString(h) + ", ");
        h ^= (h >>> 20) ^ (h >>> 12);
        h = h ^ (h >>> 7) ^ (h >>> 4);
        hashResults.add(h & (LENGTH - 1));
    }
    System.out.println();
    System.out.println(hashResults);

Output:

0, 2, 4, 6, 8, a, c, e, 10, 12, 14, 16, 18,
[0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9]