
# why in class Boolean, hashcode() of "true" is 1231 and of "false" is 1237?

Lasse Reichstein Nielsen

 11-12-2008
Mark Space writes:

> public boolean wrote:
>
>> If the occupied positions are 14, 29, 33, 57, 193, 385, 772, 807,
>> 901, and 917, you get these hashes: 14, 13, 1, 9, 1, 1, 4, 7, 5, and
>> 5.
>> Quite a few hash collisions.

>
>
> Choose 10 random numbers between 1 and 16. How many will be duplicates?

On average it's approx. 2.4 collisions. Here we get three, which has a ~30%
chance of happening.
I.e., this really is nothing out of the ordinary.

/L
--
Lasse Reichstein Holst Nielsen
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Mark Space

 11-12-2008
Lasse Reichstein Nielsen wrote:

> On average it's approx. 2.4 collisions. Here we get three, which has a ~30%
> chance of happening.
> I.e., this really is nothing out of the ordinary.

That's exactly what I was thinking, just too lazy to actually math it out.

public boolean

 11-12-2008
Lew wrote:
> public boolean wrote:
>> Even integers are more common than odd,

>
> How do you figure? I see an exact 50%-50% distribution.

In actual practice rather than in theory. (Surely you don't believe that
in actual production code all of the integers from -2147483648 to
2147483647, inclusive, occur exactly as frequently as one another?)

If you look at the integers that pop up in actual usage, you'll find
that smaller integers are more common than larger ones; even ones more
common than odd; generally, ones with many factors are more common; ones
with larger powers of two as factors are more common; primes other than
2, 3, and 5 are relatively uncommon; and large primes are rare.

Loops from 0 to N-1 will be partly responsible for the small-integer
bias but partly masking the factor-based biases, since they'll hit
everything from 0 to N-1; in particular, 0 to 2N-1 has equal numbers of
odd and even integers and 0 to 2N has one extra even integer so odd/even
will be almost in balance in such loops.
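The arithmetic in that last sentence is easy to check with a quick sketch (the class and method names here are mine, not from the thread):

```java
public class ParityBalance {
    // Count the even integers in the inclusive range [0, max].
    static int countEvens(int max) {
        int evens = 0;
        for (int i = 0; i <= max; i++) {
            if (i % 2 == 0) evens++;
        }
        return evens;
    }

    public static void main(String[] args) {
        int n = 10;
        // 0 .. 2N-1 holds N evens and N odds -- perfectly balanced.
        System.out.println(countEvens(2 * n - 1)); // prints 10
        // 0 .. 2N holds one extra even integer (2N itself).
        System.out.println(countEvens(2 * n));     // prints 11
    }
}
```

So a loop bound of 2N keeps the parities within one of each other, which is why loop indices mask rather than create the odd/even bias.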

Since such loops don't contribute much to the population of hash keys
(when you're looping over the whole range of indices you use an array,
and when you need a sparse array you iterate with map.entrySet() rather
than looking up each index), the bias in actual keys will be greater
than the bias in used integers in general -- except for the bias
towards smaller integers, which will still be present either way.

public boolean

 11-12-2008
Mark Space wrote:
> I just want to point out that Java's hash map doesn't use a straight
> mapping between object.hashCode() and the hash index values.

Interesting. That improves the situation somewhat for bit-correlated
input hashes, at a performance cost whenever a hash is added or looked
up instead of only when the hash is generated. (The hash of a particular
Object can be, and often is, calculated once and then stored. The
identity hash definitely is, as is the String hash. Therefore the hash
may be generated less often than it is used. The hash may also be used
other than in a hash table -- Object's toString seems to use it in
combination with the class name, for instance. Therefore the hash may be
generated more often as well. In practice, however, it is probably
generated less often on average.)

public boolean

 11-12-2008
Lasse Reichstein Nielsen wrote:
> On average it's approx. 2.4 collisions. Here we get three, which has a ~30%
> chance of happening.
> I.e., this really is nothing out of the ordinary.

The example might have been better chosen. The basic point remains --
the less correlated or clustered the hash values, the better, as a
general rule.

public boolean

 11-12-2008
Lew wrote:
> public boolean wrote:
>> Eric Sosman wrote:
>>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>>> except possibly for the worse if h(0)==h(6), say.
>>>
>>>> [... more pseudo-numerological nonsense ...]

>>
>> Your insulting attitude in response to a perfectly civil post baffles me.

>
> Oh, Christ, it *is* Twisted again. I was afraid of that.
>
> Quit hiding who you are, or are you embarrassed to admit it?
>
> Re-plonk.

???

Can anyone else make any sense of this post?

public boolean

 11-12-2008
Eric Sosman wrote:
> public boolean wrote:
>> Eric Sosman wrote:
>>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>>> except possibly for the worse if h(0)==h(6), say.
>>>
>>>> [... more pseudo-numerological nonsense ...]

>>
>> Your insulting attitude in response to a perfectly civil post baffles me.

>
> I apologize for the tone of my response, and wish I had
> written less antagonistically. However, I stand by the
> content.

Including your insulting assertion that something I'd written was
"nonsense"? Or just the more technical and meaningful bits?

If the latter, apology accepted.

FWIW, you're not the only one that's written antagonistically. Lew has
written several more antagonistic posts here than you have, and one
completely baffling one that seems to have nothing whatsoever to do with
the post it replies to, or with Java, or with the price of tea in China
for that matter.

>> It's not just the distribution of the hash values over the space of
>> possible integer values that counts. It's the distribution of the hash
>> values over the actual hash buckets produced in an actual hash table,
>> which typically is much smaller. You can have few collisions in a
>> large space and still end up with many in a small space, especially if
>> the small space's size is a power of two and the low-order bits of the
>> hash are biased. Even integers are more common than odd, so in any
>> hash table with integer keys using the current integer hash code and
>> with an even number of buckets, it's likely the odd buckets will be
>> underutilized and the even ones will have an unnecessary collision or
>> several.

>
> "Even integers are more common than odd" is a claim that
> fails to convince me.

Fine. I stand by it nonetheless.

> But even if it were true, I do not see how it could matter.
> Have you studied the implementation of java.util.HashMap?

No. It might even be a Sun trade secret, for all that I know about it.

> static int hash(int h) {
>     h ^= (h >>> 20) ^ (h >>> 12);
>     return h ^ (h >>> 7) ^ (h >>> 4);
> }

Watch it -- Sun might sue.

This might help, or just waste time, depending. If there are bit-wise
correlations between two hash values, they will remain in the output of
this in some transformed and subtler form. It looks like each output bit
depends on nine of the input bits, including at least three of the
highest several bits. Unfortunately that won't wipe out all correlation.
A low bit usually being clear (for example, in a population of mostly
even numbers) could affect the probabilities of several bits having a
particular value. Use of xor means one completely uncorrelated bit in
the input is enough to make the output bits it affects 50/50, at least,
but not enough to assure that they're uncorrelated with one another in
some way induced by the input's bit-biases.

In the case of Integer keys, most of the high bits will often be zero,
due to the small-number bias, which reduces the effectiveness of the
above further, since xor by zero is a no-op.
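That degeneration is easy to demonstrate by running the spreader quoted above over small inputs: for h < 16 every one of the shifts produces zero, so the function is the identity there. (The class name HashSpread is mine.)

```java
public class HashSpread {
    // The supplemental hash function quoted from HashMap above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        // For h < 16, h >>> 4 and every larger shift are 0,
        // so hash(h) == h and any low-bit bias in the input
        // survives completely untouched.
        for (int h = 0; h < 16; h++) {
            System.out.println(h + " -> " + hash(h));
        }
    }
}
```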

Multiplying by a prime number in the range suggested would work better
and might not be any slower on modern hardware.
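A minimal sketch of the multiplicative alternative being suggested -- the thread names no specific constant, so the prime 0x9E3779B1 (2654435761, Knuth's multiplier, close to 2^32 divided by the golden ratio) is my choice, as are the class and method names:

```java
public class MultiplicativeHash {
    // Multiplicative hashing: multiply by a large odd prime and
    // take the HIGH bits of the 32-bit product, where every input
    // bit has been mixed in. Java int multiplication wraps mod 2^32.
    static int bucket(int h, int bits) {
        return (h * 0x9E3779B1) >>> (32 - bits);
    }

    public static void main(String[] args) {
        // Even keys 2, 4, 6, 8 in a 16-bucket table:
        // 2 -> 3, 4 -> 7, 6 -> 11, 8 -> 15 -- four distinct buckets,
        // not piled up on the even-numbered ones.
        for (int h = 2; h <= 8; h += 2) {
            System.out.println(h + " -> bucket " + bucket(h, 4));
        }
    }
}
```

Taking the high bits rather than the low ones is what defeats the all-keys-are-even problem: an even key only forces the product's bit 0, which the bucket index never looks at.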

> That is, bit 0 of the effective hash value depends on all of
> bits 0, 4, 7, 12, 16, 19, 20, 24, and 27 of the hashCode().
> The choice of even-numbered or odd-numbered bucket depends
> on nine bits of the hashCode(), not on bit 0 alone.

An improvement in that one particular case, yes. How does having most of
the higher bits be zero affect that? In the case of keys from 0 to 999,
that means bits 10 and up, leaving of the above only 0, 4, and 7.

> Again, I apologize for the tone of my earlier message.
> But I still think your arguments are wide of the mark.

I saw what could be a weakness in a particular hash function (one that
doesn't exist if you assume all Integer values will occur equally often,
mind you). I pointed it out. It may be somewhat mitigated by what you've
revealed here. It could be eliminated entirely, particularly the
small-number bias, as indicated previously. (The hash method you posted
above could do it, instead of the Integer hash, affecting every hash
used in HashMap. Though my preference would be for HashMap to be fast
and individual objects' hash calculations do the shuffling. Actually,
since the identities of objects used as keys should not change, there
really should have been some notion of immutable objects in Java from
the beginning, and equals and hashCode only applicable to those. Then
hashCode could have been coded to call a protected method to calculate
the actual hash when it's first used, mangle the result, and cache it.
The same bits discussed recently here as storing the identity hash could
have been used to store the object hash in general -- assuming identity
versus equals never matters for immutable objects, anyway -- and other
objects would use the identity hash and equals always. Indeed equals
could go away then and == be applied differently to immutable objects in
that case -- if identity-equals, true, otherwise equals() rather than if
identity-equals, true, otherwise false. But I suppose it's way, way too
late to make changes that drastic to Java!)
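A rough sketch of the compute-once, mangle, and cache scheme described in that parenthetical (the class and member names are mine, and this is not how java.lang.Object actually behaves -- it is only an illustration of the proposal):

```java
// Subclasses supply a protected computeHash() over their immutable
// state; the public hashCode() mangles the result and caches it on
// first use, so the spreading cost is paid once per object.
public abstract class CachedHash {
    private int cachedHash;
    private boolean hashComputed;

    /** Computes the raw hash from immutable state. Called once. */
    protected abstract int computeHash();

    @Override
    public final int hashCode() {
        if (!hashComputed) {
            int h = computeHash();
            // "Mangle" the raw value (same shifts as the quoted
            // HashMap spreader) so correlated inputs spread out.
            h ^= (h >>> 20) ^ (h >>> 12);
            h ^= (h >>> 7) ^ (h >>> 4);
            cachedHash = h;
            hashComputed = true;
        }
        return cachedHash;
    }
}
```

A concrete subclass only implements computeHash(); every later call to hashCode() returns the cached value without recomputing. (A race between threads is benign here, the way String's hash cache is benign: both threads compute the same value.)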

Arne Vajhøj

 11-13-2008
Lew wrote:
> public boolean wrote:
>> Eric Sosman wrote:
>>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>>> except possibly for the worse if h(0)==h(6), say.
>>>
>>>> [... more pseudo-numerological nonsense ...]

>>
>> Your insulting attitude in response to a perfectly civil post baffles me.

>
> Oh, Christ, it *is* Twisted again. I was afraid of that.
>
> Quit hiding who you are, or are you embarrassed to admit it?
>
> Re-plonk.

You will need those GB for the kill file - he creates a new
identity almost every week.

Arne

Arne Vajhøj

 11-13-2008
public boolean wrote:
> Lew wrote:
>> public boolean wrote:
>>> Eric Sosman wrote:
>>>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>>>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>>>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>>>> except possibly for the worse if h(0)==h(6), say.
>>>>
>>>>> [... more pseudo-numerological nonsense ...]
>>>
>>> Your insulting attitude in response to a perfectly civil post baffles
>>> me.

>>
>> Oh, Christ, it *is* Twisted again. I was afraid of that.
>>
>> Quit hiding who you are, or are you embarrassed to admit it?
>>
>> Re-plonk.

>
> ???
>
> Can anyone else make any sense of this post?

Paul: Anyone who has read this group for a few weeks can.

Arne

Eric Sosman

 11-13-2008
public boolean wrote:
> Eric Sosman wrote:
>> public boolean wrote:
>>> Eric Sosman wrote:
>>>> Nonsense. If the input consists of 100 zeroes, 10 ones,
>>>> and one each of 2..9, the hashes will consist of 100 h(0)s,
>>>> 10 h(1)s, and one each of h(2..9). Nothing has changed,
>>>> except possibly for the worse if h(0)==h(6), say.
>>>>
>>>>> [... more pseudo-numerological nonsense ...]
>>>
>>> Your insulting attitude in response to a perfectly civil post baffles
>>> me.

>>
>> I apologize for the tone of my response, and wish I had
>> written less antagonistically. However, I stand by the
>> content.

>
> Including your insulting assertion that something I'd written was
> "nonsense"? Or just the more technical and meaningful bits?

I apologized, and still apologize, for the tone, as I said.
And I stand by the content: Your arguments are nonsense, founded
on the numerological superstitions that (for reasons that have
never been clear to me) surround hash structures.

>>> Even integers are more common
>>> than odd, [...]

>>
>> "Even integers are more common than odd" is a claim that
>> fails to convince me.

>
> Fine. I stand by it nonetheless.

Evidence? We don' need no steenkin' evidence!

>> But even if it were true, I do not see how it could matter.
>> Have you studied the implementation of java.util.HashMap?

>
> No. It might even be a Sun trade secret, for all that I know about it.

You clearly know very little (not an insult, but a fact
subject to objective verification). Have you ever installed
Sun's JDK? Have you ever looked at the ZIP of Java source files
provided as part of that JDK?

>> static int hash(int h) {
>>     h ^= (h >>> 20) ^ (h >>> 12);
>>     return h ^ (h >>> 7) ^ (h >>> 4);
>> }

>
> Watch it -- Sun might sue.

What's funny? Either Sun will sue, in which case there's
nothing funny about it, or you're just being silly, in which
case ditto. (Yes, the tone of this paragraph is antagonistic,
but IMHO you are begging for it.)

> This might help, or just waste time, depending. If there are bit-wise
> correlations between two hash values, they will remain in the output of
> this in some transformed and subtler form. It looks like each output bit
> depends on nine of the input bits, including at least three of the
> highest several bits. Unfortunately that won't wipe out all correlation.
> A low bit usually being clear (for example, in a population of mostly
> even numbers) could affect the probabilities of several bits having a
> particular value. Use of xor means one completely uncorrelated bit in
> the input is enough to make the output bits it affects 50/50, at least,
> but not enough to assure that they're uncorrelated with one another in
> some way induced by the input's bit-biases.

*No* function of non-random data can eliminate the non-
randomness. No, not even the multiplication by phi that catches
your fancy. If the input values are non-random, any function of
those values will also be non-random, no matter how you choose to
scramble them. This is not news.

> In the case of Integer keys, most of the high bits will often be zero,
> due to the small-number bias, which reduces the effectiveness of the
> above further, since xor by zero is a no-op.
>
> Multiplying by a prime number in the range suggested would work better
> and might not be any slower on modern hardware.

"Would work better?" Sort of depends on (1) the internals
of the hash implementation, of which you profess ignorance, and
(2) the distribution of the inputs, for which you offer only
unsupported claims that you "stand by, nonetheless." Evidence?
We don' need no steenkin' evidence!

>> That is, bit 0 of the effective hash value depends on all of
>> bits 0, 4, 7, 12, 16, 19, 20, 24, and 27 of the hashCode().
>> The choice of even-numbered or odd-numbered bucket depends
>> on nine bits of the hashCode(), not on bit 0 alone.

>
> An improvement in that one particular case, yes. How does having most of
> the higher bits be zero affect that? In the case of keys from 0 to 999,
> that means bits 10 and up, leaving of the above only 0, 4, and 7.

Even so, bit 0 still depends on three input bits, so the biases
you describe (such as the prevalence of even numbers) are diminished
by a factor of four at the least, right? And what's this "one
particular case" you refer to? Which integer value is the one you
worry about? 42?

>> Again, I apologize for the tone of my earlier message.
>> But I still think your arguments are wide of the mark.

>
> I saw what could be a weakness in a particular hash function (one that
> doesn't exist if you assume all Integer values will occur equally often,
> mind you). I pointed it out. It may be somewhat mitigated by what you've
> revealed here. It could be eliminated entirely, particularly the
> small-number bias, as indicated previously. (The hash method you posted
> above could do it, instead of the Integer hash, affecting every hash
> used in HashMap. Though my preference would be for HashMap to be fast
> and individual objects' hash calculations do the shuffling. Actually,
> since the identities of objects used as keys should not change, there
> really should have been some notion of immutable objects in Java from
> the beginning, and equals and hashCode only applicable to those. Then
> hashCode could have been coded to call a protected method to calculate
> the actual hash when it's first used, mangle the result, and cache it.
> The same bits discussed recently here as storing the identity hash could
> have been used to store the object hash in general -- assuming identity
> versus equals never matters for immutable objects, anyway -- and other
> objects would use the identity hash and equals always. Indeed equals
> could go away then and == be applied differently to immutable objects in
> that case -- if identity-equals, true, otherwise equals() rather than if
> identity-equals, true, otherwise false. But I suppose it's way, way too
> late to make changes that drastic to Java!)

I'm sorry, but I'm unable to extract sense from this
paragraph. You seem to be saying that List and StringBuilder
and Dimension should not have .equals() methods -- and although
I disagree with that assertion, it's a debate for some other
thread and has no obvious bearing on the matter of Integer's
hashCode(). Perhaps when I'm older and wiser I'll understand
you, but for the moment the logic of your contention eludes me.

And yes, I'm now being antagonistic. And I'm no longer
sorry for it.

--
Eric Sosman