Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > performance of HashSet with strings and integers

Reply
Thread Tools

performance of HashSet with strings and integers

 
 
Frederik
Guest
Posts: n/a
 
      10-07-2009
Hi,

I thought of replacing strings with integers in some code I wrote,
because I believed it would benefit performance. But before doing so,
I made a small test class:

public class testStringSetVsIntSet {
public static void main(String[] args) {
long time;
boolean b;
Set set;

Integer I1 = new Integer(100), I2 = new Integer(500);

set = new HashSet();
set.add(I1);
set.add(900);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(I1);
b = set.contains(I2);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 1: " + time);

String headStr = "Head";
String agentStr = "Agent";
String qualifStr = "Qualif";

set = new HashSet();
set.add(headStr);
set.add(agentStr);

time = System.currentTimeMillis();
for (int i=0; i<50000000; i++) {
b = set.contains(headStr);
b = set.contains(qualifStr);
}

time = System.currentTimeMillis() - time;
System.out.println("Time 2: " + time);
}
}

But to my surprise, the second loop with the strings appeared to be
twice as fast as the first one with the integers! (first loop 3
seconds, second 1.5 seconds)
I didn't expect this because calculating the hashcode for a string is
normally slower than for an integer (every string character is taken
into account) and I thought the "equals" method for a string should be
slower than for an Integer as well.
Can anybody explain this to me?
 
Reply With Quote
 
 
 
 
Mike Schilling
Guest
Posts: n/a
 
      10-07-2009
Frederik wrote:
> Hi,
>
> I thought of replacing strings with integers in some code I wrote,
> because I believed it would benefit performance. But before doing
> so,
> I made a small test class:
>
> public class testStringSetVsIntSet {
> public static void main(String[] args) {
> long time;
> boolean b;
> Set set;
>
> Integer I1 = new Integer(100), I2 = new Integer(500);
>
> set = new HashSet();
> set.add(I1);
> set.add(900);
>
> time = System.currentTimeMillis();
> for (int i=0; i<50000000; i++) {
> b = set.contains(I1);
> b = set.contains(I2);
> }
>
> time = System.currentTimeMillis() - time;
> System.out.println("Time 1: " + time);
>
> String headStr = "Head";
> String agentStr = "Agent";
> String qualifStr = "Qualif";
>
> set = new HashSet();
> set.add(headStr);
> set.add(agentStr);
>
> time = System.currentTimeMillis();
> for (int i=0; i<50000000; i++) {
> b = set.contains(headStr);
> b = set.contains(qualifStr);
> }
>
> time = System.currentTimeMillis() - time;
> System.out.println("Time 2: " + time);
> }
> }
>
> But to my surprise, the second loop with the strings appeared to be
> twice as fast as the first one with the integers! (first loop 3
> seconds, second 1.5 seconds)
> I didn't expect this because calculating the hashcode for a string
> is
> normally slower than for an integer (every string character is taken
> into account) and I thought the "equals" method for a string should
> be
> slower than for an Integer as well.
> Can anybody explain this to me?


1. Since Strings are immutable, their hash code is calculated only
once and then stored in the String object.
2. Since your strings are of different lengths, they're discovered to
be unequal almost immediately (the second thing equals() checks is the
number of characters).
3. When contains() returns true, its parameter is nott just equal to
the one in the set but the same object; thus it's discovered to be
equal even faster (since the first thing equals() checks is "this ==
param").
4.This one is just a guess, but I wouldn't be surprised if the Integer
version became faster if you reversed the order. In general, Java
gets faster as it goes and the JIT can optimize the code.


 
Reply With Quote
 
 
 
 
Pascal Lecointe
Guest
Posts: n/a
 
      10-07-2009
On 7 oct, 16:06, Patricia Shanahan <(E-Mail Removed)> wrote:
>
> The equals method only gets called if the HashSet contains an element
> whose hash code is equal to the probe's hash code but that is not the
> same object as the probe. That is very unlikely in a test with so few
> objects.


Not really. If I remember correctly, the equals is always called
(because same
hash not imply same object). But in this case, the equals first test
that the
references are the same. As the Strings of the example are all in the
constant pool,
the references are the same, and so the equals test is fast.

>
> I strongly suspect that your real use of the HashSet is significantly
> different from your benchmark. The large number of contains calls with
> the same key, the small number of distinct strings, and the lack of any
> case in which you have two distinct but equal objects may all affect the
> results.
>


+1

 
Reply With Quote
 
Mike Schilling
Guest
Posts: n/a
 
      10-07-2009
Patricia Shanahan wrote:
>
> The equals method only gets called if the HashSet contains an
> element
> whose hash code is equal to the probe's hash code but that is not
> the
> same object as the probe. That is very unlikely in a test with so
> few
> objects.


Does HashSet check == rather than letting the equals() method do it?
So it does. You learn something new every day. Though in the case,
as I mentioned earlier, String.equals() would be as fast as
Integer.equals().


 
Reply With Quote
 
Frederik
Guest
Posts: n/a
 
      10-07-2009
> 4.This one is just a guess, but I wouldn't be surprised if the Integer
> version became faster if you reversed the order. *In general, Java
> gets faster as it goes and the JIT can optimize the code.


I tried that, but that is not the case, the version with the strings
remains twice as fast.

Thank you both for your help.
 
Reply With Quote
 
Mike Schilling
Guest
Posts: n/a
 
      10-07-2009
Frederik wrote:
>> 4.This one is just a guess, but I wouldn't be surprised if the
>> Integer version became faster if you reversed the order. In
>> general,
>> Java gets faster as it goes and the JIT can optimize the code.

>
> I tried that, but that is not the case, the version with the strings
> remains twice as fast.


In that case, I'd guess you picked numbers that hash to the same
bucket. Yup. The initial capacity of a HashSet is 16, and all the
numbers equal 0 mod 16.


 
Reply With Quote
 
Lew
Guest
Posts: n/a
 
      10-07-2009
Patricia Shanahan wrote:
>> The equals method only gets called if the HashSet contains an element
>> whose hash code is equal to the probe's hash code but that is not the
>> same object as the probe. That is very unlikely in a test with so few
>> objects.

>


Pascal Lecointe erroneously claimed:
> Not really. If I remember correctly, the equals is always called


You don't remember correctly.

> (because same hash not imply same object).


No, but different hash codes do imply different objects. Note that
Patricia pointed out that 'equals()' is only called if the hash codes
are the same. It is not called if the hash codes differ, because in
that case 'equals()' would always return 'false', so there's no point.

> But in this case, the equals first test
> that the
> references are the same. As the Strings of the example are all in the
> constant pool,
> the references are the same, and so the equals test is fast.
>


Really, really fast in the OP's example, because 'equals()' is not
called.

It has nothing to do with the Strings being in the constant pool.

--
Lew

 
Reply With Quote
 
Lew
Guest
Posts: n/a
 
      10-07-2009
On Oct 7, 10:59*am, Patricia Shanahan <(E-Mail Removed)> wrote:
> Pascal Lecointe wrote:
> > On 7 oct, 16:06, Patricia Shanahan <(E-Mail Removed)> wrote:
> >> The equals method only gets called if the HashSet contains an element
> >> whose hash code is equal to the probe's hash code but that is not the
> >> same object as the probe. That is very unlikely in a test with so few
> >> objects.

>
> > Not really. If I remember correctly, the equals is always called
> > (because same
> > hash not imply same object). But in this case, the equals first test
> > that the
> > references are the same. As the Strings of the example are all in the
> > constant pool,
> > the references are the same, and so the equals test is fast.

>
> The match test in HashMap's getEntry is such that either hash code
> mismatch or reference equality prevents execution of the equals call.
> The implementation of the HashSet contains uses the HashMap containsKey,
> which uses getEntry.
>


Note also that typical implementations of 'equals()' that do not do
mere reference equality do check reference equality first, and only
compare attributes if the references differ. Of course, there's no
guarantee that any particular 'equals()' implementation does this. In
the case of the logic just described, it may well be that reference
equality is checked twice - once by the 'Map' and once by the element
- before value equality is finally invoked.

To review:

HashMap equality checks:
hashCode(): if unequal, no need to continue
reference == : if 'true', no need to continue
value equality: return result

--
Lew
 
Reply With Quote
 
Roedy Green
Guest
Posts: n/a
 
      10-15-2009
On Wed, 7 Oct 2009 06:48:14 -0700 (PDT), Frederik
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone who
said :

>Set set;
>Integer I1 = new Integer(100), I2 = new Integer(500);
>set = new HashSet();


Aside from the main thrust of your question, you should code this with
generics and autoboxing like this:

final Set<Integer> set = new HashSet<Integer>(7);
// give it a size estimate.
Integer i1 = 100; // follow caps naming conventions
Integer i2 = 500; // use autoboxing for brevity
--
Roedy Green Canadian Mind Products
http://mindprod.com

I advocate that super programmers who can juggle vastly more complex balls than average guys can, should be banned, by management, from dragging the average crowd into system complexity zones where the whole team will start to drown.
~ Jan V.
 
Reply With Quote
 
Kevin McMurtrie
Guest
Posts: n/a
 
      10-15-2009
In article
<(E-Mail Removed)>,
Frederik <(E-Mail Removed)> wrote:

> Hi,
>
> I thought of replacing strings with integers in some code I wrote,
> because I believed it would benefit performance. But before doing so,
> I made a small test class:
>
> public class testStringSetVsIntSet {
> public static void main(String[] args) {
> long time;
> boolean b;
> Set set;
>
> Integer I1 = new Integer(100), I2 = new Integer(500);
>
> set = new HashSet();
> set.add(I1);
> set.add(900);
>
> time = System.currentTimeMillis();
> for (int i=0; i<50000000; i++) {
> b = set.contains(I1);
> b = set.contains(I2);
> }
>
> time = System.currentTimeMillis() - time;
> System.out.println("Time 1: " + time);
>
> String headStr = "Head";
> String agentStr = "Agent";
> String qualifStr = "Qualif";
>
> set = new HashSet();
> set.add(headStr);
> set.add(agentStr);
>
> time = System.currentTimeMillis();
> for (int i=0; i<50000000; i++) {
> b = set.contains(headStr);
> b = set.contains(qualifStr);
> }
>
> time = System.currentTimeMillis() - time;
> System.out.println("Time 2: " + time);
> }
> }
>
> But to my surprise, the second loop with the strings appeared to be
> twice as fast as the first one with the integers! (first loop 3
> seconds, second 1.5 seconds)
> I didn't expect this because calculating the hashcode for a string is
> normally slower than for an integer (every string character is taken
> into account) and I thought the "equals" method for a string should be
> slower than for an Integer as well.
> Can anybody explain this to me?


A few things:

- Strings cache their hash value in recent JVMs.
- The performance of hashing varies for individual keys based on
collisions and locations within the collision chain.
- All of your map hits can use the fast '==' test rather than equals(),
making collisions an unusually large factor of performance.
- Hotspot was compiling during loop 1. Call the test method twice and
the second results will probably be different.

nitpick: Integer.valueOf(n) and autoboxing returns pooled objects for a
range of values so it's sometimes much more efficient than the Integer
constructor. It's not in the loop so it doesn't matter here.
--
I won't see Goolge Groups replies because I must filter them as spam
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
TreeSet and HashSet Marcin Java 18 02-06-2007 10:39 PM
HashSet and TreeSet Ye Dafeng Java 4 11-16-2006 03:00 AM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
HashSet performance zero Java 14 11-14-2005 01:12 AM
Re: HashSet doesn't work correctly!!?? La'ie Techie Java 0 09-26-2003 01:22 AM



Advertisments