Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > HashSet keeps all nonidentical equal objects in memory

Reply
Thread Tools

HashSet keeps all nonidentical equal objects in memory

 
 
Frederik
Guest
Posts: n/a
 
      07-20-2011
Hi,

I've been doing java programming for over 10 years, but now I've
encoutered a phenomenon that I wasn't aware of at all.
I had an application in which I have a HashSet<String>. I added a lot
of different String objects to this HashSet, but many of the String
objects are equal to each other. Now, after a while my application ran
out of memory, even with -Xmx1500M. This happened when there were only
about 7000 different Strings in the set! I didn't understand this,
until I started adding the "intern()" of every String object to the
set instead of the original String object. Now the program needs
virtually no memory anymore.
There is only one explanation: before I used "intern()", ALL the
different String objects, even the ones that are equal, were kept in
memory by the HashSet! No matter how strange it sounds. I was
wondering, does anybody have an explanation as to why this is the case?
 
Reply With Quote
 
 
 
 
Frederik
Guest
Posts: n/a
 
      07-20-2011
On Jul 20, 11:43*am, Frederik <(E-Mail Removed)> wrote:
> Hi,
>
> I've been doing java programming for over 10 years, but now I've
> encoutered a phenomenon that I wasn't aware of at all.
> I had an application in which I have a HashSet<String>. I added a lot
> of different String objects to this HashSet, but many of the String
> objects are equal to each other. Now, after a while my application ran
> out of memory, even with -Xmx1500M. This happened when there were only
> about 7000 different Strings in the set! I didn't understand this,
> until I started adding the "intern()" of every String object to the
> set instead of the original String object. Now the program needs
> virtually no memory anymore.
> There is only one explanation: before I used "intern()", ALL the
> different String objects, even the ones that are equal, were kept in
> memory by the HashSet! No matter how strange it sounds. I was
> wondering, does anybody have an explanation as to why this is the case?


I did another experiment in which I generated a large file with the
same string on every line, and then wrote a program that reads in the
file and adds every line String to a HashSet. With that, I cannot
reproduce the problem. So in my other code I must be doing something
very specific which triggers this problem. I don't know what it is,
but never mind.
 
Reply With Quote
 
 
 
 
Eric Sosman
Guest
Posts: n/a
 
      07-20-2011
On 7/20/2011 5:43 AM, Frederik wrote:
> Hi,
>
> I've been doing java programming for over 10 years, but now I've
> encoutered a phenomenon that I wasn't aware of at all.
> I had an application in which I have a HashSet<String>. I added a lot
> of different String objects to this HashSet, but many of the String
> objects are equal to each other. Now, after a while my application ran
> out of memory, even with -Xmx1500M. This happened when there were only
> about 7000 different Strings in the set! I didn't understand this,
> until I started adding the "intern()" of every String object to the
> set instead of the original String object. Now the program needs
> virtually no memory anymore.
> There is only one explanation: before I used "intern()", ALL the
> different String objects, even the ones that are equal, were kept in
> memory by the HashSet! No matter how strange it sounds. I was
> wondering, does anybody have an explanation as to why this is the case?


I'm unable to reproduce your problem (see test program below).
Perhaps you've overlooked another possible explanation: Before you
switched to using intern(), maybe you were retaining your own
references to all those Strings accidentally.

Here's my test program: It inserts twenty thousand distinct but
identical Strings into a HashSet, pausing every now and then to
report how much memory is used (with some heavy-handed attempts to
force garbage collection):

package esosman.misc;
import java.util.HashSet;

public class HashSpace {

public static void main(String[] unused) {
HashSet<String> set = new HashSet<String>();
String value = "x";
for (int n = 0; n < 20; ++n) {
report(n * 1000);
for (int i = 0; i < 1000; ++i) {
value = (value + "x").substring(1);
set.add(value);
}
}
report(20 * 1000);
}

private static void report(int insertions) {
long memUsed = runtime.totalMemory() - runtime.freeMemory();
long memPrev = Long.MAX_VALUE;
for (int gc = 0; (memUsed < memPrev) && gc < 5; ++gc) {
runtime.runFinalization();
runtime.gc();
Thread.yield();
memPrev = memUsed;
memUsed = runtime.totalMemory() - runtime.freeMemory();
}
System.out.printf("After %d insertions, memory used = %d\n",
insertions, memUsed);
}

private static final Runtime runtime = Runtime.getRuntime();
}

.... and here's what I get for output:

After 0 insertions, memory used = 125656
After 1000 insertions, memory used = 133272
After 2000 insertions, memory used = 133664
After 3000 insertions, memory used = 133272
After 4000 insertions, memory used = 133312
After 5000 insertions, memory used = 133272
After 6000 insertions, memory used = 133312
After 7000 insertions, memory used = 133272
After 8000 insertions, memory used = 133312
After 9000 insertions, memory used = 133272
After 10000 insertions, memory used = 133312
After 11000 insertions, memory used = 133272
After 12000 insertions, memory used = 133312
After 13000 insertions, memory used = 133448
After 14000 insertions, memory used = 133840
After 15000 insertions, memory used = 133448
After 16000 insertions, memory used = 133488
After 17000 insertions, memory used = 133272
After 18000 insertions, memory used = 133312
After 19000 insertions, memory used = 133272
After 20000 insertions, memory used = 133312

I see no evidence that all those String instances are being
retained anywhere: They need ~24 bytes apiece, which would come
to about half a megabyte.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)d
 
Reply With Quote
 
markspace
Guest
Posts: n/a
 
      07-20-2011
On 7/20/2011 4:09 AM, Frederik wrote:
> With that, I cannot
> reproduce the problem. So in my other code I must be doing something
> very specific which triggers this problem. I don't know what it is,
> but never mind.



Ayup. As usual it's in the code we are not shown. Good job working out
the answer yourself though.

Hmm, looking at your problem description again, can you verify that when
the problem occurs, there are much more than 7000 strings in the set?
I.e., the set is really retaining all of the objects, or is there a
memory leak somewhere else?

It's possible to make strings such that they hold memory and cannot
release it. Are you calling substring() a lot to create the strings?
If so they could be holding memory from a larger string. If the
substrings are equal and not retained, but the large string is held in
memory because the smaller string will contain a pointer to it, this
could occur.

If so, try calling "new String( substr )" on the sub-strings to make an
entirely new object instead of intern(). If this also works, that was
likely your problem.

Try running the code under a profiler, with memory profiling enabled
(and the problem verify to occur first), and post the results.



Other ideas that are probably not as workable:

Make sure it's really HashSet and not some derived (and broken)
implementation. There are also things called IdentityHashSet out there
in the wild that will give you the behavior are seeing.


Make sure it's really String and not something else. I don't know how
this could happen but it is pretty common when putting user defined
objects to mess up the hashcode() and equals() methods, which will break
containers like HashSet and HashMap.

 
Reply With Quote
 
Robert Klemme
Guest
Posts: n/a
 
      07-20-2011
On 20 Jul., 11:43, Frederik <(E-Mail Removed)> wrote:
> I've been doing java programming for over 10 years, but now I've
> encoutered a phenomenon that I wasn't aware of at all.


Apparently you didn't - as you found out in the meantime.

> I had an application in which I have a HashSet<String>. I added a lot
> of different String objects to this HashSet, but many of the String
> objects are equal to each other. Now, after a while my application ran
> out of memory, even with -Xmx1500M. This happened when there were only
> about 7000 different Strings in the set! I didn't understand this,
> until I started adding the "intern()" of every String object to the
> set instead of the original String object. Now the program needs
> virtually no memory anymore.
> There is only one explanation: before I used "intern()", ALL the
> different String objects, even the ones that are equal, were kept in
> memory by the HashSet! No matter how strange it sounds. I was
> wondering, does anybody have an explanation as to why this is the case?


No, that conclusion is not warranted by the facts. You only know that
*something* kept hold of a lot of memory (String instances). Since we
do neither know all the code nor do we know the application
architecture we can only speculate but it seems a realistic assumption
that those String instances are not only kept by the HashSet but
somewhere else.

An easy way you can create such a situation is that you are reading
from some external source (file) repeated content and create an object
which - among other things - holds the String. Now you have 1,000,000
objects holding on to 1,000,000 String instances but there are only
7,000 different character sequences. In such a situation it may be
better to have a HashMap<String,String> where you store the String
only once and reuse that first instance. Basically this is what
happened when you used String.intern() only that you do not have
control over this storage any more which - depending on application
type - can still create a serious memory leak, e.g. long running app
which over time reads multiple files with different sets of repeated
strings.

Kind regards

robert
 
Reply With Quote
 
lewbloch
Guest
Posts: n/a
 
      07-20-2011
On Jul 20, 8:38*am, Robert Klemme <(E-Mail Removed)> wrote:
> On 20 Jul., 11:43, Frederik <(E-Mail Removed)> wrote:
>
> > I've been doing java programming for over 10 years, but now I've
> > encoutered a phenomenon that I wasn't aware of at all.

>
> Apparently you didn't - as you found out in the meantime.
>
> > I had an application in which I have a HashSet<String>. I added a lot
> > of different String objects to this HashSet, but many of the String
> > objects are equal to each other. Now, after a while my application ran
> > out of memory, even with -Xmx1500M. This happened when there were only
> > about 7000 different Strings in the set! I didn't understand this,
> > until I started adding the "intern()" of every String object to the
> > set instead of the original String object. Now the program needs
> > virtually no memory anymore.
> > There is only one explanation: before I used "intern()", ALL the
> > different String objects, even the ones that are equal, were kept in
> > memory by the HashSet! No matter how strange it sounds. I was
> > wondering, does anybody have an explanation as to why this is the case?

>
> No, that conclusion is not warranted by the facts. *You only know that
> *something* kept hold of a lot of memory (String instances). *Since we
> do neither know all the code nor do we know the application
> architecture we can only speculate but it seems a realistic assumption
> that those String instances are not only kept by the HashSet but
> somewhere else.
>
> An easy way you can create such a situation is that you are reading
> from some external source (file) repeated content and create an object
> which - among other things - holds the String. *Now you have 1,000,000
> objects holding on to 1,000,000 String instances but there are only
> 7,000 different character sequences. *In such a situation it may be
> better to have a HashMap<String,String> where you store the String
> only once and reuse that first instance. *Basically this is what
> happened when you used String.intern() only that you do not have
> control over this storage any more which - depending on application
> type - can still create a serious memory leak, e.g. long running app
> which over time reads multiple files with different sets of repeated
> strings.
>


To highlight one of Robert's points much more specifically,
undisciplined use of 'intern()' can create memory pressure itself.
It's not really a good idea to intern every single 'String' because
that uses up the intern space and disables GC to clean up dead
strings.

Sweeping dirt under the carpet makes dirt less visible, not the floor
clean.

--
Lew
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Internet Sharing: Equal upload speeds but un-equal download speeds =?Utf-8?B?TkpU?= Wireless Networking 3 09-15-2007 06:22 AM
HashSet performance zero Java 14 11-14-2005 01:12 AM
HashSet with complicated objects Mike Java 3 11-20-2004 05:24 AM
HashSet is strange Anony! Java 8 08-29-2004 06:07 PM
Re: HashSet doesn't work correctly!!?? La'ie Techie Java 0 09-26-2003 01:22 AM



Advertisments