Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > which collection to use


which collection to use

 
 
gaurav v bagga
 
      12-22-2006
hi all,

I have key/value pairs of data and I want to store them in a
collection. Which collection will be good with respect to retrieval
speed?

I found out from the net that HashMap is better than TreeMap in
retrieval speed.


HashMap<String, String> hm = new HashMap<String, String>();
hm.put("1", "r");
hm.put("2", "r");
hm.put("3", "r");
hm.put("4", "r");
hm.put("5", "r");
hm.put("6", "r");
hm.put("7", "rd");
hm.put("8", "r");
hm.put("9", "r");
hm.put("10", "r");
hm.put("11", "r");
hm.put("12", "r");
hm.put("13", "r");
hm.put("14", "rd");

long time=System.nanoTime();
System.out.println(hm.get("10"));
System.out.println(System.nanoTime()-time);

TreeMap<String, String> tm = new TreeMap<String, String>();
tm.put("1", "r");
tm.put("2", "r");
tm.put("3", "r");
tm.put("4", "r");
tm.put("5", "r");
tm.put("6", "r");
tm.put("7", "rd");
tm.put("8", "r");
tm.put("9", "r");
tm.put("10", "r");
tm.put("11", "r");
tm.put("12", "r");
tm.put("13", "r");
tm.put("14", "rd");

time=System.nanoTime();
System.out.println(tm.get("10"));
System.out.println(System.nanoTime()-time);


but this code of mine gives the result:

r
2405892
r
832229

it seems the tree is better, or is it that I am measuring the speed in
the wrong way?

regards
gaurav

 
 
 
 
 
Steve W. Jackson
 
      12-22-2006
In article <(E-Mail Removed) .com>,
"gaurav v bagga" <(E-Mail Removed)> wrote:

> hi all,
>
> I have key,value pair of structure and I want to store them in a
> collection. Which collection will be good with respect to retrieval
> speed.
>
> [benchmark code and timings snipped]
>
> it seems tree is better or is it that I am measuring the speed in wrong
> way?
>
> regards
> gaurav


I doubt that your test is really meaningful, having so few actual values
and using String for your keys as you do.

TreeMap is a sorted map. HashMap is not. Based on that, it's logical
to expect some overhead on the TreeMap that could impact performance
somewhat. Whether and when it becomes meaningful probably depends
largely on how much data you've got and your retrieval patterns.

But your choice of one or the other should not be driven by your
perception of speed in retrieving their contents, but by your program's
needs. If you never need to iterate over your collection in order of
its keys, why use a TreeMap? Understand the difference between the two,
determine what your need is, and choose whichever best fits.
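For what it's worth, a minimal sketch of the difference Steve describes (the class and method names here are mine, not from the thread): the same fourteen pairs come back in key order from a TreeMap, and in no guaranteed order from a HashMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderDemo {
    // Fill a map with the keys "1".."14" and report the order in
    // which it iterates them.
    static List<String> keyOrder(Map<String, String> map) {
        for (int i = 1; i <= 14; i++) {
            map.put(String.valueOf(i), "r");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap iterates in sorted key order; note that String keys
        // sort lexicographically, so "10" comes before "2".
        System.out.println("TreeMap: " + keyOrder(new TreeMap<String, String>()));
        // HashMap makes no ordering guarantee at all.
        System.out.println("HashMap: " + keyOrder(new HashMap<String, String>()));
    }
}
```

If you never need that sorted iteration, the TreeMap's ordering work buys you nothing.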

= Steve =
--
Steve W. Jackson
Montgomery, Alabama
 
 
 
 
 
Flo 'Irian' Schaetz
 
      12-22-2006
And thus spoke gaurav v bagga...

> I have key,value pair of structure and I want to store them in a
> collection. Which collection will be good with respect to retrieval
> speed.

....
> I found out from net that HashMap is better then TreeMap in retrieval
> speed.


NEVER EVER decide something like this by throwing 14 values at it and
looking at how long it takes... Sorry, but it doesn't say much. I
wouldn't even trust such an experiment if you threw thousands of random
numbers at it.

A better way to do it...

1. Read the JavaDoc of HashMap and TreeMap. There you'll find which
algorithms are used by these classes. In some cases you'll find the
complexity of these algorithms there.

2. Use the web or a good book to find out which algorithm has which
complexity for which method, if you didn't find it in the JavaDoc.

THEN you can decide which algorithm to use. If you do not understand an
algorithm, you will never be able to decide if it's the best one for
your job. If you don't know what "complexity" or "O(n log n)" means, I
would advise you to read _at least_ what Wikipedia has to offer about
these topics. You don't have to know how to prove a complexity, but you
should understand what it means.

Flo
 
 
Patricia Shanahan
 
      12-22-2006
Flo 'Irian' Schaetz wrote:
> And thus spoke gaurav v bagga...
>
>> I found out from net that HashMap is better then TreeMap in retrieval
>> speed.
>
> [snip]
>
> 1. Read the JavaDoc of HashMap and TreeMap. There you'll find which
> algorithms are used by these classes. In some cases you'll find the
> complexity of these algorithms there.
>
> 2. Use the web or a good book to find out, which Algorithem has which
> complexity for which method, if you didn't find it in the JavaDoc.
>
> [snip]
>
> Flo


I disagree somewhat with this advice. Big-O notation only tells you
about the limiting behavior for large problems.

If there is a very frequent need to retrieve from small collections, the
performance could be dominated by overheads and constant factor
differences that disappear in the limiting case. There is then no
substitute for measurement.

However, the measurement needs to be realistic. In particular, timing
one retrieval makes no sense at all.
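A sketch of what "measuring many requests" could look like against the original fourteen pairs (class and method names here are mine). Even this only gives rough numbers, but at least it no longer times a single call:

```java
import java.util.HashMap;
import java.util.Map;

public class ManyGets {
    // Time a large number of retrievals rather than one; a single
    // get() mostly measures class loading, JIT activity and cache
    // misses, not the map itself.
    static long timeGets(Map<String, String> map, int reps) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            if (map.get("10") != null) hits++;
        }
        long elapsed = System.nanoTime() - start;
        // Using the result defeats dead-code elimination by the JIT.
        if (hits != reps) throw new AssertionError("key not found");
        return elapsed;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        for (int i = 1; i <= 14; i++) {
            map.put(String.valueOf(i), "r");
        }
        System.out.println(timeGets(map, 1_000_000) + " ns total");
    }
}
```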

Patricia
 
 
Flo 'Irian' Schaetz
 
      12-22-2006
And thus spoke Patricia Shanahan...

> I disagree somewhat with this advice. Big-O notation only tells you
> about the limiting behavior for large problems.


I just said "better" way, not "best" way. But you're right, I simply
assumed that he has a somewhat "bigger" problem to solve and not just 14
pairs. x*n can of course be bigger than n*n for small n, my fault,
really.

Anyway, understanding the algorithm seems like the best way to help
make a decision. Taking two classes which seem to do approximately what
one wants done, and then comparing them by using "Date", isn't in my Top
10, at least.

Flo
 
 
Patricia Shanahan
 
      12-22-2006
gaurav v bagga wrote:
> hi all,

.....
> r
> 2405892
> r
> 832229
>
> it seems tree is better or is it that I am measuring the speed in wrong
> way?


You are measuring speed in a very wrong way. Just taking your test, but
putting the work in separate methods and calling them multiple times, I got:

r
Hash 1437744
r
Tree 333823
r
Hash 151559
r
Tree 172166
r
Hash 2030422
r
Tree 182390
r
Hash 165632
r
Tree 181563
r
Hash 173731
r
Tree 177306
r
Hash 185852
r
Tree 177680
r
Hash 154439
r
Tree 183960


Obviously, the times are very variable, probably due to issues such as
cache misses, other programs running...

To measure this properly, go back to the program in which this is a
critical issue. Pick a typical list of items, and distribution of
requests. Make sure your test program reproduces those. Measure many
requests, and do the tests several times, alternating methods, so that
you are not charging one method with initial overheads.

The measured piece of work should take order seconds or tens of seconds,
so that the odd cache miss or page fault cannot affect the results.
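The "separate methods, called multiple times, alternating" structure Patricia describes might be sketched like this (names are mine; the rep counts are placeholders to raise until one run takes on the order of seconds):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class AlternatingBench {
    static Map<String, String> fill(Map<String, String> map) {
        for (int i = 1; i <= 14; i++) {
            map.put(String.valueOf(i), "r");
        }
        return map;
    }

    // The timed work sits in one shared method, so neither map gets
    // special treatment from the JIT.
    static long measure(Map<String, String> map, int reps) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            if (map.get("10") != null) hits++;
        }
        long elapsed = System.nanoTime() - start;
        if (hits != reps) throw new AssertionError("unexpected misses");
        return elapsed;
    }

    public static void main(String[] args) {
        Map<String, String> hash = fill(new HashMap<String, String>());
        Map<String, String> tree = fill(new TreeMap<String, String>());
        // Alternate several rounds so one-off startup costs are not
        // charged to whichever map happens to run first.
        for (int round = 0; round < 8; round++) {
            System.out.println("Hash " + measure(hash, 100_000));
            System.out.println("Tree " + measure(tree, 100_000));
        }
    }
}
```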

Patricia
 
 
Martin Gregorie
 
      12-23-2006
Patricia Shanahan wrote:
> gaurav v bagga wrote:
>> it seems tree is better or is it that I am measuring the speed in wrong
>> way?

>

Yes, you are measuring it the wrong way.

In any situation where the system is multi-tasking (and Java is always
multitasking to a first approximation, because there are at least two
threads running) it's meaningless to measure elapsed time. The only valid
measure of efficiency for a task that doesn't involve I/O is the amount
of CPU time it consumes, because that, at least, is not affected by other
processes unless the machine is so overloaded that every task switch
involves paging or task swapping.

Patricia's runs show interference by other processes very nicely.

If you are using Linux or a UNIX you should at least use the "time"
utility to measure the CPU time used. As you don't say what OS you're
using I can't say more than that.

> To measure this properly, go back to the program in which this is a
> critical issue. Pick a typical list of items, and distribution of
> requests. Make sure your test program reproduces those. Measure many
> requests, and do the tests several times, alternating methods, so that
> you are not charging one method with initial overheads.
>

Exactly. We can't advise you because we haven't any idea of:
- the number of items you want to search
- the relative size of the key and value in each item
- the degree of key duplication in your data
- the degree of key value clustering in your data
- the complexity of the key comparisons [1]

As Patricia said, the number of items and the key value distribution are
both vital information which you didn't provide.

The following assumes that all keys are unique. If that's not true
then you need a more complex approach that can handle duplicates.

If the data is static the following is generally true: if you only do
infrequent searches of under 10 items, a simple sequential search is
probably best because its overheads are minimal. With a few hundred
items and many searches of static data it's better to sort the data once
it's loaded and use a binary-chop search: a SortedMap will work this way
provided that you build an unsorted map first and then construct the
SortedMap from it. Adding unsorted data item by item to an empty
SortedMap is likely to be very slow.
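A sketch of that load-then-sort route (class name is mine). One concrete guarantee worth knowing here: per the TreeMap Javadoc, the TreeMap(SortedMap) constructor runs in linear time, while TreeMap(Map) and repeated put() calls cost O(n log n).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class BulkBuild {
    public static void main(String[] args) {
        // Load the data into a plain HashMap first...
        Map<String, String> unsorted = new HashMap<String, String>();
        for (int i = 1; i <= 14; i++) {
            unsorted.put(String.valueOf(i), "r");
        }
        // ...then build the sorted view in one step rather than
        // feeding items one by one into an empty TreeMap.
        SortedMap<String, String> sorted = new TreeMap<String, String>(unsorted);
        // String keys sort lexicographically: "1", "10", "11", ...
        System.out.println(sorted.firstKey());
    }
}
```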

If the data isn't static and is initially unordered then either a hash
table (HashMap) or a red-black binary tree (TreeMap) is the way to go.
If the HashMap hashing algorithm generates synonyms because of your key
value distribution or because you've set the initial capacity too low
its performance will degrade fast and the TreeMap would be better. If
you need to read out the data in order then TreeMap is the only game in
town.

[1] This alone can invalidate a theoretical analysis. For instance, most
people will swear blind that Quicksort will always blow a Replacement
sort into the weeds. No less an authority than Niklaus Wirth says so
without any qualification of this opinion so it *must* be so! But his
analysis assumes that swapping items is much more expensive than
comparing them. If your item swap method is very fast (e.g. swapping
references to large objects rather than copying the objects about in an
array) and the key comparison is slow (e.g. a lexicographic string
comparison over multiple keys rather than comparing single integer keys)
then you'll find in practice that a Replacement sort is several times
faster than Quicksort. I know this because I've been there.

> The measured piece of work should take order seconds or tens of seconds,
> so that the odd cache miss or page fault cannot affect the results.
>

Yes. You need a sample of real data to test against so you get a
representative key distribution and the sample has to be large enough to
give a realistic performance estimation for live data. This means it
*must* be within 2 - 3 orders of magnitude of the design data volume and
preferably within one order of magnitude. By way of illustration, I've
seen several databases that ran just fine on typical programmer test
volumes (under 10-50 items in the biggest table) but that had unusably
bad performance with a few tens of thousands of rows loaded.

If all this is new to you, go and find a copy of Sedgewick: Algorithms.
Its code examples are in Pascal, but any competent programmer should be
able to translate that to Java in their head. There are also Java
versions of this book - at a price.


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
 
 
Patricia Shanahan
 
      12-23-2006
Martin Gregorie wrote:
> Patricia Shanahan wrote:
>> gaurav v bagga wrote:
>>> it seems tree is better or is it that I am measuring the speed in wrong
>>> way?

>>

> Yes, you are measuring it the wrong way.
>
> In any situation where the system is multi-tasking (and Java is always
> multitasking to a first approximation, because there are at least two
> threads running) it's meaningless to measure elapsed time. The only valid
> measure of efficiency for a task that doesn't involve I/O is the amount
> of CPU time it consumes, because that, at least, is not affected by other
> processes unless the machine is so overloaded that every task switch
> involves paging or task swapping.


Yes and no. If the real objective is to measure CPU time it is fine to
measure that. However, measuring CPU time does not just eliminate
interference from other tasks, it also eliminates the job's own disk
wait time.

Also, CPU time can show strange effects when measuring multiple threads
on a hyperthreaded processor.

Ultimately, elapsed time is what matters.

Here's what I've done in the past when I wanted an accurate measure of
elapsed time for one job:

1. Disconnect all networking.

2. Only one user logged in.

3. Kill all non-essential daemons. I used to know exactly what was
needed to keep a Solaris system operational.

4. Kill all non-essential user jobs. Typically, I had one shell and the
job being measured, no window manager.

5. Hands off keyboard and mouse for the duration of the run.

However, for many cases that is overkill, and I just go with the time to
run a reasonably long piece of work.

Patricia
 
 
John Ersatznom
 
      12-23-2006
Patricia Shanahan wrote:
> However, for many cases that is overkill, and I just go with the time to
> run a reasonably long piece of work.


I'd agree here. I'd tend to measure average elapsed time over a large
number of jobs representative of the expected usage. The result should
be a time measurement representative of what the users can expect -- and
that number's more important than cycles or whatever.

I'd even have the system in the kind of state, multitasking-wise, that's
likely to be representative of the typical deployment environment.
Something that runs fast when it has sole use of the CPU but becomes a
pig in molasses due to cache misses or whatever when it runs in a
real-world situation will get caught this way, for starters. So will
something that makes the rest of the system unusable while it's
"thinking", where that may be a showstopper for the users. (Such a
something at minimum needs to use a separate, lower-priority thread for
the "thinking" and maybe explicitly yield now and again. I've also seen
systems slowed to a crawl by software that grovels over the whole
filesystem for hours, because of contention for disk access rather than
CPU use. The antivirus on my development box runs daily at 8 am,
generally taking until 10 to finish, and the system's a pain in the ass
to use during those two hours. Of course, a virus-riddled system would
be an even bigger pain, so...)
 
 
Thomas Hawtin
 
      12-23-2006
gaurav v bagga wrote:
>
> r
> 2405892
> r
> 832229


Yeah, I get:

r
617475
r
187102

Oh, but I switched and did TreeMap first...

> it seems tree is better or is it that I am measuring the speed in wrong
> way?


Most modern JREs use adaptive compilers. They will start interpreting
code and gradually compile bits and pieces, perhaps the same piece a
number of times.

So you need to be extremely careful. Assuming you care about 'hot
spots', then your microbenchmarks should loop many times and alternate
between algorithms. Better, use real code.
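One way to keep the adaptive compiler out of the numbers is to run and discard warm-up iterations before timing anything. A sketch, with names of my own choosing:

```java
import java.util.HashMap;
import java.util.Map;

public class WarmupDemo {
    static long timeGets(Map<String, String> map, int reps) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            if (map.get("10") != null) hits++;
        }
        long elapsed = System.nanoTime() - start;
        if (hits != reps) throw new AssertionError();
        return elapsed;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        for (int i = 1; i <= 14; i++) {
            map.put(String.valueOf(i), "r");
        }
        // Discard the first runs: they measure the interpreter and
        // the JIT compiling timeGets(), not the map.
        for (int warm = 0; warm < 10; warm++) {
            timeGets(map, 100_000);
        }
        // Only this run is reported, once the code is hot.
        System.out.println(timeGets(map, 1_000_000) + " ns");
    }
}
```

Even then, a JVM may recompile the method mid-run, so timings from real code beat any microbenchmark.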

One thing microbenchmarks don't cover is seas of not very warm code.
Apparently Swing is like this. For good performance there you need to
write less code ("there is no code faster than no code").

Either way you tend to get better performance, surprisingly, by keeping
methods small (Swing is really bad at this). In particular keep
low-level hot spots away from less well used code (Swing is really bad
at this). And factor out common code as best you can (Swing is really
bad at this).

Tom Hawtin
 