Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > optimsed HashMap

Reply
Thread Tools

optimsed HashMap

 
 
Arne Vajh°j
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 1:42 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
> indirectly quoted someone who said :
>
>> I'm not sure what you are trying to say there. You want the case where
>> you do not find something in a hash map to be optimized? "Optimized" how?

>
>> What do you mean "add to the list of words" and "freeze"?

>
> The following is not the real problem, but it might more simply
> illustrate what I am asking.
>
> Think of an ordinary HashMap<String,String>
>
> What it does is translate a few English words with French derivation,
> putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
> -> Napol&acute;on
>
> Let us say you have 100 such words you want to transform. (In my
> actual problem I have about 1500 words).
>
> You go through the files for a website looking at each word of text
> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>
> Most of the time word you look up is not in the list.
>
> This is a time-consuming process. I would like to speed it up.


Reading the HTML files, splitting them up in words and discarding
HTML seems more likely to be the bottleneck than the lookups in a
HashMap.

What does your measurements show CPU time and wall time
distributed between the different activities?

You did measure right??

> My lookup has two properties that might be exploited in some variant
> HashMap.
>
> 1. nearly always the lookup fails. The code should be optimised for
> this case. If it has some fast way of knowing the elt is not there,
> it should do that first.


HashMap should already do that.

> 2. the list of words to lookup does not change after initial
> preparation. I can afford to do some special calculation to prime the
> lookup. For example, I once heard of some tweaking to avoid long
> collision chains for a C implementation of HashMap.


If you have long collision chains then you can obviously improve
by choosing a better hash functions.

But do you have long collision chains?

Java String hash is not that bad.

You can easily test number of collisions.

> Another way of looking at the problem is it would be nice to have a
> HashSet implementation that was considerably faster than a HashMap.
> IIRC, currently HashSet is implemented as a HashMap.


Not carrying data will not make it considerable faster.

Hashing and lookup are the same.

Arne



 
Reply With Quote
 
 
 
 
markspace
Guest
Posts: n/a
 
      11-24-2012
On 11/23/2012 10:42 PM, Roedy Green wrote:

> My question had two purposes. To see if there was something available
> off the shelf, and to stimulate thought on some new algorithm that
> could have wider application that just my problem.



I think it's unlikely that you're going to get a better algorithm than
Boyer-Moore or some other standard algorithm. Those are standard
algorithms for a reason: no one's come up with anything better.

You might try compiling a regex to do it. Regex can be pretty efficient
in many cases. Nothing can beat a hand code parser though, so if you
really need speed, that's the ticket.

Example regex:

"\\b(word1|word2|word3|...etc.)\\b"

You also might consider the case where you do have several matches in
your character buffer. If you use a string and standard replacement,
you'll copy the entire buffer for each match. If the buffer is large,
that's a lot of time spent just copying the same bytes around. Look
into making your own buffer/string that doesn't require making copies
for each replacement. However if you really do seldom have a match,
this might not be worth your time investment.




 
Reply With Quote
 
 
 
 
Knute Johnson
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 08:39 AM, Knute Johnson wrote:
> On 11/24/2012 3:34 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
>> <(E-Mail Removed)> wrote, quoted or indirectly quoted
>> someone who said :
>>
>>> Another way of looking at the problem is it would be nice to have a
>>> HashSet implementation that was considerably faster than a HashMap.
>>> IIRC, currently HashSet is implemented as a HashMap.

>>
>> A a 3- valued HashSet.contains would still be useful
>> no -- not in set
>> yes - in set
>> maybe -- don't know
>>
>> At least you could eliminate the no's from further processing.
>>

>
> What about a sorted map? I would think searching a sorted map would be
> much quicker than an unsorted one.
>


Actually it's not faster, I tried it.

knute...

 
Reply With Quote
 
Robert Klemme
Guest
Posts: n/a
 
      11-25-2012
On 25.11.2012 14:50, Chris Uppal wrote:
> Robert Klemme wrote:


> Alternatively, use a real DFA matcher which will do all of that work for you as
> it creates a minimal (or maybe only nearly mininal) DFA.


Right, that's probably an even better option.

>> Not sure though whether it is dramatically faster or slower than a standard
>> string search like Boyer-Moore - probably not.

>
> As I mentioned elsewhere BM is no longer considered the best (there's a nice,
> albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern
> Matching in Strings").


Thank you for that pointer, Chris!

> But my gut feeling (another way of saying that I
> haven't bothered to test it) is that IO would probably dominate the exection
> time whatever algorithm was used (assuming it wasn't implemented
> inefficiently).


Yes, that seems likely.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Reply With Quote
 
Silvio
Guest
Posts: n/a
 
      11-26-2012
On 11/24/2012 02:12 AM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>


The use case you describe sounds like a good fit for a character trie.
The root represents the empty input string and each node has a child
node for each character extension of its current input path that needs
to be mapped. A node can optionally have a replacement string if it
marks the end of a word-to-be-mapped. If it doesn't than it marks a
sub-word only.
You simply read word characters and step down into the trie. If you walk
off the tree halfway through a word or if you end on a node that has no
replacement value then the word does not exist and you need to copy it
to your output. If the last character of a word leads to a node with a
replacement string than that is what you output.

Setting up the trie might take more time than setting up a plain word
map but the processing should be a lot faster.

Silvio

 
Reply With Quote
 
Jim Janney
Guest
Posts: n/a
 
      11-26-2012
Roedy Green <(E-Mail Removed)> writes:

> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.


In this case, where you are always looking up a newly created string,
it's likely that computing the hash code takes more time than the actual
lookup: java.lang.String.hashCode() looks at every character in the
string. You could experiment with a weaker hash code that is cheaper to
compute but generates more collisions, for example only hash the first n
characters for some small value of n.

Or take the length of the string (very cheap to compute) and check it
against a BitSet. Or do the same with the first (or last) character.
How well these strategies work will depend very much on the actual data.

--
Jim Janney
 
Reply With Quote
 
Daniele Futtorovic
Guest
Posts: n/a
 
      11-26-2012
On 24/11/2012 07:42, Roedy Green allegedly wrote:
> On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
> indirectly quoted someone who said :
>
>> I'm not sure what you are trying to say there. You want the case where
>> you do not find something in a hash map to be optimized? "Optimized" how?

>
>> What do you mean "add to the list of words" and "freeze"?

>
> The following is not the real problem, but it might more simply
> illustrate what I am asking.
>
> Think of an ordinary HashMap<String,String>
>
> What it does is translate a few English words with French derivation,
> putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
> -> Napol&acute;on
>
> Let us say you have 100 such words you want to transform. (In my
> actual problem I have about 1500 words).
>
> You go through the files for a website looking at each word of text
> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>
> Most of the time word you look up is not in the list.
>
> This is a time-consuming process. I would like to speed it up.


You might want to intern() the input to avoid having to recompute the
hash every time (if applicable). Other than that, you'll either be
wanting a better hashing algorithm, to avoid collisions, or indeed
something altogether fancier (but riskier in terms or RoI).

*shrug*

--
DF.
 
Reply With Quote
 
Robert Klemme
Guest
Posts: n/a
 
      11-26-2012
On 11/26/2012 10:03 PM, Daniele Futtorovic wrote:
> On 24/11/2012 07:42, Roedy Green allegedly wrote:


>> You go through the files for a website looking at each word of text
>> (avoiding HTML markup) in the HashMap. If you find it you replace it.
>>
>> Most of the time word you look up is not in the list.
>>
>> This is a time-consuming process. I would like to speed it up.

>
> You might want to intern() the input to avoid having to recompute the
> hash every time (if applicable). Other than that, you'll either be
> wanting a better hashing algorithm, to avoid collisions, or indeed
> something altogether fancier (but riskier in terms or RoI).


How would interning help? The input is read only once anyway and if you
mean to intern individual words of the input then how does the JVM do
the interning? My guess would be that some form of hashing would be
used there as well - plus that internal data structure must be thread
safe...

Kind regards

robert
 
Reply With Quote
 
Daniel Pitts
Guest
Posts: n/a
 
      11-26-2012
On 11/23/12 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
>


I don't think your use-case needs this kind of (read "micro")
optimization. However, I have often wished for a way to get a Map.Entry
for a key, which could then be used to insert a new value efficiently.

Map.Entry<K,V> getOrAddEntry(K key);


Map.Entry<K,V> e = map.getOrAddEntry(myKey);

if (e.getValue() == null) {
e.setValue(newValue(myKey));
}

useValue(e.getValue());

Alas, not exactly possible to add it now.
 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      11-27-2012
On 11/26/2012 6:44 PM, Daniel Pitts wrote:
> On 11/23/12 5:12 PM, Roedy Green wrote:
>> Is there something like HashMap but that optimised when nearly always
>> the thing you are looking up is not in the list, and when you can add
>> the list of words to look up and then freeze it.
>>
>> I have to scan an entire website looking up every word.
>>

>
> I don't think your use-case needs this kind of (read "micro")
> optimization. However, I have often wished for a way to get a Map.Entry
> for a key, which could then be used to insert a new value efficiently.
>
> Map.Entry<K,V> getOrAddEntry(K key);
>
>
> Map.Entry<K,V> e = map.getOrAddEntry(myKey);
>
> if (e.getValue() == null) {
> e.setValue(newValue(myKey));
> }
>
> useValue(e.getValue());
>
> Alas, not exactly possible to add it now.


java.util.concurrent.ConcurrentMap has putIfAbsent().

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)d
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
reuse HashMap$Entry (or HashMap in total) to avoid millions of allocations Vince Darley Java 4 03-02-2010 07:48 AM
java.util.Properties extending from HashMap<Object, Object> insteadof HashMap<String, String> Rakesh Java 10 04-08-2008 04:22 AM
HashMap vs TreeMap Ahmed Moustafa Java 2 08-10-2003 03:31 AM
Re: Properties2 extends Hashmap, pros and cons? Jon Skeet Java 5 07-08-2003 06:44 PM
HashMap Sanjay Kumar Java 2 07-05-2003 07:10 PM



Advertisments