Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > optimsed HashMap

Reply
Thread Tools

optimsed HashMap

 
 
Roedy Green
Guest
Posts: n/a
 
      11-24-2012
Is there something like HashMap but that optimised when nearly always
the thing you are looking up is not in the list, and when you can add
the list of words to look up and then freeze it.

I have to scan an entire website looking up every word.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
Reply With Quote
 
 
 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/23/2012 8:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.


HashMap get is already O(1). There is very little
room for optimization.

Arne


 
Reply With Quote
 
 
 
 
markspace
Guest
Posts: n/a
 
      11-24-2012
On 11/23/2012 5:12 PM, Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.



I'm not sure what you are trying to say there. You want the case where
you do not find something in a hash map to be optimized? "Optimized" how?

What do you mean "add to the list of words" and "freeze"?

>
> I have to scan an entire website looking up every word.
>


Google for "search engine." Wikipedia has an entry for it.


 
Reply With Quote
 
Roedy Green
Guest
Posts: n/a
 
      11-24-2012
On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I'm not sure what you are trying to say there. You want the case where
>you do not find something in a hash map to be optimized? "Optimized" how?


>What do you mean "add to the list of words" and "freeze"?


The following is not the real problem, but it might more simply
illustrate what I am asking.

Think of an ordinary HashMap<String,String>

What it does is translate a few English words with French derivation,
putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
-> Napol&acute;on

Let us say you have 100 such words you want to transform. (In my
actual problem I have about 1500 words).

You go through the files for a website looking at each word of text
(avoiding HTML markup) in the HashMap. If you find it you replace it.

Most of the time word you look up is not in the list.

This is a time-consuming process. I would like to speed it up.

My lookup has two properties that might be exploited in some variant
HashMap.

1. nearly always the lookup fails. The code should be optimised for
this case. If it has some fast way of knowing the elt is not there,
it should do that first.

2. the list of words to lookup does not change after initial
preparation. I can afford to do some special calculation to prime the
lookup. For example, I once heard of some tweaking to avoid long
collision chains for a C implementation of HashMap.

My question had two purposes. To see if there was something available
off the shelf, and to stimulate thought on some new algorithm that
could have wider application that just my problem.

Another way of looking at the problem is it would be nice to have a
HashSet implementation that was considerably faster than a HashMap.
IIRC, currently HashSet is implemented as a HashMap.

Such an algorithm could be used to fix your most common spelling
mistakes, to add links to magic words, to add markup to magic words
to find and report the presence of certain words, or in my case find
acronyms and replace them with a macro for that acronym that displays
the meaning of the acronym the first time it is used on a page.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
Reply With Quote
 
Volker Borchert
Guest
Posts: n/a
 
      11-24-2012
Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.


Unless said web site is www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.

Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.

I'd consider a tree along the following lines:

Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.

For example, a tree containing exactly
"a", "at", "and", "no", "nil", "not"
would be

N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // ""
N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] } // "an"
N4 = { true, [ null, 26*null ] } // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni"
N6 = { true, [ null, 19*null, N9, 6*null ] } // "no"
N7 = { true, [ null, 26*null ] } // "and"
N8 = { true, [ null, 26*null ] } // "nil"
N9 = { true, [ null, 26*null ] } // "not"

For a Map, replace the boolean with a field holding the value.

Possible improvements:

- Since we have three (32bit VM) or seven (64bit VM) padding bytes
in each Node, we could use two of them to hold a short indicating
the maximum length of Strings in the subtree.

- Instantiate Node[] only if actually needed to hold a child, and only
large enough to hold all currently known children. Use one of the
padding bytes to hold an index offset, such that a Node with only
a single child only holds a Node[1]:
N4 = { true, 0, null }
N6 = { true, 20, [ N9 ] }

- Use an interface and specialized concrete classes for
- leaf node (no fields at all, because boolean will always be true)
- single child node { boolean, Node }
or even "true" and "false" single child nodes
- ...
but this will considerable increase preprocessing cost.

- If many of your words contain non-[a-z] characters, provide array
positions for the most common of these, or use a linear probing
hash table for the non-[a-z] children.

- If the tree is very sparse, doing so even for the [a-z] children
might be better in the long run because more of it fits into caches.

This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.

--

"I'm a doctor, not a mechanic." Dr Leonard McCoy <>
"I'm a mechanic, not a doctor." Volker Borchert <>
 
Reply With Quote
 
Roedy Green
Guest
Posts: n/a
 
      11-24-2012
On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
<> wrote, quoted or indirectly quoted
someone who said :

>Another way of looking at the problem is it would be nice to have a
>HashSet implementation that was considerably faster than a HashMap.
>IIRC, currently HashSet is implemented as a HashMap.


A a 3- valued HashSet.contains would still be useful
no -- not in set
yes - in set
maybe -- don't know

At least you could eliminate the no's from further processing.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
Reply With Quote
 
Roedy Green
Guest
Posts: n/a
 
      11-24-2012
On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
<> wrote, quoted or indirectly
quoted someone who said :

>Look into the literature on fast text searching (for instance bit-parallel
>matching). It's not entirely clear to me what Roedy is trying to do, but it
>sounds as if "bulk" matching/searching might be relevant.


Yes a Boyer-Moore to simultaneously search for the whole list of
words, then when it has a hit see if it has word in isolation rather
than a word fragment.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
Reply With Quote
 
Robert Klemme
Guest
Posts: n/a
 
      11-24-2012
On 24.11.2012 12:39, Roedy Green wrote:
> On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
> <> wrote, quoted or indirectly
> quoted someone who said :
>
>> Look into the literature on fast text searching (for instance bit-parallel
>> matching). It's not entirely clear to me what Roedy is trying to do, but it
>> sounds as if "bulk" matching/searching might be relevant.

>
> Yes a Boyer-Moore to simultaneously search for the whole list of
> words, then when it has a hit see if it has word in isolation rather
> than a word fragment.


Here's another approach:

1. fill a HashMap with the translations.
2. Create a tree or trie from the keys.
3. Convert the trie to a regular expression optimized for NFA automata
(such as is used in Java std. library).
4. Surround the regexp with additional regexp to ensure word matches and
probably exclude matching inside HTML tags
5. Scan the document with Matcher.find()

The idea of item 3 is to create a regexp with as little backtracking as
possible. For example, from

foo
foot
fuss

you make

(?:f(?ot?)|uss)

Not sure though whether it is dramatically faster or slower than a
standard string search like Boyer-Moore - probably not.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Reply With Quote
 
Knute Johnson
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 3:34 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
> <> wrote, quoted or indirectly quoted
> someone who said :
>
>> Another way of looking at the problem is it would be nice to have a
>> HashSet implementation that was considerably faster than a HashMap.
>> IIRC, currently HashSet is implemented as a HashMap.

>
> A a 3- valued HashSet.contains would still be useful
> no -- not in set
> yes - in set
> maybe -- don't know
>
> At least you could eliminate the no's from further processing.
>


What about a sorted map? I would think searching a sorted map would be
much quicker than an unsorted one.

--

Knute Johnson
 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/23/2012 10:51 PM, Patricia Shanahan wrote:
> On 11/23/2012 5:12 PM, Roedy Green wrote:
>> Is there something like HashMap but that optimised when nearly always
>> the thing you are looking up is not in the list, and when you can add
>> the list of words to look up and then freeze it.
>>
>> I have to scan an entire website looking up every word.

>
> Look up "perfect hash". The idea is that, given a set of strings, you
> can calculate a hash function that maps each of those strings to a
> different bucket. That avoids any chain searching due to bucket
> collisions, and simplifies the data structures.


I would expects gains by using a better hash function
compared to the standard Java one to be very small for
String's containing words (English and Java code).

Arne


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
reuse HashMap$Entry (or HashMap in total) to avoid millions of allocations Vince Darley Java 4 03-02-2010 07:48 AM
java.util.Properties extending from HashMap<Object, Object> insteadof HashMap<String, String> Rakesh Java 10 04-08-2008 04:22 AM
HashMap vs TreeMap Ahmed Moustafa Java 2 08-10-2003 03:31 AM
Re: Properties2 extends Hashmap, pros and cons? Jon Skeet Java 5 07-08-2003 06:44 PM
HashMap Sanjay Kumar Java 2 07-05-2003 07:10 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57