Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Speeding up matching Strings in a set

Reply
Thread Tools

Speeding up matching Strings in a set

 
 
Volker Borchert
Guest
Posts: n/a
 
      10-25-2008
softwarepearls_com wrote:

|> Well, that the precise data inputs matter a lot (see benchmark code
|> below). For the test as written, the custom logic approach is roughly
|> twice as fast. But.. I saw that the length of input Strings matter a
|> lot, which is to be expected since the longer the Strings, the deeper
|> the if nesting gets. It's a race between HashSet's O(1) logic vs my
|> O(n) logc.

Only if the strings to test are re"used". The String.hashCode() takes
time proportional to String.length() the first time it is called for
any given String instance.

|> I also had a quick look at the generated bytecode (using Eclipse's
|> compiler, javac fans YMMV).. and found some recurring sillyness which
|> would not feature in the hand generated logic, so the custom approach
|> would give a significant performance advantage over HashSet.contains,
|> but only for really short Strings. Mine are about 3-5x longer than
|> those in the benchmark.. so for me, HashSet.contains() and String's
|> clever hashCode() wins.

The key might be code and data locality. How do the JIT compiled code
and the hashtable fit into instruction and data caches and vm pages?

BTW, have you tested switch against elsif?

--

"I'm a doctor, not a mechanic." Dr Leonard McCoy <(E-Mail Removed)>
"I'm a mechanic, not a doctor." Volker Borchert <(E-Mail Removed)>
 
Reply With Quote
 
 
 
 
Daniel Pitts
Guest
Posts: n/a
 
      10-26-2008
softwarepearls_com wrote:
> Let's say you had a set of Strings (I deliberately use lower case, so
> not necessarily java.util.Set), and you wanted to know if String x was
> in this set.. what would be the fastest way?
>
> Anyone have any experience comparing the performance of
> HashSet<String> against regular expressions?


HashSet is plenty fast in my experience. If you really wanted to get
"the fastest", you'd have to do some analysis.

All String comparison's (HashSet or FSA, or other) have the worst case
of O(n), which is when all but the last character checked matches. If
*all* of your strings start with "foo", but end unpredictably, then you
would be better matching backwards. If every string is completely
random, then a FSA or HashSet would be similar in performance.

There may be some other heuristic to help the worst case.

Now, after you've done that, you might also do some more analysis and
find that all your strings have characters that are between 32 and 196.

Then, you can use a FSA where each node contains an array of length 164
that holds the transition state. Depending on the number of strings
(and the uniformity of strings), you might have enough memory to
construct such an FSA which includes all your strings. This would
result in an algorithm that for every character in the string-to-check
uses two compares (to ensure the char is in range), a subtraction, an
array index lookup, another compare (for null), and a loop for the next
char. Then one final comparison for the final "isTerminal" state.

The alternative is: calculate the hash of the string-to-check (which
itself is n addition/multiplication operations), and then at least one
array access, and then n*m comparisons, where n is the length of the
string to check, and m is the number of hash collisions.

So, depending on the set size, data size, number of checks per second,
CPU power, and memory, you should try both approaches and see which
matches your requirements best.

Hope this helps.
--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
 
Reply With Quote
 
 
 
 
Mark Space
Guest
Posts: n/a
 
      10-26-2008
Robert Klemme wrote:

> Actually I believe I have seen an implementation that does not read all
> chars but jumps for longer strings. I would have to do more research
> though to find out which version of JDK it was or whether I mixed this
> up with a string hashing function in another library / programming
> language.


Joshua Bloch talks about this in Effective Java, 2nd ed. It was 16
characters max, and the algorithm jumped around for strings that were
longer than 16 characters, iirc. I don't recall in what version of Java
this was fixed, however.
 
Reply With Quote
 
Mark Space
Guest
Posts: n/a
 
      10-26-2008
softwarepearls_com wrote:

> The setup costs of any matching approach are not relevant for my
> application.. because the matching will be done a huge (10^[5..7])
> number of times..
>


Considering this and your other post (the one talking about optimizing
68k programming), have you considered JNI? How about a database with
some well choosen indexes? (How big is this set here?)

For JNI, you could possibly construct the HashSet first with Java, then
export the internal arrays to JNI C structs. Then write some custom
code to do the quick look-up.

Similar idea with using a database: eliminate as much Java as possible
and use (presumably) faster code to do the work that needs optimizing.

 
Reply With Quote
 
Mark Space
Guest
Posts: n/a
 
      10-26-2008
softwarepearls_com wrote:

>
> The set would hold very few Strings.. maybe a maximum of two dozen.
> The Strings would average a length of 20-30 chars. Many of them could
> share prefixes that have lengths of maybe half the String's length.
> And finally, I would be doing tens, hundreds of thousands of lookups.


OK, if it's only like 2 dozen, I'd consider something like a hash on
each character, so you don't have to even scan the whole input string
(to compute a hash).

OTOH, I have a hard time believing that this will be your actual
bottleneck. I'd write the program with just a HashSet, then profile it
and make sure the whole thing isn't IO bound or something.

Remember how to optimize:

1. Don't do it.
2. Don't do it yet.
3. Make sure you have a correct, working, clear solution first.
 
Reply With Quote
 
Daniel Pitts
Guest
Posts: n/a
 
      10-26-2008
softwarepearls_com wrote:
> Let's say you had a set of Strings (I deliberately use lower case, so
> not necessarily java.util.Set), and you wanted to know if String x was
> in this set.. what would be the fastest way?
>
> Anyone have any experience comparing the performance of
> HashSet<String> against regular expressions?


I just did an experiment. I actually had a similar situation, but
instead of "contains", I needed a map (from name to an object). I
originally had used a HashMap, which was fast enough, but this thread
got me thinking. I created a StringMap class which is *almost* a Map
(no iteration). The StringMap class utilized a Deterministic Finite
State Automata.

I ran my benchmarks and the speed as nearly the same between both the
StringMap<E> and the HashMap<String, E>. Surprisingly the
HashMap<String, E> performed slightly better. Not by enough that it
would sway my decision, except the fact that HashMap is existing and
well supported.

Now, in your situation there may be other factors, but I would surprised
if it was that much of a difference for you either.

Note, the size of the String keyset was in the 9000 range, so that may
affect the results.

Hope this helps,
Daniel.
--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
 
Reply With Quote
 
Patricia Shanahan
Guest
Posts: n/a
 
      10-26-2008
Daniel Pitts wrote:
> softwarepearls_com wrote:
>> Let's say you had a set of Strings (I deliberately use lower case, so
>> not necessarily java.util.Set), and you wanted to know if String x was
>> in this set.. what would be the fastest way?
>>
>> Anyone have any experience comparing the performance of
>> HashSet<String> against regular expressions?

>
> I just did an experiment. I actually had a similar situation, but
> instead of "contains", I needed a map (from name to an object). I
> originally had used a HashMap, which was fast enough, but this thread
> got me thinking. I created a StringMap class which is *almost* a Map
> (no iteration). The StringMap class utilized a Deterministic Finite
> State Automata.
>
> I ran my benchmarks and the speed as nearly the same between both the
> StringMap<E> and the HashMap<String, E>. Surprisingly the
> HashMap<String, E> performed slightly better. Not by enough that it
> would sway my decision, except the fact that HashMap is existing and
> well supported.
>
> Note, the size of the String keyset was in the 9000 range, so that may affect the results.



Modern processors do computation very fast, but achieve that partly by
having a lot of things going on at once, including work on instructions
that depend on conditional branches. Two things tend to slow them down.
Cache misses are expensive because access to memory takes a lot longer
than transfers inside the chip. Conditional branches are expensive,
unless correctly predicted, because the processor has to throw away a
lot of partly done work and switch to processing the correct next
instruction.

The number of cache misses is probably more-or-less independent of the
choice of method. However, the state machine and similar methods may
have a lot more conditional branches, and a lot more opportunities for
the processor to guess wrong, than the HashSet method.

Of course, the results will be influenced by data dependent issues such
as how quickly the state machine can reject non-matching inputs.

Patricia
 
Reply With Quote
 
John B. Matthews
Guest
Posts: n/a
 
      10-26-2008
In article <ge0j8l$egv$(E-Mail Removed)>,
Mark Space <(E-Mail Removed)> wrote:

> Robert Klemme wrote:
>
> > Actually I believe I have seen an implementation that does not read all
> > chars but jumps for longer strings. I would have to do more research
> > though to find out which version of JDK it was or whether I mixed this
> > up with a string hashing function in another library / programming
> > language.

>
> Joshua Bloch talks about this in Effective Java, 2nd ed. It was 16
> characters max, and the algorithm jumped around for strings that were
> longer than 16 characters, iirc. I don't recall in what version of Java
> this was fixed, however.


At the end of item 9, he mentions: "The String hash function implemented
in all releases prior to 1.2 examined at most sixteen characters, evenly
spaced throughout the string..." The even spacing was a problem for
hierarchical strings like URLs and, I suppose, package names.

--
John B. Matthews
trashgod at gmail dot com
http://home.roadrunner.com/~jbmatthews/
 
Reply With Quote
 
Robert Klemme
Guest
Posts: n/a
 
      10-26-2008
On 26.10.2008 09:53, John B. Matthews wrote:
> In article <ge0j8l$egv$(E-Mail Removed)>,
> Mark Space <(E-Mail Removed)> wrote:
>
>> Robert Klemme wrote:
>>
>>> Actually I believe I have seen an implementation that does not read all
>>> chars but jumps for longer strings. I would have to do more research
>>> though to find out which version of JDK it was or whether I mixed this
>>> up with a string hashing function in another library / programming
>>> language.

>> Joshua Bloch talks about this in Effective Java, 2nd ed. It was 16
>> characters max, and the algorithm jumped around for strings that were
>> longer than 16 characters, iirc. I don't recall in what version of Java
>> this was fixed, however.

>
> At the end of item 9, he mentions: "The String hash function implemented
> in all releases prior to 1.2 examined at most sixteen characters, evenly
> spaced throughout the string..." The even spacing was a problem for
> hierarchical strings like URLs and, I suppose, package names.


Mark, John, thanks for helping my feeble memory!

Kind regards

robert
 
Reply With Quote
 
softwarepearls_com
Guest
Posts: n/a
 
      10-26-2008
On 25 Oct 2008 19:03:27 GMT, http://www.velocityreviews.com/forums/(E-Mail Removed) (Volker
Borchert) wrote:

>softwarepearls_com wrote:
>
>|> Well, that the precise data inputs matter a lot (see benchmark code
>|> below). For the test as written, the custom logic approach is roughly
>|> twice as fast. But.. I saw that the length of input Strings matter a
>|> lot, which is to be expected since the longer the Strings, the deeper
>|> the if nesting gets. It's a race between HashSet's O(1) logic vs my
>|> O(n) logc.
>
>Only if the strings to test are re"used". The String.hashCode() takes
>time proportional to String.length() the first time it is called for
>any given String instance.


Sure.. but in my situation they would very much be reused. In fact, I
strongly suspect that they're even interned Strings.. which, if true,
would open up an even faster optimization.

>The key might be code and data locality. How do the JIT compiled code
>and the hashtable fit into instruction and data caches and vm pages?
>
>BTW, have you tested switch against elsif?


I haven;t delved deeper than the benchmark code I wrote. Like another
poster suggested, I can't lose track of the remainder of the
application... so I settled with the HashSet approach, for the time
being.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Find closest matching string based on collection of strings inlist/dict/set Joel Goldstick Python 1 08-31-2010 09:24 PM
matching strings in a large set of strings Karin Lagesen Python 13 05-03-2010 03:53 PM
External Hashing [was Re: matching strings in a large set of strings] Helmut Jarausch Python 3 04-30-2010 08:44 PM
How to replace all strings matching a pattern with correspondinglower case strings ? anonym Java 1 01-15-2009 07:29 PM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM



Advertisments