Roedy Green wrote:
> Is there something like HashMap but that optimised when nearly always
> the thing you are looking up is not in the list, and when you can add
> the list of words to look up and then freeze it.
>
> I have to scan an entire website looking up every word.
Unless said web site is
www.archive.org, just create a HashMap
and use a Collections.unmodifiableMap() wrapper for lookup. I'd
expect memory to become a problem way earlier than processor.
Depending on how often each String is looked up, and how long the
Strings are, a data structure that does not need String.hashCode()
might be faster.
I'd consider a tree along the following lines:
Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
'`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
boolean and a Node[27]. The tree contains a word if there is a path
from the root to a Node whose boolean is true and that is reached via
Node[String.charAt(l)] at level l.
For example, a tree containing exactly
"a", "at", "and", "no", "nil", "not"
would be
N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // ""
N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
N3 = { false, [ null, 13*null, N7, 12*null ] } // "an"
N4 = { true, [ null, 26*null ] } // "at"
N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni"
N6 = { true, [ null, 19*null, N9, 6*null ] } // "no"
N7 = { true, [ null, 26*null ] } // "and"
N8 = { true, [ null, 26*null ] } // "nil"
N9 = { true, [ null, 26*null ] } // "not"
For a Map, replace the boolean with a field holding the value.
Possible improvements:
- Since we have three (32bit VM) or seven (64bit VM) padding bytes
in each Node, we could use two of them to hold a short indicating
the maximum length of Strings in the subtree.
- Instantiate Node[] only if actually needed to hold a child, and only
large enough to hold all currently known children. Use one of the
padding bytes to hold an index offset, such that a Node with only
a single child only holds a Node[1]:
N4 = { true, 0, null }
N6 = { true, 20, [ N9 ] }
- Use an interface and specialized concrete classes for
- leaf node (no fields at all, because boolean will always be true)
- single child node { boolean, Node }
or even "true" and "false" single child nodes
- ...
but this will considerable increase preprocessing cost.
- If many of your words contain non-[a-z] characters, provide array
positions for the most common of these, or use a linear probing
hash table for the non-[a-z] children.
- If the tree is very sparse, doing so even for the [a-z] children
might be better in the long run because more of it fits into caches.
This surely has been invented before and given a nice name...
Apache's or Google's collection libraries might even feature
high quality implementations.
--
"I'm a doctor, not a mechanic." Dr Leonard McCoy <>
"I'm a mechanic, not a doctor." Volker Borchert <>