Patricia trie vs binary search.

 
 
Daniel Pitts
05-29-2012
On 5/29/12 3:23 PM, Daniel Pitts wrote:
> On 5/29/12 2:49 PM, Gene Wirchenko wrote:
>> On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew <(E-Mail Removed)>
>> wrote:
>>
>>> Gene Wirchenko wrote:
>>>> Daniel Pitts wrote:
>>>>
>>>> [snip]
>>>>
>>>>> Are you arguing that a modern system can't handle that number of
>>>>> words?
>>>>
>>>> No. I am simply stating that the real size of the problem is much
>>>> bigger.
>>>
>>> With no numbers that differ from Daniel's to back up your claim.
>>>
>>> You called my numbers "made up", but it turned out they were
>>> *larger* than the real numbers.
>>>
>>> You cite "a quarter of a million" words. Daniel counted roughly
>>> *150%* of that in his word base.

>>
>> Ah, selective reading.
>>
>> For root forms, it was 1/4 million. With affixes -- and remember
>> that my first question was about them -- the figure was 3/4 million.
>> This is double what Daniel counted, and 3/4 million does not include
>> technical words, etc. Take a look at the *full* paragraph that I
>> quoted, not just the lowest number.
>>
>>> My numbers were generous. Yours are not even significantly different,
>>> and in fact are smaller than his. The numbers do show there is not
>>> much problem, yet somehow you claim with no logic or reasoning or
>>> different data that they do show a problem.

>>
>> Take another look at that paragraph I quoted. Really.
>>
>>> Clearly you are mistaken.
>>>
>>> Daniel showed evidence from experimentation. His numbers jibe with
>>> yours. Without compression, his data occupy roughly 5 MiB of memory.
>>>
>>> Show the problem or withdraw the claim.

>>
>> Read my statement of the problem.
>>
>>>>> A modern desktop has more than enough memory to easily handle a
>>>>> quarter *billion* words, which is 100 times greater than your
>>>>> guessed upper limit.
>>>>>
>>>>> And that's *without* compression.
>>>>
>>>> Sure. If that is all that it does. My main (and older) desktop
>>>> box has 1.5 GB, and I sometimes run short of memory as it is.
>>>> Adding another app might break its back.
>>>
>>> Again, how much damage will < 5 MiB of data do to that system?
>>>
>>> How about 50 MiB? That's *ten times* the number of words you might
>>> need to handle.
>>> Without any compression.

>>
>> Fine. My system is currently using 1547 MB of memory. It only
>> has 1536 MB. With some intensive uses, I have seen the commit go
>> as high as about 2200 MB. My system crawls then. Using 50 MB more in
>> those circumstances would not be good.
>>
>>> You haven't shown a problem. You just chant, "The sky is falling!"

>>
>> I believe that I have. Handwaving the possibility away does not
>> make it go away.
>>
>> Sincerely,
>>
>> Gene Wirchenko

>
>
> Test program:
>
> package net.virtualinfinity.moby;
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Arrays;
>
> public class WordTree {
>     WordNode start = new WordNode(-1);
>
>     public void addWord(String word) {
>         start.addWord(word);
>     }
>
>     public static void main(String[] args) throws IOException {
>         WordTree tree = new WordTree();
>         int wordsLoaded = 0;
>         for (String filename : args) {
>             System.out.println("Loading " + filename);
>             BufferedReader reader = new BufferedReader(
>                     new FileReader(filename), 1024 * 256);
>             String line;
>             while ((line = reader.readLine()) != null) {
>                 ++wordsLoaded;
>                 if ((wordsLoaded & 127) == 0) {
>                     printStatus(wordsLoaded);
>                 }
>                 tree.addWord(line.trim());
>             }
>             reader.close();
>             System.gc();
>             printStatus(wordsLoaded);
>             System.out.println();
>         }
>     }
>
>     private static void printStatus(int wordsLoaded) {
>         System.out.print("\rWords loaded: " + wordsLoaded
>                 + ", Total Memory used: "
>                 + Runtime.getRuntime().totalMemory() + ". ");
>     }
> }
>
> // A plain (uncompressed) trie node: children live in a small array
> // that is grown one element at a time and searched linearly.
> class WordNode {
>     private boolean terminal; // true if a word ends at this node
>     private final int value;
>     private WordNode[] next;
>
>     public WordNode(int ch) {
>         value = ch;
>     }
>
>     public void addWord(String word) {
>         if ("".equals(word)) {
>             terminal = true;
>             return;
>         }
>         final int ch = word.codePointAt(0);
>         getNext(ch).addWord(word.substring(Character.charCount(ch)));
>     }
>
>     private WordNode getNext(int ch) {
>         if (next == null) {
>             next = new WordNode[1];
>             next[0] = new WordNode(ch);
>         }
>         for (WordNode node : next) {
>             if (node.value == ch) {
>                 return node;
>             }
>         }
>         next = Arrays.copyOf(next, next.length + 1);
>         final WordNode newNode = new WordNode(ch);
>         next[next.length - 1] = newNode;
>         return newNode;
>     }
> }
>
> ------- Output -----
>
> Loading compound-words.txt
> Words loaded: 256772, Total Memory used: 120909824.
> Loading often-mispelled.txt
> Words loaded: 257138, Total Memory used: 121106432.
> Loading english-most-frequent.txt
> Words loaded: 258141, Total Memory used: 121106432.
> Loading male-names.txt
> Words loaded: 262038, Total Memory used: 121106432.
> Loading female-names.txt
> Words loaded: 266984, Total Memory used: 121106432.
> Loading common-names.txt
> Words loaded: 288970, Total Memory used: 121106432.
> Loading common-dictionary.txt
> Words loaded: 363520, Total Memory used: 127373312.
> Loading official-scrabble-1st-edition.txt
> Words loaded: 477329, Total Memory used: 129957888.
> Loading official-scrabble-2nd-edition-delta.txt
> Words loaded: 481489, Total Memory used: 129957888.


BTW, if I check the memory usage before loading words and after, the
difference is about 42 MB.

So loading 481k words takes up about 42 MB. This is in Java, which has a
fairly high per-string overhead, and the implementation of my data
structure is fairly naive as well.
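
For the record, the before/after check was along these lines; a minimal
sketch, where usedMemory() is just a helper name introduced here, and
System.gc() is only a hint to the VM, so the delta is approximate:

public class MemoryDelta {
    static long usedMemory() {
        final Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.gc();                 // hint only; a collection is not guaranteed
        final long before = usedMemory();

        // ... build the WordTree and load the word lists here ...

        System.gc();
        final long after = usedMemory();
        System.out.println("Approximate bytes retained: " + (after - before));
    }
}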

Extrapolating that data to an extreme 2 million words, that would be
less than 200MB in memory.

My gut feeling beats your gut feeling, and my science proves it true. If
you are going to reply with a counterargument, please provide a
reproducible experiment to prove your argument. Otherwise, this
conversation is over.
 
Gene Wirchenko
05-29-2012
On Tue, 29 May 2012 15:39:16 -0700, Daniel Pitts
<(E-Mail Removed)> wrote:

[snip]

>BTW, if I check the memory usage before loading words and after, the
>difference is about 42 MB.
>
>So loading 481k words takes up about 42 MB. This is in Java, which has a
>fairly high per-string overhead, and the implementation of my data
>structure is fairly naive as well.


So about 100 bytes per word (42 MB / 481k words is roughly 87 bytes).

>Extrapolating that data to an extreme 2 million words, that would be
>less than 200MB in memory.


Or about 100 MB for my SWAG of one million words.

>My gut feeling beats your gut feeling, and my science proves it true. If


It sure does. Lew's figures (up to 10 MB) were considerably
lower, and that is what I was objecting to. They seemed too low.

>you are going to reply with a counter argument, please provide a
>reproducible experiment to prove your argument. Otherwise, this
>conversation is over.


Actually, I started with a question, namely "Including all affixes?",
and I have been trying to get a reasonable answer to it. I knew that
your word counts were low compared to others that I have seen and
wanted to know a realistic memory use figure for the *whole* English
language. You have provided a sufficiently good approximation. Thank
you.

Sincerely,

Gene Wirchenko
 
Lew
05-30-2012
Gene Wirchenko wrote:
> Daniel Pitts wrote:
>
> [snip]
>
>>BTW, if I check the memory usage before loading words and after, the
>>difference is about 42 MB.
>>
>>So loading 481k words takes up about 42 MB. This is in Java, which has a
>>fairly high per-string overhead, and the implementation of my data
>>structure is fairly naive as well.


We've been talking apples and oranges. You refer here to the memory
footprint if the entire word list were in memory as naive Strings. I've been
talking about the footprint in storage (e.g., the SD card of a smartphone).

So, Gene, you have been misinterpreting my point.

There is little reason to hold the dictionary in memory at all times if you
are in a memory-constrained environment. And even if you did need to hold it
in memory and were worried about the footprint, you would not keep it in a
memory-intensive, naive structure all at once.

Compression exceeding 90% is typical with text. Assuming 75% compression,
a very reasonable and safe estimate, the data would occupy about 10 MB, just
as I said. That's if you keep it in memory in compressed form.
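
If you want to sanity-check that ratio, gzip the word list and compare
sizes; a rough sketch, with "words.txt" standing in for whatever list is
at hand:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressionCheck {
    public static void main(String[] args) throws IOException {
        final byte[] raw = Files.readAllBytes(Paths.get("words.txt"));
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);          // compress the whole list in one pass
        }
        System.out.printf("raw: %d bytes, gzipped: %d bytes (%.0f%% saved)%n",
                raw.length, out.size(),
                100.0 * (1 - (double) out.size() / raw.length));
    }
}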

A disk-based solution would likely occupy only a few KB at a time.
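
That disk-based approach is the "binary search" side of the thread title:
with the list sorted one word per line, you can probe the file directly.
A rough sketch, assuming an ASCII list sorted in byte order with Unix
newlines:

import java.io.IOException;
import java.io.RandomAccessFile;

public class DiskWordSearch {
    // Binary search over a sorted, newline-delimited ASCII word file.
    // Only the handful of lines actually probed are ever read.
    public static boolean contains(RandomAccessFile file, String word)
            throws IOException {
        long lo = 0;
        long hi = file.length();
        while (lo < hi) {
            final long lineStart = startOfLine(file, (lo + hi) / 2);
            file.seek(lineStart);
            final String line = file.readLine();
            if (line == null) {
                return false;                // ran off the end of the file
            }
            final int cmp = word.compareTo(line);
            if (cmp == 0) {
                return true;
            } else if (cmp < 0) {
                hi = lineStart;              // target sorts before this line
            } else {
                lo = file.getFilePointer();  // target sorts after this line
            }
        }
        return false;
    }

    // Scan backward from pos to the start of the line containing it.
    private static long startOfLine(RandomAccessFile file, long pos)
            throws IOException {
        while (pos > 0) {
            file.seek(pos - 1);
            if (file.read() == '\n') {
                break;
            }
            pos--;
        }
        return pos;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            System.out.println(contains(raf, args[1]));
        }
    }
}

Each lookup touches O(log n) lines of the file, so almost nothing needs
to be resident at once.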

Nice try to distort the issue into a scary monster, though.

--
Lew
 