On 5/29/12 3:23 PM, Daniel Pitts wrote:
> On 5/29/12 2:49 PM, Gene Wirchenko wrote:
>> On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew<>
>> wrote:
>>
>>> Gene Wirchenko wrote:
>>>> Daniel Pitts wrote:
>>>>
>>>> [snip]
>>>>
>>>>> Are you arguing that a modern system can't handle that number of
>>>>> words?
>>>>
>>>> No. I simply stating that the real size of the problem is much
>>>> bigger.
>>>
>>> With no numbers that differ from Daniel's to back up your claim.
>>>
>>> You called my numbers "made up", but it turned out they were
>>> *larger* than the real numbers.
>>>
>>> You cite "a quarter of a million" words. Daniel counted roughly
>>> *150%* of that in his word base.
>>
>> Ah, selective reading.
>>
>> For root forms, it was 1/4 million. With affixes -- and remember
>> that my first question was about them -- the figure was 3/4 million.
>> This is double what Daniel counted, and 3/4 million does not include
>> technical words, etc. Take a look at the *full* paragraph that I
>> quoted, not just the lowest number.
>>
>>> My numbers were generous. Yours are not even significantly different,
>>> and in fact are smaller than his. The numbers do show there is not
>>> much problem, yet somehow you claim with no logic or reasoning or
>>> different data that they do show a problem.
>>
>> Take another look at that paragraph I quoted. Really.
>>
>>> Clearly you are mistaken.
>>>
>>> Daniel showed evidence from experimentation. His numbers jibe with
>>> yours. Without compression, his data occupy roughly 5 MiB of memory.
>>>
>>> Show the problem or withdraw the claim.
>>
>> Read my statement of the problem.
>>
>>>>> A modern desktop has more than enough memory to easily handle a
>>>>> quarter
>>>>> *billion* words, which is a 100 times greater than your guessed
>>>>> upper limit.
>>>>>
>>>>> And that's *without* compression.
>>>>
>>>> Sure. If that is all that it does. My main (and older) desktop
>>>> box has 1.5 GB. I have trouble with not enough memory at times.
>>>> Adding another app might break its back.
>>>
>>> Again, how much damage will< 5 MiB of data do to that system?
>>>
>>> How about 50 MiB? That's *ten times* the number of words you might
>>> need to handle.
>>> Without any compression.
>>
>> Fine. My system is currently using 1547 MB of memory. It only
>> has 1536 MB. With some intensive uses, I have seen the commit go to
>> as high as about 2200 MB. My system crawls then. Using 50 MB more in
>> those circumstances would not be good.
>>
>>> You haven't shown a problem. You just chant, "The sky is falling!"
>>
>> I believe that I have. Handwaving the possibility away does not
>> make it go away.
>>
>> Sincerely,
>>
>> Gene Wirchenko
>
>
> Test program:
>
> package net.virtualinfinity.moby;
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Arrays;
>
> public class WordTree {
> WordNode start = new WordNode(-1);
>
> public void addWord(String word) {
> start.addWord(word);
> }
>
> public static void main(String[] args) throws IOException {
> WordTree tree = new WordTree();
> int wordsLoaded = 0;
> for (String filename: args) {
> System.out.println("Loading " + filename);
> BufferedReader reader = new BufferedReader(new FileReader(filename),
> 1024*256);
> String line;
> while ((line = reader.readLine()) != null) {
> ++wordsLoaded;
> if ((wordsLoaded & 127) == 0) {
> printStatus(wordsLoaded);
> }
> tree.addWord(line.trim());
> }
> System.gc();
> printStatus(wordsLoaded);
> System.out.println();
> }
> }
>
> private static void printStatus(int wordsLoaded) {
> System.out.print("\rWords loaded: " + wordsLoaded + ", Total Memory
> used: " + Runtime.getRuntime().totalMemory() + ". ");
> }
> }
>
>
> class WordNode {
> private boolean terminal;
> private final int value;
> private WordNode[] next;
>
> public WordNode(int ch) {
> value = ch;
> }
>
> public void addWord(String word) {
> if ("".equals(word)) {
> terminal = true;
> return;
> }
> final int ch = word.codePointAt(0);
> getNext(ch).addWord(word.substring(Character.charC ount(ch)));
> }
>
> private WordNode getNext(int ch) {
> if (next == null) {
> next = new WordNode[1];
> next[0] = new WordNode(ch);
> }
> for (WordNode node: next) {
> if (node.value == ch) {
> return node;
> }
> }
> next = Arrays.copyOf(next, next.length +1);
> final WordNode newNode = new WordNode(ch);
> next[next.length-1] = newNode;
> return newNode;
> }
> }
>
> ------- Output -----
>
> Loading compound-words.txt
> Words loaded: 256772, Total Memory used: 120909824.
> Loading often-mispelled.txt
> Words loaded: 257138, Total Memory used: 121106432.
> Loading english-most-frequent.txt
> Words loaded: 258141, Total Memory used: 121106432.
> Loading male-names.txt
> Words loaded: 262038, Total Memory used: 121106432.
> Loading female-names.txt
> Words loaded: 266984, Total Memory used: 121106432.
> Loading common-names.txt
> Words loaded: 288970, Total Memory used: 121106432.
> Loading common-dictionary.txt
> Words loaded: 363520, Total Memory used: 127373312.
> Loading official-scrabble-1st-edition.txt
> Words loaded: 477329, Total Memory used: 129957888.
> Loading official-scrabble-2nd-edition-delta.txt
> Words loaded: 481489, Total Memory used: 129957888.
BTW, if I check the memory usage before loading words and after, the
difference is ~ 42MB
So, loading 481k words takes up about 42MB. This is in java, which has a
fairly high overhead per string. And the implementation of my data
structure is also fairly naive as well.
Extrapolating that data to an extreme 2 million words, that would be
less than 200MB in memory.
My gut feeling beats your gut feeling, and my science proves it true. If
you are going to reply with a counter argument, please provide a
reproducible experiment to prove your argument. Otherwise, this
conversation is over.
|