Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Patricia trie vs binary search.


Patricia trie vs binary search.

 
 
Gene Wirchenko
 
      05-28-2012
On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
<(E-Mail Removed)> wrote:

>On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>> <(E-Mail Removed)> wrote:
>>
>> [snip]
>>
>>> I tend to use a Deterministic Finite State Automata for this. You can
>>> load the entire English dictionary fairly easily with that scheme. Yes,

>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> you use a bit of memory, but unless you're doing this on an embedded
>>> device, its probably not enough memory to be concerned about.

>>
>> Including all affixes?

>I suppose it depends on the particular dictionary, but we're only
>talking a few hundred thousand entries, at least with the Moby
>word-lists as a base:


Considering how many affixes can be applied to some words, I find
that very questionable:
self:
*ish
*ishly
un*ish
un*ishly
*less
*lessly
un*less
un*lessly
position:
*s
*ed
*al
*ally
re*
re*s
re*ed
*less
mis*
*er
*ers
friend
*s
*ly
*liness
be*
be*ing
be*ed
be*er
be*ers
These are not particularly extreme examples.

[snip]

Sincerely,

Gene Wirchenko
 
 
 
 
 
markspace
 
      05-28-2012
On 5/28/2012 8:20 AM, markspace wrote:
> On 5/27/2012 10:00 PM, Daniel Pitts wrote:
>> the Moby
>> word-lists

>
>
> Moby word lists are neat, thanks for pointing that out.



While I'm thinking about it, here's a link I found from BYU with links
to other corpora and word lists:

http://corpus.byu.edu/



 
 
 
 
 
Lew
 
      05-29-2012
On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
> <(E-Mail Removed)> wrote:
>
>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>>> <(E-Mail Removed)> wrote:
>>>
>>> [snip]
>>>
>>>> I tend to use a Deterministic Finite State Automata for this. You can
>>>> load the entire English dictionary fairly easily with that scheme. Yes,
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> you use a bit of memory, but unless you're doing this on an embedded
>>>> device, its probably not enough memory to be concerned about.
>>>
>>> Including all affixes?

>> I suppose it depends on the particular dictionary, but we're only
>> talking a few hundred thousand entries, at least with the Moby
>> word-lists as a base:

>
> Considering how many affixes can be applied to some words, I find
> that very questionable:
> self:
> *ish
> *ishly
> un*ish
> un*ishly
> *less
> *lessly
> un*less
> un*lessly
> position:
> *s
> *ed
> *al
> *ally
> re*
> re*s
> re*ed
> *less
> mis*
> *er
> *ers
> friend
> *s
> *ly
> *liness
> be*
> be*ing
> be*ed
> be*er
> be*ers
> These are not particularly extreme examples.


It's not a question of how extreme the examples are but how many there are.

Not all words can be legitimately affixed. Many can be affixed by algorithm,
or by bitmaps as to which affixes apply, so you only store the root, the
bitmap and perhaps one more form.
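Lew's root-plus-bitmap idea can be sketched in Java along these lines (the class name, the affix list, and the bit layout here are my own illustrative choices, not anything stated in the thread):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: store only the root word plus a bitmap saying which
// suffixes legally apply to it, instead of every inflected form.
public class AffixDictionary {
    // One bit per suffix rule; this particular list is illustrative only.
    static final String[] SUFFIXES = { "s", "ed", "less", "ness", "ly" };

    private final Map<String, Integer> roots = new HashMap<>();

    // Register a root together with a bitmap of applicable suffixes.
    public void addRoot(String root, int suffixBits) {
        roots.put(root, suffixBits);
    }

    // A word is accepted if it is a stored root, or a stored root plus
    // a suffix whose bit is set for that root.
    public boolean contains(String word) {
        if (roots.containsKey(word)) {
            return true;
        }
        for (int i = 0; i < SUFFIXES.length; i++) {
            String suffix = SUFFIXES[i];
            if (word.endsWith(suffix)) {
                String root = word.substring(0, word.length() - suffix.length());
                Integer bits = roots.get(root);
                if (bits != null && (bits & (1 << i)) != 0) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

So registering "friend" with the bits for "s" and "ly" set accepts "friends" and "friendly" while rejecting "friended", at the cost of one map entry and one int per root.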

I don't know how much memory expansion you think your factors will cause, as
you only hand wave and say there will be some and act like it's a problem, but
let's say it doubles the size of the dictionary. By Daniel's experiment
upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
Being text and all, that should compress to about 3 MiB or less.

Now I am interested to hear what sort of trouble you assert that 3 MiB or so
of storage will cause.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedi.../c/cf/Friz.jpg
 
 
Jeff Higgins
 
      05-29-2012
On 05/24/2012 07:07 PM, markspace wrote:
> For some reason I was thinking about sub-string searches today. I
> looked up Patricia tries (a kind of radix tree) to see if they would
> help. While interesting, the radix tree seems to have a lot of overhead
> for large numbers of entries.
>
> The radix tree uses a bucket at each level to hold all children (and
> there could be quite a lot of children). Each child if present requires
> a pointer (an object in Java) to hold it. For the example given, this
> could be as much as one object per character in each string, plus the
> bucket to hold it and its siblings. If the number strings is very large,
> this could really result in an explosion of memory usage.
>
> <http://en.wikipedia.org/wiki/Radix_tree>
>
> So is there a better way for large data sets?
>

Different. Better?
<http://www.pathcom.com/~vadco/cwg.html>
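For readers weighing the quoted question, here is a minimal radix-tree (Patricia-style) sketch in Java; the class and method names are my own illustration, not from the thread. Each edge carries a whole substring, so single-child chains collapse, but the per-node child map is exactly the "bucket" overhead markspace raises:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal radix tree: edges hold compressed substrings, nodes hold a
// HashMap of children. Every Node, Edge, and map entry is a separate
// Java object, which is where the memory overhead comes from.
public class RadixTree {
    private static final class Node {
        final Map<Character, Edge> children = new HashMap<>();
        boolean isWord;
    }

    private static final class Edge {
        String label;   // compressed substring carried by this edge
        Node target;
        Edge(String label, Node target) {
            this.label = label;
            this.target = target;
        }
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        String rest = word;
        while (true) {
            if (rest.isEmpty()) {
                node.isWord = true;
                return;
            }
            Edge edge = node.children.get(rest.charAt(0));
            if (edge == null) {
                Node leaf = new Node();
                leaf.isWord = true;
                node.children.put(rest.charAt(0), new Edge(rest, leaf));
                return;
            }
            int p = commonPrefixLength(edge.label, rest);
            if (p < edge.label.length()) {
                // Split the edge: insert an intermediate node at the fork.
                Node mid = new Node();
                Edge tail = new Edge(edge.label.substring(p), edge.target);
                mid.children.put(tail.label.charAt(0), tail);
                edge.label = edge.label.substring(0, p);
                edge.target = mid;
            }
            node = edge.target;
            rest = rest.substring(p);
        }
    }

    public boolean contains(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Edge edge = node.children.get(rest.charAt(0));
            if (edge == null || !rest.startsWith(edge.label)) {
                return false;
            }
            rest = rest.substring(edge.label.length());
            node = edge.target;
        }
        return node.isWord;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }
}
```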
 
 
Gene Wirchenko
 
      05-29-2012
On Mon, 28 May 2012 21:54:39 -0700, Lew <(E-Mail Removed)> wrote:

>On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>> <(E-Mail Removed)> wrote:
>>
>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:


[snip]

>> These are not particularly extreme examples.

>
>It's not a question of how extreme the examples are but how many there are.


I mentioned extremity just to point out that such base words are
not that unusual. There are a lot of them.

>Not all words can be legitimately affixed. Many can be affixed by algorithm,
>or by bitmaps as to which affixes apply, so you only store the root, the
>bitmap and perhaps one more form.


There are multiple affixes. I suppose they could be combined
into aggregate affixes, e.g. -less + -ness -> -lessness.

>I don't know how much memory expansion you think your factors will cause, as
>you only hand wave and say there will be some and act like it's a problem, but
>let's say it doubles the size of the dictionary. By Daniel's experiment


Let's not make up data. I would like to know how much of the
English language actually is in his dataset.

>upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
>Being text and all, that should compress to about 3 MiB or less.
>
>Now I am interested to hear what sort of trouble you assert that 3 MiB or so
>of storage will cause.


Assuming your facts and then calling me on what you made up?

http://oxforddictionaries.com/words/...glish-language
is short but interesting reading. Its final paragraph: "This suggests
that there are, at the very least, a quarter of a million distinct
English words, excluding inflections, and words from technical and
regional vocabulary not covered by the OED, or words not yet added to
the published dictionary, of which perhaps 20 per cent are no longer
in current use. If distinct senses were counted, the total would
probably approach three quarters of a million."

Now, add on what they exclude. Is one million words out of line?
It is one thing to have some of a language encoded. It is quite
another to try to handle everything. Exceptional cases tend to be
more difficult to handle.

Sincerely,

Gene Wirchenko
 
 
Daniel Pitts
 
      05-29-2012
On 5/28/12 9:54 PM, Lew wrote:
> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>> <(E-Mail Removed)> wrote:
>>
>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>>>> <(E-Mail Removed)> wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I tend to use a Deterministic Finite State Automata for this. You can
>>>>> load the entire English dictionary fairly easily with that scheme.
>>>>> Yes,
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> you use a bit of memory, but unless you're doing this on an embedded
>>>>> device, its probably not enough memory to be concerned about.
>>>>
>>>> Including all affixes?
>>> I suppose it depends on the particular dictionary, but we're only
>>> talking a few hundred thousand entries, at least with the Moby
>>> word-lists as a base:

>>
>> Considering how many affixes can be applied to some words, I find
>> that very questionable:
>> self:
>> *ish
>> *ishly
>> un*ish
>> un*ishly
>> *less
>> *lessly
>> un*less
>> un*lessly
>> position:
>> *s
>> *ed
>> *al
>> *ally
>> re*
>> re*s
>> re*ed
>> *less
>> mis*
>> *er
>> *ers
>> friend
>> *s
>> *ly
>> *liness
>> be*
>> be*ing
>> be*ed
>> be*er
>> be*ers
>> These are not particularly extreme examples.

>
> It's not a question of how extreme the examples are but how many there are.
>
> Not all words can be legitimately affixed. Many can be affixed by
> algorithm, or by bitmaps as to which affixes apply, so you only store
> the root, the bitmap and perhaps one more form.
>
> I don't know how much memory expansion you think your factors will
> cause, as you only hand wave and say there will be some and act like
> it's a problem, but let's say it doubles the size of the dictionary. By
> Daniel's experiment upthread, that would bring it to around 8 MiB, let's
> round and say 10MiB. Being text and all, that should compress to about 3
> MiB or less.
>
> Now I am interested to hear what sort of trouble you assert that 3 MiB
> or so of storage will cause.
>

Actually, the word lists I used to come up with my figures already
include the inflected forms of the words they cover.

For example "ishly" and "ness" suffixes:

> grep "ishly\b" * | wc -l

323

Which includes:
wordishly
womanishly
wolfishly
wishly
winterishly
wildishly
whorishly
whoreishly

Most of these entries aren't even in Thunderbird's spell-check dictionary.

> grep "ness\b" * | wc -l

11762

Includes entries such as:
woodlessness
wonderfulness
unmysteriousness
unexperiencedness
undeceptiveness
nonvolatileness


Again, most of these aren't in the spell-check dictionary either.

My numbers are accurate, but thanks for questioning them, so I could
further explore and see for myself.
 
 
Daniel Pitts
 
      05-29-2012
On 5/29/12 9:14 AM, Gene Wirchenko wrote:
> On Mon, 28 May 2012 21:54:39 -0700, Lew<(E-Mail Removed)> wrote:
>
>> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>>> <(E-Mail Removed)> wrote:
>>>
>>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:

>
> [snip]
>
>>> These are not particularly extreme examples.

>>
>> It's not a question of how extreme the examples are but how many there are.

>
> I mentioned extremity just to point out that such base words are
> not that unusual. There are a lot of them.
>
>> Not all words can be legitimately affixed. Many can be affixed by algorithm,
>> or by bitmaps as to which affixes apply, so you only store the root, the
>> bitmap and perhaps one more form.

>
> There are multiple affixes. I suppose they could be combined
> into aggregate affixes. e.g. -less + -ness -> -lessness.
>
>> I don't know how much memory expansion you think your factors will cause, as
>> you only hand wave and say there will be some and act like it's a problem, but
>> let's say it doubles the size of the dictionary. By Daniel's experiment

>
> Let's not make up data. I would like to know how much of the
> English language actually is in his dataset.

Agreed, let's not make up data. Using the Moby word-list as a guide, it
includes a significant number of those particular suffixes and prefixes
you've mentioned. Granted, it's not a complete word-list, but then again
nothing is. It is "complete enough" for most purposes.

>
>> upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
>> Being text and all, that should compress to about 3 MiB or less.
>>
>> Now I am interested to hear what sort of trouble you assert that 3 MiB or so
>> of storage will cause.

>
> Assuming your facts and then calling me on what you made up?
>
> http://oxforddictionaries.com/words/...glish-language
> is short but interesting reading. Its final paragraph: "This suggests
> that there are, at the very least, a quarter of a million distinct
> English words, excluding inflections, and words from technical and
> regional vocabulary not covered by the OED, or words not yet added to
> the published dictionary, of which perhaps 20 per cent are no longer
> in current use. If distinct senses were counted, the total would
> probably approach three quarters of a million."
>
> Now, add on what they exclude. Is one million words out of line?
> Itis one thing to have some of a language encoded. It is quite
> another to try to handle everything. Exceptional cases tend to be
> more difficult to handle.


Are you arguing that a modern system can't handle that number of words?
A modern desktop has more than enough memory to easily handle a quarter
*billion* words, which is more than 100 times your guessed upper limit.

And that's *without* compression.
 
 
Jeff Higgins
 
      05-29-2012
On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>
> Most of these entries aren't even in Thunderbirds spell-check dictionary.
>

How many seven letter words can I construct from the twenty-six letters
A through Z? How many of these words are defined English words? What
are/are not the common affixes in this set of English words?
 
 
Daniel Pitts
 
      05-29-2012
On 5/29/12 10:37 AM, Jeff Higgins wrote:
> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>
>> Most of these entries aren't even in Thunderbirds spell-check dictionary.
>>

> How many seven letter words can I construct from the twenty-six letters
> A through Z? How many of these words are defined English words? What
> are/are not the common affixes in this set of English words?


How about you download the Moby word list yourself and do these
analyses? It is a very simple grep for the first question.
egrep "\b[a-zA-Z]{7}\b" *.txt
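For what it's worth, a rough Java equivalent of that egrep might look like this (the file name is a placeholder, and unlike a bare wc -l pipeline, which counts matching lines, this collects the distinct seven-letter words themselves):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collects the distinct seven-letter words from a word-list file.
public class SevenLetterWords {
    public static Set<String> distinctSevenLetterWords(Path wordList) throws IOException {
        // Same pattern as the egrep: exactly seven ASCII letters
        // bounded by word boundaries.
        Pattern sevenLetters = Pattern.compile("\\b[a-zA-Z]{7}\\b");
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(wordList)) {
            Matcher m = sevenLetters.matcher(line);
            while (m.find()) {
                words.add(m.group().toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java SevenLetterWords single.txt
        System.out.println(distinctSevenLetterWords(Path.of(args[0])).size());
    }
}
```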



 
 
Jeff Higgins
 
      05-29-2012
On 05/29/2012 01:49 PM, Daniel Pitts wrote:
> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>
>>> Most of these entries aren't even in Thunderbirds spell-check
>>> dictionary.
>>>

>> How many seven letter words can I construct from the twenty-six letters
>> A through Z? How many of these words are defined English words? What
>> are/are not the common affixes in this set of English words?

>
> How about you download the Moby word list yourself and do these
> analyses?

I was asking you.

> It is a very simple grep for the first question.
> egrep "\b[a-zA-Z]{7}\b" *.txt

Nope.

 