Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > Using Classifier::LSI


Using Classifier::LSI

 
 
chrisjroos@gmail.com
 
      02-08-2006
Hi,

I've just spent some time looking at the ruby classifier[1] library.

I need to index 3000 one line product names.

The default behaviour is to index items as they are inserted (at least
it would certainly appear that's what's happening based on simple
benchmarking). This causes the time to index new items to rise
exponentially. After some experimentation, I found that turning off
the automatic indexing and then manually building once all items were
added was far quicker (17 secs to add 100 items with auto indexing, 0.6
secs to add 100 items with manual index building).

I also did a quick test to confirm that the output was the same in both
cases.

I'm really just wondering why the default behaviour seems to be the
most inefficient and whether I'm missing anything obvious using the
manual build method?
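
To see why the two approaches differ so much, here's a toy cost model
in plain Ruby (this is not the classifier gem's API, just a sketch):
if rebuilding an index of k items costs at least k units of work,
rebuilding after every insert does quadratic total work, while a
single build at the end is linear.

```ruby
# Toy cost model: rebuilding an index of k items costs k "units" here.
# The real LSI rebuild (an SVD) is far costlier per item, which only
# widens the gap between the two strategies.

def cost_rebuild_each_insert(n)
  # auto-indexing: rebuild after every one of the n inserts
  (1..n).sum { |k| k } # 1 + 2 + ... + n = n(n+1)/2
end

def cost_rebuild_once(n)
  # manual build: insert n items, then rebuild a single time
  n
end

puts cost_rebuild_each_insert(100) # prints 5050
puts cost_rebuild_once(100)        # prints 100
```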

Any help / pointers would be much appreciated.

Chris

[1] http://classifier.rufy.com

 
 
 
 
 
Cameron McBride
 
      02-08-2006
Hey Chris,

On 2/8/06, chrisjroos@gmail.com wrote:

> The default behaviour is to index items as they are inserted (at least
> it would certainly appear that's what's happening based on simple
> benchmarking). This causes the time to index new items to rise
> exponentially.


right. The index building takes time, and there is a lot of duplicated
work going on.

> After some experimentation, I found that turning off
> the automatic indexing and then manually building once all items were
> added was far quicker (17 secs to add 100 items with auto indexing, 0.6
> secs to add 100 items with manual index building).


Great. This is exactly why there is an :auto_rebuild option, so you can
make this more efficient (like you discovered).

> I'm really just wondering why the default behaviour seems to be the
> most inefficient and whether I'm missing anything obvious using the
> manual build method?


I believe the default behavior is aimed at simple usage, to help give
people a feel for what the library does. Given the mathematics
underneath the index building, with 3000 items the index building is
slower. (I believe the docs give around 500 items as the rule of thumb
for a speedy return.)

I don't think you're missing anything here. Just use
:auto_rebuild => false and build it manually.

I was actually advocating lazy evaluation as the default (so the
index wouldn't be built until you needed it, which would work
efficiently for both simple usage and your case). The argument
against this was to prevent users from thinking the searching was
slow. Once the index is built, everything else flies. LSI really
is math intensive, which is why there is a C-based backend for it.
Just make sure you're using something other than the pure-ruby
version if speed becomes an issue.

Let us know if anything else is unclear.


 
 
 
 
 
chrisjroos@gmail.com
 
      02-09-2006
Ok, so it took just over 7 hours to build the index of 3000 items.

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find related).

Should I even bother waiting for it to finish or should I be
investigating something else to achieve similar results?

BTW. I am using the c extension.

Thanks in advance,

Chris

 
 
Timothy Goddard
 
      02-09-2006
I'm doing a project at the moment cataloguing publications where I've
been using Ferret (a Ruby port of Lucene). I have to say, this is a
brilliant library. It has really good search capabilities, division of
'documents' into various fields that can be individually searched, and
it's fast. Here are some basic benchmarks using the simplest entry
mode (which both stores and tokenises all input) with ruby 1.8.4
(2005-10-29) [i686-linux] on a 1.4GHz Celeron M processor, 768MB
memory:

tim@ghostknife:~/programming/ruby> cat bm_ferret.rb
require 'benchmark'
require 'rubygems'
require 'ferret'

$index = Ferret::Index::Index.new(:path => './test_index', :create => true)
$words = []

File.open('/usr/share/dict/words') do |dict|
  dict.each_line do |line|
    $words << line.chomp
  end
end

def get_words
  ret = []
  10.times do
    ret << $words[(rand * $words.length).to_i]
  end
  ret
end

Benchmark.bmbm(7) do |x|
  x.report('Filing: ') do
    1000.times do
      $index << { :key => (rand * 1000000).to_i,
                  :data => get_words.join(' ') }
    end
  end
end

tim@ghostknife:~/programming/ruby> ruby bm_ferret.rb
Rehearsal --------------------------------------------
Filing:    7.920000   0.630000   8.550000 (  8.55150)
----------------------------------- total: 8.550000sec

              user     system      total        real
Filing:    7.680000   0.700000   8.380000 (  8.63788)

 
 
Cameron McBride
 
      02-09-2006
Chris,

On 2/9/06, chrisjroos@gmail.com wrote:
> Ok, so it took just over 7 hours to build the index of 3000 items.


something sounds drastically wrong.

> Since then it has been running for a further 12 hours trying to use
> that index to obtain likely matches for the same 3000 items; i.e. for
> each of the 3000 items I am trying to get the best matches from the
> index (using find related).


again, something seems funny here. I just benchmarked the dominant
calculation of the index building for somewhere around 3000 documents
with 50 unique keywords. This took on the order of 4 minutes on my
1.3GHz Athlon. The pure-ruby version would take exponentially longer.
 
 
Matthew Smillie
 
      02-09-2006

On Feb 9, 2006, at 7:48, chrisjroos@gmail.com wrote:
> Since then it has been running for a further 12 hours trying to use
> that index to obtain likely matches for the same 3000 items; i.e. for
> each of the 3000 items I am trying to get the best matches from the
> index (using find related).
>
> Should I even bother waiting for it to finish or should I be
> investigating something else to achieve similar results?


Can't comment on the time it takes, but the data you're using doesn't
seem particularly suited to LSI, in my opinion (and this sort of
thing is my occupation these days). LSI's not magic - what it's
doing is taking advantage of the statistical properties of language.
So it needs two things to work well: a relatively large set of words
compared to the number of items, and the items should be (more or
less) standard language.

Obviously I don't know exactly what the product names are, but as a
class, product names don't strike me as fitting those constraints
very well. Firstly because I expect them to be fairly short (5-6
words, tops?), and secondly because they lack a lot of the syntax and
semantic relations that you'd find in a sentence (nominals don't have
very much internal structure, in general).

Other promising approaches include standard word/document search
(like Ferret, already mentioned), or a language-model approach, which
works using relative frequencies of words. In the power-tool domain,
for instance, "grit" might correlate highly with "sander", so you
could say that anything with "grit" in it is related to sanding.

That said, I'm not aware of any Ruby libraries which implement this
sort of thing, so if you wanted to stick with Ruby, you'd be doing it
yourself (it's not a particularly sophisticated approach, though, so
it likely wouldn't be that hard).
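
Matthew's language-model suggestion can be sketched in a few lines of
plain Ruby (the product names and the related_to helper below are
invented for illustration, not part of any library): count how often
each word shares a product name with the query word, then rank by
that count.

```ruby
# Hypothetical co-occurrence sketch: words that appear in many of the
# same product names as the query word are treated as "related".
names = [
  "80 grit sanding disc",
  "120 grit sanding belt",
  "orbital sander 240W",
  "cordless drill 18V"
]

def related_to(word, names)
  counts = Hash.new(0)
  names.each do |name|
    words = name.downcase.split
    next unless words.include?(word)
    (words - [word]).each { |w| counts[w] += 1 }
  end
  # highest co-occurrence count first
  counts.sort_by { |_, c| -c }.map { |w, _| w }
end

puts related_to("grit", names).first # prints "sanding"
```

With real data you'd normalise by overall word frequency so common
words don't dominate, but even this crude count captures the
"grit goes with sanding" relation.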

matthew smillie.



----
Matthew Smillie <(E-Mail Removed)>
Institute for Communicating and Collaborative Systems
University of Edinburgh




 
 
chrisjroos@gmail.com
 
      02-09-2006
Ok, so I'm thinking (taking into account what Matthew Smillie has said)
that it's due to the number of 'unique' words found in my documents -
4343 unique words found in 3320 product names.
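
That vocabulary-to-document ratio is cheap to check before committing
to an indexer; a stdlib-only sketch (these product names are invented
for illustration):

```ruby
# Quick vocabulary check: a vocabulary comparable to or larger than
# the document count hints that the documents are very short, which
# is a poor fit for LSI.
names = ["80 grit sanding disc", "120 grit sanding belt", "orbital sander"]

vocab = names.flat_map { |n| n.downcase.split }.uniq
printf("%d unique words across %d names\n", vocab.size, names.size)
# prints: 8 unique words across 3 names
```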

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

Thanks for the help folks.

Chris

 
 
Cameron McBride
 
      02-09-2006
On 2/9/06, chrisjroos@gmail.com wrote:
> Ok, so I'm thinking (taking into account what Matthew Smillie has said)
> that it's due to the number of 'unique' words found in my documents -
> 4343 unique words found in 3320 product names.
>
> I have actually spent the morning looking at ferret and some other text
> searching libraries and at the moment I think ferret will do what I
> require.


sounds like the right approach, glad you found something that worked.

Cameron


 
 
Dave Fayram
 
      02-09-2006
> Ok, so it took just over 7 hours to build the index of 3000 items.

Depending on the size of your inputs, it can take longer.
Classifier::LSI has a very specific sweet spot of data volume in its
current incarnation. It's not currently suited for very large loads. I
am currently working on a version that will work with MPI so that
clusters can attack the problem. LSI just isn't useful for large
datasets without _very_ hefty hardware.

But if your matrix is only 3000x4000-ish, it shouldn't take that long.
Make sure that your classifier is binding to gsl, and if you need to,
email me and we can pursue it further. I only experience a breakdown in
speed when my unique-words list is in the 15k+ range, with more than
1500 items.

Also, remember that LSI wants to work with the *semantics*. A list of
product titles isn't going to bring very much semantic data into the
system. The way I originally envisioned using LSI was to create
summaries of documents, working on sets of document abstracts to
rapidly surface relevant ACM papers.

 
 
Cameron McBride
 
      02-09-2006
On 2/9/06, chrisjroos@gmail.com wrote:
> I have actually spent the morning looking at ferret and some other text
> searching libraries and at the moment I think ferret will do what I
> require.


There is also a Classifier::Bayes available that might be more in sync
with what you're trying to do.

Cameron


 