Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > string matching algorithm

Reply
Thread Tools

string matching algorithm

 
 
David Weldon
Guest
Posts: n/a
 
      11-24-2007
Problem: 1 million+ strings (Set A) need to be matched with 1 million+
substrings (Set B). For example:

Set A =
iliketotraveltohawaii
travelmagazine

Set B =
travel
hawaii

A(1,2) match "travel"
A(1) matches "hawaii"

What is the best approach to take with this problem? I have tried using
ferret:

http://www.ruby-forum.com/topic/132772

Indexing is fast, but search is very slow. I think ferret could be a
good choice if I had the substrings split out so iliketotraveltohawaii
-> "i like to travel to hawaii". I have a solution to this problem but
it can't be trusted with misspellings (that's an entirely different
forum topic). Is there something obvious that I'm missing here?
--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
 
Axel Etzold
Guest
Posts: n/a
 
      11-24-2007

-------- Original-Nachricht --------
> Datum: Sun, 25 Nov 2007 07:40:50 +0900
> Von: David Weldon <(E-Mail Removed)>
> An: http://www.velocityreviews.com/forums/(E-Mail Removed)
> Betreff: string matching algorithm


> Problem: 1 million+ strings (Set A) need to be matched with 1 million+
> substrings (Set B). For example:
>
> Set A =
> iliketotraveltohawaii
> travelmagazine
>
> Set B =
> travel
> hawaii
>
> A(1,2) match "travel"
> A(1) matches "hawaii"
>
> What is the best approach to take with this problem? I have tried using
> ferret:
>
> http://www.ruby-forum.com/topic/132772
>
> Indexing is fast, but search is very slow. I think ferret could be a
> good choice if I had the substrings split out so iliketotraveltohawaii
> -> "i like to travel to hawaii". I have a solution to this problem but
> it can't be trusted with misspellings (that's an entirely different
> forum topic). Is there something obvious that I'm missing here?
> --
> Posted via http://www.ruby-forum.com/.


Dear David,

just curious: how fast is Array#grep ?

a=%w[liketotraveltohawaii travelmagazine]
b=%w[travel hawaii]

b.each_with_index{|x,i|
if a.grep(x)
puts 'entry ' + i.to_s + ' in Array a matches ' + x
end
}

Best regards,

Axel
--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

 
Reply With Quote
 
 
 
 
Axel Etzold
Guest
Posts: n/a
 
      11-24-2007

-------- Original-Nachricht --------
> Datum: Sun, 25 Nov 2007 07:40:50 +0900
> Von: David Weldon <(E-Mail Removed)>
> An: (E-Mail Removed)
> Betreff: string matching algorithm


> Problem: 1 million+ strings (Set A) need to be matched with 1 million+
> substrings (Set B). For example:
>
> Set A =
> iliketotraveltohawaii
> travelmagazine
>
> Set B =
> travel
> hawaii
>
> A(1,2) match "travel"
> A(1) matches "hawaii"
>
> What is the best approach to take with this problem? I have tried using
> ferret:
>
> http://www.ruby-forum.com/topic/132772
>
> Indexing is fast, but search is very slow. I think ferret could be a
> good choice if I had the substrings split out so iliketotraveltohawaii
> -> "i like to travel to hawaii". I have a solution to this problem but
> it can't be trusted with misspellings (that's an entirely different
> forum topic). Is there something obvious that I'm missing here?
> --
> Posted via http://www.ruby-forum.com/.


Dear David,

sorry, the last post wasn't correct. I mean:

a=%w[liketotraveltohawaii travelmagazine]
b=%w[travel hawaii]

b.each_with_index{|x,i|
a.each{|y|
if y.grep(x)
puts 'entry ' + y + ' in Array a contains ' + x
end
}
}

Best regards,

Axel

--
Psssst! Schon vom neuen GMX MultiMessenger gehört?
Der kann`s mit allen: http://www.gmx.net/de/go/multimessenger

 
Reply With Quote
 
Alex Young
Guest
Posts: n/a
 
      11-25-2007
David Weldon wrote:
> Problem: 1 million+ strings (Set A) need to be matched with 1 million+
> substrings (Set B). For example:
>
> Set A =
> iliketotraveltohawaii
> travelmagazine
>
> Set B =
> travel
> hawaii
>
> A(1,2) match "travel"
> A(1) matches "hawaii"
>
> What is the best approach to take with this problem?


Bloom filters? It's the fastest thing I can think of.

--
Alex

 
Reply With Quote
 
David Weldon
Guest
Posts: n/a
 
      11-25-2007
I think I'm going to end up answering my own question here. I tried a
bloom filter kind of approach and various grep schemes. None of which
scaled well to large data sets. So far my best solution has been to use
Ferret and index on trigrams instead of unigrams like I was doing
before. That sped up my search by ~100x. I'm open to other ideas if
anyone has them, but for now this should be fast enough.
--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Algorithm for fuzzy string matching Martin Hansen Ruby 1 03-23-2011 08:11 PM
Help with Pattern matching. Matching multiple lines from while reading from a file. Bobby Chamness Perl Misc 2 05-03-2007 06:02 PM
compilation error: "error: no matching function for call to 'String::String(String)' =?ISO-8859-1?Q?Martin_J=F8rgensen?= C++ 5 05-06-2006 03:48 PM
Pattern matching : not matching problem Marc Bissonnette Perl Misc 9 01-13-2004 05:52 PM
Maximum Weighted Matching Algorithm Dan Peder Eriksen Java 3 08-17-2003 09:35 PM



Advertisments