Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Pattern matching for "best match"

Reply
Thread Tools

Pattern matching for "best match"

 
 
Yash
Guest
Posts: n/a
 
      01-28-2004
We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks
 
Reply With Quote
 
 
 
 
Ben Morrow
Guest
Posts: n/a
 
      01-28-2004

http://www.velocityreviews.com/forums/(E-Mail Removed) (Yash) wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.


Define what you mean by the 'best' match. The one that matches the
most text? The intuitive (at least to me) meaning would involve
working out that (e.g.) /vodafone.*horoscope/ will match a subset of
the strings that /vodafone/ will match, and therefore the first is a
more specific ('better') match than the second. I would imagine this
would be pretty hard to work out just starting from the regexes.

Ben

--
Musica Dei donum optimi, trahit homines, trahit deos. |
Musica truces mollit animos, tristesque mentes erigit. | (E-Mail Removed)
Musica vel ipsas arbores et horridas movet feras. |
 
Reply With Quote
 
 
 
 
Tore Aursand
Guest
Posts: n/a
 
      01-28-2004
On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:
> We have a list of regular expressions such as: vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds. Our program
> reads a large file with millions of URLs. For each URL, it has to find
> the best match in the list of regular expressions.


You never define "best" in your posting. What do _you_ require to be a
perfect match? Almost perfect match? No match at all? Etc.

> For example www.vodafome.com will match with vodafone.


It will? Why?

> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for a
> given URL.


You could always try to treat the strings you match against a regular
expressions and in addition try to compare the strings letter by letter,
and in turn the distance between each letter in each string.

There are modules out there for comparing strings, as well. I suggest you
check out CPAN;

http://www.cpan.org/


--
Tore Aursand <(E-Mail Removed)>
"Life is pleasant. Death is peaceful. It's the transition that's
troublesome." -- Isaac Asimov
 
Reply With Quote
 
James Willmore
Guest
Posts: n/a
 
      01-28-2004
On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:

> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.


Are you trying to do web filtering or redirection? Fuzzy logic searches?
What?

And .... where's the code? Or is this where you're stuck?

NEI (Not Enoguh Information)

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
If all the world's a stage, I want to operate the trap door. --
Paul Beatty

 
Reply With Quote
 
Jim Keenan
Guest
Posts: n/a
 
      01-28-2004
(E-Mail Removed) (Yash) wrote in message news:<(E-Mail Removed) om>...
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
>

Define "best".
 
Reply With Quote
 
Steve
Guest
Posts: n/a
 
      01-28-2004
(E-Mail Removed) (Yash) wrote in message news:<(E-Mail Removed) om>...
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.
>
> Thanks


If you define 'the best match' as the string with the longest
continual stretch of characters identical to a target, then molecular
biology has developed some very efficient ways to do this to align DNA
(and protein) sequences. BLAST is the most commonly used and you can
get a description by Googling - essentially it takes a 'seed' of a few
characters, attempts to match and when it finds a match it looks
either side to extend the alignment and then ranks by length of
overlap. Molecular biologists need to worry about allowing
discrepancies and gaps within the match, but I suspect that for your
application a simple exact match is good enough.

I need not point out that the scale of your problem is tiny compared
with aligning genomes (but I did).

HTH

Steve
 
Reply With Quote
 
Gunnar Strand
Guest
Posts: n/a
 
      01-29-2004
Hi Yash,

Yash wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.
>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.
>
> Thanks


I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

Regards,

/Gunnar

 
Reply With Quote
 
Sam Holden
Guest
Posts: n/a
 
      01-29-2004
On Thu, 29 Jan 2004 06:21:08 +0100,
Gunnar Strand <(E-Mail Removed)> wrote:
>
> I assume you have put all patterns in "priority" order
> so that the first hit is the "best". My very nice Perl
> teacher suggested to generate an anonymous sub if I was
> going to do what you are doing:
>
> my $sub = 'sub test_for_match { $_ = $_[0]; '.
> join( '', map { "/$_/o && return $_; " } @patterns ) .
> 'return "No match"; }';
> eval { $sub };
> test_for_match( 'www.vodafone.com' );


There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.


> Haven't tested the code, but I hope I conveyed the message.
> The subroutine gets compiled once so it has little overhead
> and there are probably other optimization which can be done.


5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 199 due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?

--
Sam Holden
 
Reply With Quote
 
ctcgag@hotmail.com
Guest
Posts: n/a
 
      01-29-2004
(E-Mail Removed) (Yash) wrote:
> We have a list of regular expressions such as:
> vodafone.*horoscope
> vodafone.*pxtsend
> vodafone
> yahoo
>
> The total number of such reg expressions is few hundreds.
> Our program reads a large file with millions of URLs. For each URL, it
> has to find the best match in the list of regular expressions.
> For example www.vodafome.com will match with vodafone.
> www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
> Basically we have to find the regular expression that matches best for
> a given URL.


I assume the regular expressions are in order of "goodness".

warning "Untested code";
my @qr_rigo= map { qr/$_/ } @regex_in_goodness_order;
sub get_best {
study $_[0]; ## Will this help? Try and see.
foreach (@qr_rigo) {
return $_ if $_[0] =~ /$_/;
};
return;
};

while (my $url=<>) {
chomp $url;
my $b=get_best($url);
##
};


>
> Do you have any suggestions/pointers for doing this in a very
> efficient way? Suggestions regarding data structures to use,
> algorithm, etc. would be helpful.


If the above is not efficient enough (or if you have no way to order the
regexes by goodness by hand), then further optimization probably relies
minutes details and trade-offs which we don't know about. (What you mean
by "best match", how many of the 1000s of regex are substrings of each
other, the skew in the actual data set, etc.)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service New Rate! $9.95/Month 50GB
 
Reply With Quote
 
Gunnar Strand
Guest
Posts: n/a
 
      01-29-2004
Sam,

You are quite right, that was really, really bad
suggestion in more than one way. Sorry about that.

/Gunnar

Sam Holden wrote:
> On Thu, 29 Jan 2004 06:21:08 +0100,
> Gunnar Strand <(E-Mail Removed)> wrote:
>
>>I assume you have put all patterns in "priority" order
>>so that the first hit is the "best". My very nice Perl
>>teacher suggested to generate an anonymous sub if I was
>>going to do what you are doing:
>>
>>my $sub = 'sub test_for_match { $_ = $_[0]; '.
>> join( '', map { "/$_/o && return $_; " } @patterns ) .
>> 'return "No match"; }';
>>eval { $sub };
>>test_for_match( 'www.vodafone.com' );

>
>
> There is no anonymous sub in that code.
>
> And the test_for_match subroutine never exists since you are using block
> eval which doesn't do what you think it does. Anyway using an eval
> correctly would just lead to more problems because the code is
> completely wrong.
>
> If your perl teacher suggested such code, find another perl teacher.
> More likely you misinterpreted them.
>
>
>
>>Haven't tested the code, but I hope I conveyed the message.
>>The subroutine gets compiled once so it has little overhead
>>and there are probably other optimization which can be done.

>
>
> 5 seconds of testing would show that it doesn't work, thanks for letting
> lots of people do so instead of you doing so before posting it.
>
> That style of code generating a subroutine isn't necessary since perl
> 5.005 (ie. since mid 199 due to the introduction of qr//.
>
> See the obvious FAQ:
>
> perlfaq6: How do I efficiently match many regular expressions at once?
>


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Help with Pattern matching. Matching multiple lines from while reading from a file. Bobby Chamness Perl Misc 2 05-03-2007 06:02 PM
Matching neighbouring words of a pattern using Regex CV Perl 2 08-31-2004 12:27 AM
Pattern matching : not matching problem Marc Bissonnette Perl Misc 9 01-13-2004 05:52 PM
Pattern matching help! grep emails from file! danpres2k Perl 3 08-25-2003 02:47 PM
A newbie question on pattern matching DelphiDude Perl 3 07-26-2003 12:54 PM



Advertisments