Needs help with Matching Logic

 
 
Kishore
Guest
Posts: n/a
 
      07-20-2004
I am comparatively a newbie in Perl.
I am working on logic to display snippets of the results matched for a
'keyword' in a text file, just like Google does in its search
results.

I have the content of the text file in the variable $file_content.
And I have the 'keyword' in $keyword.

I need to get the string the way Google does when displaying search
results.
When I match the $keyword in the $file_content, I want to also pull 5
words before and 5 words after so I can show that snippet of the file
where the matching of the keyword occurs.

I searched in the google groups for a few days, but couldn't find
anything to help me.

I really appreciate any help I can get.

Thanks!
Kishore
 
 
 
 
 
Paul Lalli
Guest
Posts: n/a
 
      07-20-2004
On Tue, 20 Jul 2004, Kishore wrote:

> I am comparatively a newbie in Perl.
> I am working on logic to display snippets of the results matched for a
> 'keyword' in a text file, just like Google does in its search
> results.
>
> I have the content of the text file in the variable $file_content.
> And I have the 'keyword' in $keyword.
>
> I need to get the string the way Google does when displaying search
> results.
> When I match the $keyword in the $file_content, I want to also pull 5
> words before and 5 words after so I can show that snippet of the file
> where the matching of the keyword occurs.
>
> I searched in the google groups for a few days, but couldn't find
> anything to help me.
>
> I really appreciate any help I can get.


how about something like:

m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

Using that, $1 is the series of up to five words before the match, $2 is
the match, and $3 is the series of up to five words after the match.

It'd probably have to be tweaked a bit to get exactly what you want, but
it should at least give you a starting point.
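
A minimal sketch of how the pattern above might be used to build a snippet. The helper name snippet() and the sample text are made up for illustration; only the regex itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (before, keyword, after) for the first
# match, or an empty list when the keyword does not occur.
sub snippet {
    my ($text, $keyword) = @_;
    return unless $text =~ m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/;
    return ($1, $2, $3);
}

my $file_content = 'The quick brown fox jumps over the lazy dog '
                 . 'and runs far away from the noisy farm';

my ($before, $match, $after) = snippet($file_content, 'lazy');
print "...$before\[$match\]$after...\n" if defined $match;
```

Note that $keyword is interpolated as a regex here, so any metacharacters in it keep their special meaning.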

Paul Lalli
 
 
 
 
 
Kishore
Guest
Posts: n/a
 
      07-20-2004
Paul Lalli <(E-Mail Removed)> wrote in message
> how about something like:
>
> m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
>
> Using that, $1 is the series of up to five words before the match, $2 is
> the match, and $3 is the series of up to five words after the match.
>


It works really great.

Thank you very much.

What is the colon (:) for? I don't believe I saw this in the books I have
been referring to so far.

Thanks!
- Kishore.
 
 
gnari
Guest
Posts: n/a
 
      07-20-2004
"Kishore" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Paul Lalli <(E-Mail Removed)> wrote in message
> > how about something like:
> >
> > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
> >

>
> It works really great.
>
> What is the colon (:) for? I don't believe I saw this in the books I have
> been referring to so far.


(?:...)

look up 'Extended Patterns' in
perldoc perlre

gnari



 
 
Ilmari Karonen
Guest
Posts: n/a
 
      07-21-2004
On 2004-07-20, Paul Lalli <(E-Mail Removed)> wrote:
>
> m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
>
> Using that, $1 is the series of up to five words before the match, $2 is
> the match, and $3 is the series of up to five words after the match.


Note that if $keyword is supposed to be a plain string rather than a
regex, you'll need to escape metacharacters in it. An easy way to do
this is:

m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/
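
To see why the escaping matters, here is a small sketch with a keyword that contains a regex metacharacter. The sample text and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text    = 'release 105 shipped before release 1.5 was tagged';
my $keyword = '1.5';

# Without \Q...\E the "." in the keyword matches any character.
my ($as_regex) = $text =~ m/($keyword)/;

# With \Q...\E the "." only matches a literal dot.
my ($literal) = $text =~ m/(\Q$keyword\E)/;

print "unquoted keyword matched: $as_regex\n";
print "quoted keyword matched:   $literal\n";
```

\Q...\E applies the same escaping as the built-in quotemeta() function, so quotemeta($keyword) can be used when the pattern is assembled as a string.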

Also, this regex can be optimized a bit by noting that the only way $1
can contain fewer than 5 words is if the match occurs near the very
beginning of the string, with fewer than 5 words before it. Separating
that special case, we get:

m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/

This is noticeably faster if the first occurrence of $keyword isn't
near the beginning, since it saves the regex engine some needless
backtracking.
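
As a quick sanity check, the sketch below compares the captures of the plain and the split-case regex on one sample string. The sub names and the sample text are made up, and the check says nothing about speed or about inputs with leading whitespace, where the two forms can capture slightly different text.

```perl
use strict;
use warnings;

my $text = 'The quick brown fox jumps over the lazy dog and runs far away';

# Plain form: up to five words before the keyword.
sub plain {
    my ($keyword) = @_;
    return $text =~ m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/;
}

# Split form: exactly five words, or fewer only at the start of the string.
sub split_case {
    my ($keyword) = @_;
    return $text =~ m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/;
}

for my $keyword ('lazy', 'quick') {
    my $a = join '|', plain($keyword);
    my $b = join '|', split_case($keyword);
    print "$keyword: ", ($a eq $b ? 'same captures' : 'different captures'), "\n";
}
```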

Also note that, if you use global matching to extract multiple
snippets from the text, the results can be unexpected if there are
multiple occurrences of $keyword near each other. In particular, if
there are less than 5 words between two occurrences, the second one
will be swallowed in the 5 words matched after the first one.

The easiest way to fix that is to use negative look-ahead:

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

Oddly enough, optimizing this regex the same way as before doesn't
seem to help, and seems to tickle a perl bug (probably related to \G
handling?) when used in scalar context.
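
A rough sketch of the difference in list-collecting use, assuming a short sample string with two nearby occurrences. The helper collect() and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Collect [before, keyword, after] triples for every /g match of $re.
sub collect {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, [$1, $2, $3];
    }
    return @found;
}

my $keyword = 'fox';
my $text    = 'a fox b c fox d e f g h i';

# Plain pattern: one occurrence ends up inside the context of the other.
my $plain = qr/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/;

# Look-ahead pattern: the trailing context stops before the next keyword.
my $fixed = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

for my $case ([plain => $plain], [fixed => $fixed]) {
    my ($name, $re) = @$case;
    my @found = collect($text, $re);
    print "$name: ", scalar(@found), " snippet(s)\n";
    print "  [$_->[0]]($_->[1])[$_->[2]]\n" for @found;
}
```

With the look-ahead version the second snippet has no leading context, because those words were already consumed by the first match.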


Oh, and you probably want case-insensitive matching, and should
probably allow punctuation around $keyword, something like:

m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i

or (optimized):

m/((?:\w+\W+){5}|^\W*(?:\w+\W+){0,4})(\Q$keyword\E)((?:\W+\w+){0,5})/i

or for global matching:

m/((?:\w+\W+){0,5}?)(\Q$keyword\E)((?:\W+(?!\Q$keyword\E)\w+){0,5})/ig
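
A small sketch of what the \w/\W variant changes when the keyword is wrapped in punctuation. The sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text    = 'Well, "Perl" rocks!';
my $keyword = 'perl';

# Whitespace-delimited words: the quotes glue themselves to the keyword,
# so no surrounding context is picked up.
my @by_space = $text =~ m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/i;

# \w/\W-delimited words: punctuation counts as a separator.
my @by_word = $text =~ m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i;

print 'by whitespace: [', join('][', @by_space), "]\n";
print 'by word chars: [', join('][', @by_word),  "]\n";
```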

--
Ilmari Karonen
If replying by e-mail, please replace ".invalid" with ".net" in address.
 
 
Brian McCauley
Guest
Posts: n/a
 
      07-21-2004
Ilmari Karonen <(E-Mail Removed)> writes:

> On 2004-07-20, Paul Lalli <(E-Mail Removed)> wrote:
> >
> > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
> >
> > Using that, $1 is the series of up to five words before the match, $2 is
> > the match, and $3 is the series of up to five words after the match.

>
> Note that if $keyword is supposed to be a plain string rather than a
> regex, you'll need to escape metacharacters in it. An easy way to do
> this is:
>
> m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/


> Also note that, if you use global matching to extract multiple
> snippets from the text, the results can be unexpected if there are
> multiple occurrences of $keyword near each other. In particular, if
> there are less than 5 words between two occurrences, the second one
> will be swallowed in the 5 words matched after the first one.
>
> The easiest way to fix that is to use negative look-ahead:
>
> m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g


Er, no, it would be easier and more idiomatic to put the third capture
inside a lookahead.

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g


--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
 
Ben Morrow
Guest
Posts: n/a
 
      07-21-2004

Quoth (E-Mail Removed) (Kishore):
> Paul Lalli <(E-Mail Removed)> wrote in message
> > how about something like:
> >
> > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

>
> What is the colon (:) for? I don't believe I saw this in the books I have
> been referring to so far.


The construction is (?: ... ), to be contrasted with ( ... ); it modifies
the parens so that they just group without capturing. See perldoc
perlre or perldoc perlretut.
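
A short sketch of the difference, using a made-up sample string. In list context a match returns one value per capturing group, so the count shows which parens capture.

```perl
use strict;
use warnings;

my $text = 'one two three';

# Inner ( ... ) captures too, so two values come back:
# the whole run of words and the last repetition of the inner group.
my @capturing = $text =~ m/((\S+\s+){2})/;

# Inner (?: ... ) only groups, so just the outer capture comes back.
my @non_capturing = $text =~ m/((?:\S+\s+){2})/;

print scalar(@capturing),     " value(s) with plain parens\n";
print scalar(@non_capturing), " value(s) with (?: ... )\n";
```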

[as a side note, I would *always* use /x on a regex with (? in, just
because things get lost:

/( (?: \S+\s+ ){0,5} ) ($keyword) ( (?: \s+\S+ ){0,5} )/x

]

Ben

--
"If a book is worth reading when you are six, * (E-Mail Removed)
it is worth reading when you are sixty." - C.S.Lewis
 
 
Ilmari Karonen
Guest
Posts: n/a
 
      07-22-2004
On 2004-07-21, Brian McCauley <(E-Mail Removed)> wrote:
> Ilmari Karonen <(E-Mail Removed)> writes:
>>
>> Also note that, if you use global matching to extract multiple
>> snippets from the text, the results can be unexpected if there are
>> multiple occurrences of $keyword near each other. In particular, if
>> there are less than 5 words between two occurrences, the second one
>> will be swallowed in the 5 words matched after the first one.
>>
>> The easiest way to fix that is to use negative look-ahead:
>>
>> m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

>
> Er, no, it would be easier and more idiomatic to put the third capture
> inside a lookahead.
>
> m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g


Those two don't do the same thing. With your version the snippets may
overlap, with mine they can't. Deciding which solution is better is
really up to the OP.
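
A sketch that runs both variants over the same made-up sample string, to show where the snippets overlap. The helper collect() and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Collect [before, keyword, after] triples for every /g match of $re.
sub collect {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, [$1, $2, $3];
    }
    return @found;
}

my $keyword = 'fox';
my $text    = 'a fox b c fox d e f g h i';

# Negative look-ahead inside the trailing context: snippets stay disjoint.
my $disjoint = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

# Trailing context captured inside a look-ahead: it is not consumed,
# so the next match may reuse the same words.
my $overlapping = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/;

for my $case ([disjoint => $disjoint], [overlapping => $overlapping]) {
    my ($name, $re) = @$case;
    print "$name:\n";
    print "  [$_->[0]]($_->[1])[$_->[2]]\n" for collect($text, $re);
}
```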

--
Ilmari Karonen
If replying by e-mail, please replace ".invalid" with ".net" in address.
 
 
Kishore
Guest
Posts: n/a
 
      07-22-2004
Ilmari Karonen <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> On 2004-07-20, Paul Lalli <(E-Mail Removed)> wrote:
>
> Oh, and you probably want case-insensitive matching, and should
> probably allow punctuation around $keyword, something like:
>
> m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i
>


I was having problems with punctuation.
This code solved the problem.
Thanks very much.
 