"negative" regex matching?
Hi,
I have a regex question. I have arbitrary text and I want to search it for a set of terms/substrings. In the simple case of one term it is easy to find the match(es) and then mark them up with HTML "span" tags. My issue is with more than one term.

Here is an example to illustrate. If I have the string:

    Sarah likes Johnny's cooking

and the single term: "john" then I can match and highlight the match resulting in:

    Sarah likes <span>John</span>ny's cooking

Now what if I have two terms: "Johnny" & "john" -- in that order? I can easily let myself end up with (in sequence):

    <apply Johnny match>
    Sarah likes <span>Johnny</span>'s cooking
    <apply john match>
    Sarah likes <span><span>John</span>ny</span>'s cooking

Ok, so what I want is to be able to search for and mark each term in the string as long as that term is not already in a "span" clause.

I've done some digging in Friedl's RegEx book but I'm not sure if I know enough to know what I am looking for?

ideas?
Re: "negative" regex matching?
On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <seven.reeds@gmail.com> wrote:
>Hi,
>
>I have a regex question. I have arbitrary text and I want to search
>it for a set of terms/substrings. In the simple case of one term
>it is easy to find the match(es) and then mark them up with HTML
>"span" tags. My issue is with more than one term.
>
>Here is an example to illustrate. If I have the string:
>
> Sarah likes Johnny's cooking
>
>and the single term: "john" then I can match and highlight the match
>resulting in:
>
> Sarah likes <span>John</span>ny's cooking
>
>Now what if I have two terms: "Johnny" & "john" -- in that order? I
>can easily let myself end up with (in sequence):
>
> <apply Johnny match>
> Sarah likes <span>Johnny</span>'s cooking
> <apply john match>
> Sarah likes <span><span>John</span>ny</span>'s cooking
>
>Ok, so what I want is to be able to search for and mark each term in
>the string as long as that term is not already in a "span" clause.
>
>I've done some digging in Friedl's RegEx book but I'm not sure if I
>know enough to know what I am looking for?
>
>ideas?

This what you are trying to do?

rxhtml.pl
-sln
----------------
use strict;
use warnings;

## globs ..
my $string = "
 <apply Johnny match>
 Sarah likes Johnny's cooking
 <apply john match>
 Sarah likes Johnny's cooking
";

## code ..

# use terms: Johnny,john
if ( getMatch( $string,'span','Johnny|john')) # add mods in term's
{
    print "Matched:\n'$string'\n\n"
}
else {
    print "No match.\n\n"
}

# use terms: King,john .. case insensitive
if ( getMatch( $string,'span','(?i)King|john'))
{
    print "Matched:\n'$string'\n\n"
}
else {
    print "No match.\n\n"
}
exit(0);

## subs ..
sub getMatch
{
    my ($tag,$terms) = @_[1,2];
    $_[0] =~ s
      {(?<!<$tag>)(.*)($terms)(?!.*</?$tag>)}
      {$1<$tag>$2</$tag>}g;
}
__END__

Matched:
'
 <apply <span>Johnny</span> match>
 Sarah likes <span>Johnny</span>'s cooking
 <apply <span>john</span> match>
 Sarah likes <span>Johnny</span>'s cooking
'

Matched:
'
 <apply <span>Johnny</span> match>
 Sarah likes <span>Johnny</span>'s coo<span>king</span>
 <apply <span>john</span> match>
 Sarah likes <span>Johnny</span>'s coo<span>king</span>
'
Re: "negative" regex matching?
On Sat, 05 Dec 2009 12:45:14 -0800, sln@netherlands.com wrote:
>On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <seven.reeds@gmail.com> wrote:
>
>>ideas?
>
>This what you are trying to do?
>

Yeah, but don't do this, it doesn't work.

-sln
Re: "negative" regex matching?
On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <seven.reeds@gmail.com> wrote:
>Hi,
>
>I have a regex question. I have arbitrary text and I want to search
>it for a set of terms/substrings. In the simple case of one term
>it is easy to find the match(es) and then mark them up with HTML
>"span" tags. My issue is with more than one term.
>
[snip]
>
>Ok, so what I want is to be able to search for and mark each term in
>the string as long as that term is not already in a "span" clause.
>
>I've done some digging in Friedl's RegEx book but I'm not sure if I
>know enough to know what I am looking for?
>
>ideas?

I posted an earlier regex that used plain look-ahead/look-behind assertions. That won't work, because look-behind has to be fixed width.

So this, friend, is a bulletproof way to do what you want. Finally, a use for the new 5.10 regex recursion, which allows for nested tags. I've thoroughly tested this code, taking into account the constraints of parsing markup with a regex (i.e. it assumes the markup is valid), but that's the compromise you are making for speed.

The regex goes along happily matching either tags (in a nested fashion) or the terms you specify, in an alternation (one or the other). If any terms are inside the tags (even nested), they are consumed along with the tags without any substitution (i.e. they are left alone). The only thing left to match is the terms themselves.

The reason the tags aren't substituted for themselves (i.e. via their capture group) is the new '\K', which excludes the tags from the matched text. Read about the new extended expressions in 'perlre' in the perldocs.

In addition to plain tags, the tag-attribute form is handled as well: <$tag></$tag> or <$tag attrib></$tag>.

Good luck!
-sln
-------------------
Output:

String =
'
 <apply john Johnny match>
 Sarah likes Johnny's cooking
 <apply john match>
 Sarah likes Johnny's cooking
 <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
 </span>
 King John
'

Terms =

Johnny|john - replaced 5
'
 <apply <span>john</span> <span>Johnny</span> match>
 Sarah likes <span>Johnny</span>'s cooking
 <apply <span>john</span> match>
 Sarah likes <span>Johnny</span>'s cooking
 <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
 </span>
 King John
'

(?i)King|john - replaced 4
'
 <apply <span>john</span> <span>Johnny</span> match>
 Sarah likes <span>Johnny</span>'s coo<span>king</span>
 <apply <span>john</span> match>
 Sarah likes <span>Johnny</span>'s coo<span>king</span>
 <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
 </span>
 <span>King</span> <span>John</span>
'
---------------------------------
use strict;
use warnings;
require 5.010_000;

## globs ..
my ($string, $result) = qq{
 <apply john Johnny match>
 Sarah likes Johnny's cooking
 <apply john match>
 Sarah likes Johnny's cooking
 <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
 </span>
 King John
};

## code ..
print "\nString = \n'$string'\n\nTerms =\n";

print "\nJohnny|john - replaced ";
#
$result = getMatch( $string, 'span', 'Johnny|john');
print "$result\n";
print "'$string'\n" if $result;

print "\n(?i)King|john - replaced ";
#
$result = getMatch( $string, 'span', '(?i)King|john'); # case insensitive
print "$result\n";
print "'$string'\n" if $result;

exit(0);

## subs ..
sub getMatch
{
    #* USES RX RECURSION '(?1)', new to 5.10
    #* Start/End tags must have this specific form:
    #*   <$tag></$tag> or <$tag attrib></$tag>
    #* --------------------------------------
    my ($tag,$terms) = @_[1,2];
    my $start = "<$tag(?:\\s+|>)"; # allow <tag> or <tag attribute>
    my $end = "</$tag>";
    my $replaced = 0;

    $_[0] =~ s
    {   # match ..
        (   # 1
            $start
            (?:
                (?:(?!$start|$end).)++   # no backtracking
              |
                (?1)                     # recurse group 1
            )*
            $end
        )
        \K   # efficient -- don't include tag data in match
      |
        (   # 2
            $terms
        )
    }
    {   # replace ..
        $replaced++, "<$tag>".$2."</$tag>" if defined $2
    }xsge;

    return $replaced;
}
__END__
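
For readers who have not met '\K' before, a minimal sketch of what it does on its own may help. The sample string and the variable names here are made up for illustration and do not come from the post above; the only assumption is Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Illustrative input, not taken from the thread.
my $text = "keep:john change:john";

# Everything matched before \K is still required for the match to succeed,
# but it is left out of the text that the substitution replaces.
(my $marked = $text) =~ s{change:\Kjohn}{<span>john</span>};

print "$marked\n";    # keep:john change:<span>john</span>
```

In getMatch the same mechanism is what lets an already-tagged block be matched and stepped over while the replacement side only ever rewrites a bare term.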
Re: "negative" regex matching?
> s{(Johnny|john)}
>  {<span>$1</span>}gi;
>

Hi Ted, this was perfect. I was way over-thinking this. Thanks
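
The quoted one-liner avoids nested spans because all terms are tried in a single pass, so each character of the text is consumed at most once, and the longer term comes first in the alternation. A rough sketch of how that might be generalized to an arbitrary term list follows. The helper name highlight_terms and the sorting step are assumptions added for illustration, not something posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap every occurrence of any
# term in <span> tags using one pass over the text.
sub highlight_terms {
    my ($text, @terms) = @_;
    return $text unless @terms;    # an empty alternation would match everywhere

    # Longest term first, so "Johnny" is tried before "john" at each position.
    # quotemeta keeps regex metacharacters in a term from being interpreted.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @terms;

    # Case-insensitive, as in the quoted one-liner; work on a copy of the text.
    (my $marked = $text) =~ s{($alternation)}{<span>$1</span>}gi;
    return $marked;
}

print highlight_terms("Sarah likes Johnny's cooking", 'john', 'Johnny'), "\n";
# prints: Sarah likes <span>Johnny</span>'s cooking
```

Unlike the recursive getMatch version, this sketch does not look for spans that are already in the input, so it only fits text that has not been marked up yet.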