Regular Expression

 
 
fritz-bayer@web.de
09-07-2007
Hi,

I'm looking for a regular expression that will find a certain word
in a text and replace it if and only if it does not appear inside an
HTML link or inside a tag, for example as an attribute or tag name.

So, for example, the word should not be matched or replaced in the following text:

<a href='/index.html'>WORD TO MATCH</a> ....
<image alt='WORD TO MATCH' src='../image.gif'> ..

but it should be replaced in the following:

<body><h1>WORD TO MATCH</h1>...

I guess I would have to use a lookahead or some other lookaround
construct to achieve this. I have tried, but could not come up with
anything that does the job.

Can some pro help me out?

Fritz

 
Klaus
09-07-2007
On Sep 7, 2:28 pm, fritz-bayer@web.de wrote:
> I'm looking for a regular expression that will find a certain word
> in a text and replace it if and only if it does not appear inside an
> HTML link or inside a tag


see Perlfaq 4 - How do I find matching/nesting anything?

==================================
This isn't something that can be done in one regular expression, no
matter how complicated. To find something between two single
characters, a pattern like /x([^x]*)x/ will get the intervening bits
in $1. For multiple ones, then something more like /alpha(.*?)omega/
would be needed. But none of these deals with nested patterns. For
balanced expressions using (, {, [ or < as delimiters, use the CPAN
module Regexp::Common, or see (??{ code }) in the perlre manpage. For
other cases, you'll have to write a parser.

If you are serious about writing a parser, there are a number of
modules or oddities that will make your life a lot easier. There are
the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced;
and the byacc program. Starting from perl 5.8 the Text::Balanced is
part of the standard distribution.

One simple destructive, inside-out approach that you might try is to
pull out the smallest nesting parts one at a time:

while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    # do something with $1
}

A more complicated and sneaky approach is to make Perl's regular
expression engine do it for you. This is courtesy Dean Inada, and
rather has the nature of an Obfuscated Perl Contest entry, but it
really does work:

# $_ contains the string to parse
# BEGIN and END are the opening and closing markers for the
# nested text.

@( = ('(','');
@) = (')','');
($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
@$ = (eval{/$re/},$@!~/unmatched/i);
print join("\n",@$[0..$#$]) if( $$[-1] );
==================================

--
Klaus
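
To see what the FAQ's inside-out loop does in practice, here is a minimal sketch. The helper name peel_nested and the sample string are made up for illustration. The /g modifier is dropped (my assumption, not the FAQ's wording) so that each pass removes exactly one pair and $1 can be collected every time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the FAQ): repeatedly remove the innermost
# BEGIN...END pair and collect the text that sat between the markers.
sub peel_nested {
    my ($text) = @_;
    my @parts;
    # The two negative lookaheads stop the body from running across another
    # BEGIN or END, so only an innermost pair can match.
    # No /g here: each pass removes one pair, and $1 is that pair's body.
    while ($text =~ s/BEGIN((?:(?!BEGIN)(?!END).)*)END//s) {
        push @parts, $1;
    }
    return (\@parts, $text);
}

my ($parts, $rest) = peel_nested('a BEGIN b BEGIN c END d END e');
print "[$_]\n" for @$parts;
print "left over: [$rest]\n";
```

On the sample string the inner " c " is collected first, then the outer " b  d ", and "a  e" is left over. Note that this handles nested markers only; it says nothing yet about HTML.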

 
Benoit Lefebvre
09-07-2007
On Sep 7, 8:28 am, fritz-bayer@web.de wrote:
> [ snip original question ]

I'm sure there is a WAY BETTER way to do this.

But here is a solution that seems to work.

----------------8<--------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

my $to_replace  = "WORD";
my $replacement = "BLEH";

my @list = ("<a href='/index.html'>WORD</a> ....",
            "<image alt='WORD' src='../image.gif'> ..",
            "<body><h1>this is my WORD !</h1>... ");

foreach my $line (@list) {
    # Only look at text that sits between a '>' and the next '<',
    # i.e. outside of any tag. Link text is NOT excluded by this check.
    if ($line =~ m/>([^<]*\Q$to_replace\E[^>]*)</) {
        my $match = $1;
        $match =~ s/\Q$to_replace\E/$replacement/g;
        $line =~ s/>([^<]*\Q$to_replace\E[^>]*)</>$match</g;
    }
    print $line . "\n";
}
--------------------------------------------------------

output:
<a href='/index.html'>BLEH</a> ....
<image alt='WORD' src='../image.gif'> ..
<body><h1>this is my BLEH !</h1>...
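
The first output line shows the limitation: the text of the link is still replaced. A minimal sketch of one way around that in core Perl is to split the input into tags and text and keep a counter for open <a> elements. The helper name replace_outside_links is hypothetical, and the sketch assumes simple markup: no ">" inside attribute values, no comments, no script or style blocks. It is still a regex-based tokenizer, not a real parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace $word in text that is
# neither part of a tag nor enclosed in an <a>...</a> element.
sub replace_outside_links {
    my ($html, $word, $replacement) = @_;
    my $anchor_depth = 0;
    my $out = '';

    # The capturing group makes split() return the tags as well as the
    # text between them, so the input can be reassembled unchanged.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /^</) {
            # A tag: never edited, but track whether we are inside a link.
            if    ($piece =~ m{^<a[\s>]}i) { $anchor_depth++ }
            elsif ($piece =~ m{^</a\s*>}i) { $anchor_depth-- if $anchor_depth > 0 }
        }
        elsif ($anchor_depth == 0) {
            # Plain text outside any link: safe to replace.
            $piece =~ s/\Q$word\E/$replacement/g;
        }
        $out .= $piece;
    }
    return $out;
}

print replace_outside_links($_, 'WORD', 'BLEH'), "\n"
    for "<a href='/index.html'>WORD</a> ....",
        "<image alt='WORD' src='../image.gif'> ..",
        "<body><h1>this is my WORD !</h1>...";
```

With the same three input lines, only the <h1> text changes; the link text and the alt attribute are left alone.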

 
fritz-bayer@web.de
09-07-2007
On 7 Sep., 17:41, Klaus wrote:
> see Perlfaq 4 - How do I find matching/nesting anything?
>
> [ snip contents of Perlfaq 4 ]

Well, I wouldn't know whether it's possible, but positive and negative
lookaheads seem worth considering. The following shows how:

http://frank.vanpuffelen.net/2007/04...xpression.html

 
Klaus
09-07-2007
On Sep 7, 4:51 pm, fritz-bayer@web.de wrote:
> Well, I would know if it's possible, but positive and negative
> lookaheads seem to be something to consider. The following shows how:
>
> http://frank.vanpuffelen.net/2007/04...xpression.html


The document claims:
" [...] apparently there aren't many good HTML parsers available
for .NET [...] "

That might be true for .NET, but as far as Perl is concerned, there
are many HTML parsers available on CPAN, and HTML::Parser looks
perfect for the job (although I have to admit that I haven't yet
tested it myself):

http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

========================================
Here is an extract from the HTML::Parser documentation:
========================================
HTML::Parser is not a generic SGML parser. We have tried to make it
able to deal with the HTML that is actually "out there", and it
normally parses as closely as possible to the way the popular web
browsers do it instead of strictly following one of the many HTML
specifications from W3C. Where there is disagreement, there is often
an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This
makes on-the-fly parsing as documents are received from the network
possible.

If event driven parsing does not feel right for your application, you
might want to use HTML::PullParser. This is an HTML::Parser subclass
that allows a more conventional program structure.
========================================

--
Klaus
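
For what it's worth, here is a rough sketch of how the event-driven interface could be applied to the original question. It assumes HTML::Parser is installed from CPAN and uses its version 3 handler API. The helper name replace_with_parser is hypothetical, and the rule "text inside <a> and inside script/style blocks is left alone" is my reading of the requirement, not something the module decides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the thread): rebuild the document from
# parser events, editing only text that is not inside an <a> element.
sub replace_with_parser {
    my ($html, $word, $replacement) = @_;
    my $anchor_depth = 0;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $anchor_depth++ if $tagname eq 'a';
            $out .= $text;                 # tags are copied verbatim
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $anchor_depth-- if $tagname eq 'a' && $anchor_depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text, $is_cdata) = @_;
            # Skip link text and the contents of <script>/<style> blocks.
            $text =~ s/\Q$word\E/$replacement/g
                if $anchor_depth == 0 && !$is_cdata;
            $out .= $text;
        }, 'text, is_cdata' ],
        # Comments, declarations and everything else pass through untouched.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $parser->unbroken_text(1);   # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print replace_with_parser($_, 'WORD TO MATCH', 'REPLACED'), "\n"
    for "<a href='/index.html'>WORD TO MATCH</a> ....",
        "<image alt='WORD TO MATCH' src='../image.gif'> ..",
        "<body><h1>WORD TO MATCH</h1>...";
```

Because every event's original text is appended to the output, the markup itself comes back unchanged; only the text handler ever edits anything.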

 
fritz-bayer@web.de
09-07-2007
On 7 Sep., 18:28, Klaus wrote:
> [ snip ]
>
> That might be true for .NET, but as far as Perl is concerned, there
> are many HTML parsers available on CPAN, and HTML::Parser looks
> perfect for the job (although I would have to admit that I haven't yet
> tested it myself) :
>
> http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
>
> [ snip extract from the HTML::Parser documentation ]

I'm looking for a regular expression that is platform independent
and works in Java, Perl, or .NET.

 
Ben Morrow
09-07-2007

Quoth fritz-bayer@web.de:
>
> I'm looking for a regular expression, [to parse HTML] which is
> platform independent and works for Java, Perl or .NET.


<sigh> Here we go again. Clpmisc is for discussing Perl. If you want to
discuss Java or .NET their newsgroups are -->thataway.

In any case, regular expressions (and Perl5 regexps, which are not quite
the same thing) are not an appropriate tool to parse HTML with. If you
have a limited set of documents you may be able to hack up something
that works, but it will be fragile.

Now, did you have a Perl question?

Ben

 
Tad McClellan
09-07-2007
fritz-bayer@web.de wrote:

> I'm looking for a regular expression that will find a certain word
> in a text and replace it if and only if it does not appear inside an
> HTML link or inside a tag, for example as an attribute or tag name.


> Can some pro help me out?



Sure.

A regular expression is not the Right Tool for this job.

Use a real parser instead.


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
 
fritz-bayer@web.de
09-11-2007
On 8 Sep., 07:50, Joe Smith wrote:
> fritz-bayer@web.de wrote:
> > I'm looking for a regular expression that is platform independent
> > and works in Java, Perl, or .NET.

>
> I'd say you have an impossible task. The advanced parts of perl
> regular expressions that almost do what you want are not implemented
> the same way (if at all) on the other platforms.
>
> -Joe



What about finding all occurrences of a word that are not inside an
<a href> element? So if I'm looking for the word OUTSIDE, it should
match only where it is not inside a link. So the following should not match:
<a href='/somethin.html'>OUTSIDE</a>

but this should match twice!

OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE

Can somebody come up with a regular expression that does the job?

 
Tad McClellan
09-11-2007
fritz-bayer@web.de wrote:

> What about finding all occurrences of a word that are not inside an
> <a href> element? So if I'm looking for the word OUTSIDE, it should
> match only where it is not inside a link. So the following should not match:
> <a href='/somethin.html'>OUTSIDE</a>
>
> but this should match twice!
>
> OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE



So the text below should also match twice?

<!--
OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE
-->

And the text below should match once (since it does not appear in an anchor)?

<!--
<a href='/somethin.html'>OUTSIDE</a>
-->


> Can somebody come up with a regular expression that does the job?



A regular expression is not the Right Tool for this job.

Use a real parser instead.

Strip all of the anchor elements, then match against what remains.


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
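
The "strip the anchors, then match what remains" idea can be sketched as follows. To stay self-contained this uses plain substitutions rather than a real parser, so it inherits the fragility discussed in this thread. The helper name count_outside_anchors is hypothetical, and treating commented-out text as not matching is an assumption: the thread leaves that question open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count occurrences of $word
# that are outside <a>...</a> elements, outside tags, and outside comments.
sub count_outside_anchors {
    my ($html, $word) = @_;
    my $text = $html;
    $text =~ s{<!--.*?-->}{}gs;            # assumption: comments never count
    $text =~ s{<a[\s>].*?</a\s*>}{}gis;    # strip whole anchor elements
    $text =~ s{<[^>]*>}{}g;                # strip the remaining tags
    # Matching in list context and assigning through () yields the count.
    my $count = () = $text =~ /\Q$word\E/g;
    return $count;
}

print count_outside_anchors(
    "OUTSIDE <a href='/somethin.html'>SOME OTHER TEXT</a> OUTSIDE",
    'OUTSIDE'), "\n";
```

For the two examples from the question this gives 0 for the link-only line and 2 for the line with OUTSIDE on either side of the link.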
 