regex multi-line match/replace issue

 
 
seven.reeds
 
      04-24-2006
Hi,

I'm running perl v5.8.7

I have a series of files with html tags in them. I am NOT trying to
strip the tags; I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want. I just need it to
be improved a bit and that's why I am here.

the code so far is:

use strict;
select(STDIN);
$|++;

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;
my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)
{
$tmp = $`;
#$tmp =~ s/^\s+/ /sg;
#$tmp =~ s/\s+$/ /sg;
#$tmp =~ s/\s+/ /sg;
print STDOUT ">>>$tmp<<<\n";
$text = $';
}
}

So the "while" looks to see if there is a starting "<A" tag. If there
is, then I reset the text to the portion following the initial match
("$text = $';"). Next, I look for a closing "</a>" tag and stash the
pre-match portion in "$tmp".

Ignore the commented-out lines for a second... then I print out $tmp
and "increment" the file-string past the closing A tag.

Again, this works. It is spitting out the text I expect. But now we
come to the commented-out lines.

I am trying to pretty up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.

Any ideas?

 
 
 
 
 
A. Sinan Unur
 
      04-24-2006
"seven.reeds" <(E-Mail Removed)> wrote:

> I have a series of files with html tags in them. I am NOT trying to
> strip the tags; I am, however, trying to list the "link phrases"
> associated with all of the "<a href=...>link phrase</a>" sequences in
> each file. I have a script that does what I want.


You should use an HTML parser to parse HTML.

> use strict;


use warnings;

> select(STDIN);
> $|++;


$| = 1;

> my $sep = $/;
> undef $/;
> my $text = <>;
> $/ = $sep;


Aaargh!

my $text = do { local $/; <> };

Actually, I would just use File::Slurp;

> my $tmp = "";
>
> while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
> {
> $text = $';
> if ($text =~ /<\s*\/A\s*>/is)
> {
> $tmp = $`;
> #$tmp =~ s/^\s+/ /sg;
> #$tmp =~ s/\s+$/ /sg;
> #$tmp =~ s/\s+/ /sg;
> print STDOUT ">>>$tmp<<<\n";
> $text = $';
> }
> }


....

> I am trying to pretty up the text I find by stripping off
> leading/trailing whitespace and compressing internal whitespace.
> Except that bit isn't working.


As I said, use an HTML parser to parse HTML.

Anyway, no need to reinvent the wheel. You can adapt:

http://search.cpan.org/src/GAAS/HTML...51/eg/hanchors

Sinan
--
A. Sinan Unur <(E-Mail Removed)>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc...uidelines.html

 
 
 
 
 
seven.reeds
 
      04-24-2006
Thanks.

The anchors script is largely what I am looking for.

All the best

 
 
Tad McClellan
 
      04-24-2006
seven.reeds <(E-Mail Removed)> wrote:

> I have a series of files with html tags in them. I am NOT trying to
> strip the tags



Nonetheless, the primary point in the "How do I remove HTML from a string?"
FAQ answer is: don't use regular expressions for this.


> I am however trying to list the "link phrases"
> associated with all of the "<a href=...>link phrase</a>" sequences in
> each file.



I would recommend using a module that already does that for you, such as:

http://search.cpan.org/~bdfoy/HTML-S...leLinkExtor.pm


> I have a script that does what I want.



I think it only "appears" to do what you want.

You just haven't tried it with a test case that trips it up yet.


> I just need it to
> be improved a bit



It is a dirty hack.

If proper operation is of importance, then it needs to be thrown
away and replaced with something more robust.


> and that's why I am here.



OK. So let's patch it up anyway, just as a "learning exercise".


> my $sep = $/;
> undef $/;
> my $text = <>;
> $/ = $sep;



Let Perl do the save-and-restore for you. This does the same thing:

my $text;
{
    local $/;      # a naked block creates a scope
    $text = <>;
}
# $/ has been restored to its previous value here


Or, probably even better:

my $text = do { local $/; <> };
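
To see that the do-block version really does restore $/ on its own, here is a minimal sketch. It reads from an in-memory filehandle instead of <>, and the sample data is made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a real input file.
my $data = "first line\nsecond line\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

# local $/ makes the record separator undef inside this block only,
# so the single <$fh> read returns the whole file at once.
my $text = do { local $/; <$fh> };
close $fh;

printf "read %d characters\n", length $text;
print "separator restored\n" if defined $/ && $/ eq "\n";
```

Because the change to $/ is localized, nothing has to be saved or put back by hand, and an early exit from the block cannot leave $/ undefined.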


> my $tmp = "";
>
> while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

Whitespace is not allowed between the "<" and the tag name, so your
pattern should not allow \s* there.

The m//s modifier only changes what dot matches (it lets "." match a
newline), so it is useless when your pattern contains no dot.
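
A small sketch of what /s does and does not change. The test string is made up; the point is that only dot is affected, while \s and negated character classes such as [^>] match a newline with or without the modifier:

```perl
use strict;
use warnings;

my $string = "a\nb";   # made-up sample: two letters split by a newline

my $dot_plain = $string =~ /a.b/    ? 1 : 0;  # dot refuses the newline
my $dot_s     = $string =~ /a.b/s   ? 1 : 0;  # /s lets dot match it
my $ws_plain  = $string =~ /a\sb/   ? 1 : 0;  # \s matches a newline anyway
my $neg_plain = $string =~ /a[^>]b/ ? 1 : 0;  # so does a negated class

print "$dot_plain $dot_s $ws_plain $neg_plain\n";   # prints "0 1 1 1"
```

So a tag that is wrapped across two lines is already handled by \s+ and [^>]+ in the original patterns; /s adds nothing to them.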


> {
> $text = $';
> if ($text =~ /<\s*\/A\s*>/is)



No unallowed spaces, no "s" modifier, as above.

If you choose an alternate delimiter for your m//, then you
won't have to backslash slashes:

if ($text =~ m#</A\s*>#i)


> {
> $tmp = $`;
> #$tmp =~ s/^\s+/ /sg;
> #$tmp =~ s/\s+$/ /sg;
> #$tmp =~ s/\s+/ /sg;
> print STDOUT ">>>$tmp<<<\n";
> $text = $';
> }
> }



Try your code with these:

<a name="perl" href="http://www.perl.org">Perl Mongers</a>

<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>

<!--
<a href="not_a_link.com">Don't report me as a link!</a>
-->
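
Here is a sketch of what happens with those three inputs. It joins the two patterns from the original script around one capture group (my own rearrangement, not the script as posted) so that each case can be run in isolation; regex_phrases is a made-up helper name:

```perl
use strict;
use warnings;

# The opening-tag and closing-tag patterns from the original script,
# joined around a capture. /s matters here only because this combined
# form contains a dot.
my $link_re = qr{<\s*A\s+HREF\s*=[^>]+>(.*?)<\s*/A\s*>}is;

sub regex_phrases {
    my ($html) = @_;
    my @found = $html =~ /$link_re/g;   # every captured "link phrase"
    return @found;
}

my @case1 = regex_phrases('<a name="perl" href="http://www.perl.org">Perl Mongers</a>');
my @case2 = regex_phrases('<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>');
my @case3 = regex_phrases(qq{<!--\n<a href="not_a_link.com">Don't report me as a link!</a>\n-->});

print "case 1 found ", scalar(@case1), " link(s)\n";   # 0: href is not the first attribute
print "case 2: >>>$_<<<\n" for @case2;                  # phrase starts inside the name attribute
print "case 3: >>>$_<<<\n" for @case3;                  # the commented-out link is reported
```

The first link is missed, the second comes back with part of an attribute value glued to the front, and the third is reported even though it sits inside a comment.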


> any ideas?



Start over (with a module).


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
DJ Stunks
 
      04-25-2006

Tad McClellan wrote:
> seven.reeds <(E-Mail Removed)> wrote:
>
> > I am however trying to list the "link phrases"
> > associated with all of the "<a href=...>link phrase</a>" sequences in
> > each file.

>
>
> I would recommend using a module that already does that for you, such as:
>
> http://search.cpan.org/~bdfoy/HTML-S...leLinkExtor.pm


I don't believe HTML::LinkExtor (upon which HTML::SimpleLinkExtor is
built) extracts the link text, only the link itself.
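
If the link text is what is wanted, a token-level parser can return it. The sketch below assumes HTML::TokeParser, which ships in the HTML-Parser distribution on CPAN and has not been mentioned elsewhere in this thread. The helper name link_phrases and the sample input are made up; the input reuses the three test cases posted above:

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from CPAN (HTML-Parser distribution), not core Perl

# Hypothetical helper: return the text of every <a href=...> element.
sub link_phrases {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my @phrases;
    while (my $tag = $parser->get_tag('a')) {
        next unless defined $tag->[1]{href};   # skip <a name=...> anchors
        # Text up to the closing </a>, trimmed and with runs of
        # whitespace collapsed to single spaces.
        push @phrases, $parser->get_trimmed_text('/a');
    }
    return @phrases;
}

my $html = <<'END_HTML';
<a name="perl" href="http://www.perl.org">Perl Mongers</a>
<a href="http://www.perl.org" name=">>>perl<<<">Perl   Mongers</a>
<!--
<a href="not_a_link.com">Don't report me as a link!</a>
-->
END_HTML

print ">>>$_<<<\n" for link_phrases($html);
```

The parser copes with the attribute order, the ">" inside the quoted attribute value, and the commented-out link, and get_trimmed_text also does the whitespace clean-up the original script was attempting.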

-jp

 
 
Lukas Mai
 
      04-25-2006
seven.reeds <(E-Mail Removed)> wrote:
>
> the code so far is:
>
> use strict;
> select(STDIN);


The other posters seem to have missed this.
select() changes the current _output_ filehandle. I have no idea what
you're trying to achieve by selecting STDIN.

> $|++;


$| changes the behavior of print. This line has no effect as you don't
print to STDIN.
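
For reference, a minimal sketch of how select() and $| work together. Using STDERR as the second handle is just an example, not something the original script needs:

```perl
use strict;
use warnings;

# select() returns the handle that was selected before the call.
my $previous = select(STDERR);   # print and $| now refer to STDERR
$| = 1;                          # autoflush STDERR
select($previous);               # back to the original default (STDOUT)

$| = 1;                          # autoflush STDOUT, the handle the script prints to
print "default output handle: ", select(), "\n";
```

Since the script only prints to STDOUT, which is already the default output handle, a bare "$| = 1;" with no select() at all would be enough there.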

> my $sep = $/;
> undef $/;
> my $text = <>;
> $/ = $sep;


Eww, use File::Slurp or local $/ here.

> my $tmp = "";
>
> while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

^
This /s has no effect. Why did you put it there?

> {
> $text = $';
> if ($text =~ /<\s*\/A\s*>/is)

^
This /s has no effect. Why did you put it there?
> {

[snip]

MJD's Good Advice #11924 comes to mind.

Lukas
--
fflush(stdin) is wrong, too.
 
 
Anno Siegel
 
      04-25-2006
Tad McClellan <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> seven.reeds <(E-Mail Removed)> wrote:


[good advice snipped]

> > {
> > $tmp = $`;
> > #$tmp =~ s/^\s+/ /sg;
> > #$tmp =~ s/\s+$/ /sg;
> > #$tmp =~ s/\s+/ /sg;
> > print STDOUT ">>>$tmp<<<\n";
> > $text = $';
> > }
> > }


Apart from everything else, uncommenting the commented substitutions will
change what $' contains at the end of the block. "$text = $'" should
come before any additional matches. Also, the commented s/// do not
strip leading and trailing white space but reduce them to a single blank.
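
As a learning exercise only, here is a sketch of the original loop with both points applied: $' is copied before any further match can overwrite it, and the leading and trailing substitutions replace with nothing rather than with a blank. The sub name link_phrases and the sample input are made up, and the regex approach still has all the weaknesses pointed out elsewhere in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the original loop.
sub link_phrases {
    my ($text) = @_;
    my @phrases;
    while ($text =~ /<a\s+href\s*=[^>]+>/i) {
        $text = $';                      # continue after the opening tag
        if ($text =~ m#</a\s*>#i) {
            my $tmp = $`;                # text before the closing tag
            $text = $';                  # copy $' NOW, before s/// changes it
            $tmp =~ s/^\s+//;            # strip leading whitespace
            $tmp =~ s/\s+$//;            # strip trailing whitespace
            $tmp =~ s/\s+/ /g;           # squeeze internal runs to one blank
            push @phrases, $tmp;
        }
    }
    return @phrases;
}

my $sample = qq{<p><a href="x.html">\n  Perl\n   Mongers \n</a> and <A HREF='y.html'>two</A></p>};
print ">>>$_<<<\n" for link_phrases($sample);
```

In the original ordering, "$text = $'" ran after the substitutions, so $text received whatever followed the last successful s/// inside $tmp instead of the rest of the file.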

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 