<p>(.*)</p> Doesn't Work

 
 
Howard Best
 
      06-14-2006
When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.

2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?

What is the solution?

 
 
 
 
 
Brian Wakem
 
      06-14-2006
Howard Best wrote:

> When trying to match HTML paragraphs using Perl:
>
> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?
>
> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.



Put a ? after the *


> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?



You can't have 2 number 2's!


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
 
 
 
 
 
Paul Lalli
 
      06-14-2006

Howard Best wrote:
> When trying to match HTML paragraphs using Perl:


... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Regular expressions are simply not up to the task.

>
> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?


Then you'd have to either put *all* the text into $buffer, or set up
markers as you're going through all the lines - one to find the opening
<p>, one to find the closing </p>.

Btw, what do you think the above is doing? You're saying to find the
text between <p and </p>, and to put <p> and </p> tags around it?
Since $1 starts right after <p, it still contains the > of the opening
tag, so that would produce:
<p>>This is my paragraph</p>
What is the point of such a thing?
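
To see exactly what that substitution yields, here is a minimal sketch. The one-paragraph $buffer is made up for illustration, and the slash in </p> is escaped, since the pattern as posted would not parse with / as the delimiter:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original poster's file.
my $buffer = '<p>This is my paragraph</p>';

# Work on a copy so the original string stays available for comparison.
(my $result = $buffer) =~ s/<p(.*)<\/p>/<p>$1<\/p>/g;

# $1 begins right after "<p", so it still holds the ">" of the opening tag.
print "$result\n";    # prints <p>>This is my paragraph</p>
```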

> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.


No, it matches the first <p, and the () capture EVERYTHING that it can
and still allow the pattern to succeed, because you told the pattern to
be greedy. If you want it to be non-greedy, add a ? after the * (.*?).
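
The difference is easy to see on a small string. This is only a sketch with a made-up two-paragraph $buffer; the patterns use <p> rather than <p to keep the captures readable:

```perl
use strict;
use warnings;

# Hypothetical two-paragraph input spanning several lines.
my $buffer = "<p>one</p>\n<p>two\nlines</p>\n";

# Greedy: .* runs to the last </p>, so both paragraphs land in one capture.
my @greedy = $buffer =~ /<p>(.*)<\/p>/sg;

# Non-greedy: .*? stops at the first </p> that lets the match succeed.
my @lazy = $buffer =~ /<p>(.*?)<\/p>/sg;

print scalar(@greedy), " greedy match\n";        # 1 greedy match
print scalar(@lazy),   " non-greedy matches\n";  # 2 non-greedy matches
```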

> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?


Yup. Don't do that.

>
> What is the solution?


To use a module that is made for parsing HTML, like HTML::Parser.
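
For illustration, a rough sketch of collecting paragraph text with HTML::Parser's version 3 handler interface. It assumes the module is installed from CPAN; the sample $html and the @paragraphs and $in_p variables are invented for this example, and it does not try to handle nested or unclosed <p> tags:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: inline markup and a paragraph that spans two lines.
my $html = "<p>First <b>bold</b>\nparagraph</p>\n<p class=\"x\">Second</p>\n";

my @paragraphs;    # text content of each <p> element, in document order
my $in_p = 0;      # true while the parser is inside a <p> element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tags: open a new entry when a <p> begins.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'p') { $in_p = 1; push @paragraphs, ''; }
    }, 'tagname' ],
    # Text: append decoded text to the current paragraph, if any.
    text_h => [ sub {
        my ($text) = @_;
        $paragraphs[-1] .= $text if $in_p;
    }, 'dtext' ],
    # End tags: leave paragraph mode on </p>.
    end_h => [ sub {
        my ($tag) = @_;
        $in_p = 0 if $tag eq 'p';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "[$_]\n" for @paragraphs;
```

The <b> tags inside the first paragraph are simply passed over, which is the case that defeats the [^<]* pattern.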

Paul Lalli

 
 
nsb_tsd@eml.cc
 
      06-14-2006

> When trying to match HTML paragraphs using Perl:


I was just doing the same thing..

Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
original HTML, so I'd suggest always printing the variable before you
=~ it (thanks for the tips, Bart & Gleixner!)


> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?
>


Use the match modifier s, which lets . match newlines too.

> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.


.* matching is greedy by default. Perl has no modifier to ungreedify
it; put a ? after the quantifier (.*?) instead.


> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?


To get just the para you could try other things such as
HTML::TreeBuilder. Works well, but memory hungry.

> What is the solution?


42, of course

Here's what I used in a similar situation:

print "\n ==\n\tContent of VV page: $content\n\n";
$content =~ m/navbar(.*)<\/TABLE><BR>/ism;
print "I think tbl is approx:\n $1\n";
$tbl=$1;
my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
$infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";


In the code above, rather than find the 'exact' html table, I opted for
'pseudo-semantic' (ie, unique) strings to cut the search space down.

I am looking for rows within an html table. So first I =~ out an
approximate chunk of text containing the table (without bothering about
precise start and end tags).

s is for matching .* across \n's -- note that by default it doesn't.
g matches multiple times, and the result is returned in list context.

m is for multi-line matching, not sure if s is necessary when m is
present.

 
 
Howard Best
 
      06-14-2006
Brian Wakem wrote:

> Put a ? after the *


Thanks, Brian. That did it! Here's a portion of the code that I used to
test it:

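# Read the whole file into one string so that a paragraph can span lines.
open(my $in, '<', $filename) or die "Can't open \"$filename\": $!.\n";
my $buffer = do { local $/; <$in> };
close($in);
# Non-greedy .*? with /s cuts out one <p...</p> block per pass.
# OUT is assumed to be opened elsewhere (not shown in this portion).
while ($buffer =~ s/(<p.*?<\/p>)//s)
{
    print OUT "\n*****************\n$1\n*****************\n";
}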
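
A non-destructive variant of the same idea, as a sketch only: it collects the paragraphs into a list instead of deleting them from $buffer, and it tightens <p to <p\b[^>]*> so that <pre> is not mistaken for a paragraph. The sample $buffer is made up; nested or unclosed paragraphs would still trip it up.

```perl
use strict;
use warnings;

# Hypothetical input; in the post above $buffer comes from the file.
my $buffer = "<pre>code</pre>\n<p class=\"a\">one <b>bold</b></p>\n<p>two</p>\n";

# \b keeps <p from matching <pre>; [^>]* skips any attributes.
# /g in list context returns every capture, and $buffer is left untouched.
my @paragraphs = $buffer =~ /(<p\b[^>]*>.*?<\/p>)/sg;

print "\n*****************\n$_\n*****************\n" for @paragraphs;
```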

> > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> > there's a <b>...</b>, etc. within the paragraph?

>
>
> You can't have 2 number 2's!


Sorry about that. It's that ol' senility kicking in!

 
 
Howard Best
 
      06-14-2006
Paul Lalli wrote:
> ... you should be using a module specifically designed for HTML
> parsing, like, for example, HTML::Parser.


Thanks, Paul. I'll check it out.

Howard

 
 
Tad McClellan
 
      06-14-2006
nsb_tsd@eml.cc wrote:
>
>> When trying to match HTML paragraphs using Perl:

>
> I was just doing the same thing..




> $content =~ m/navbar(.*)<\/TABLE><BR>/ism;



m//m affects the meaning of ^ and $; it is useless when
your pattern does not use those anchors.


> my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;



There is a module specifically for prying the data out of HTML tables:

use HTML::TableExtract;


> s is for matching .* across \n's



Actually, m//s makes dot match a newline (whether the dot is asterisked or not).

> g matches multiple times, and the result is returned in list context.



The "g" modifier has absolutely no connection with the context that
the m// operator is in!

It is the assignment (=) that puts the m// in list context, not
the "g" modifier.
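
A small sketch of how context and "g" interact, using a made-up $tbl string:

```perl
use strict;
use warnings;

my $tbl = '<TD>a</TD><TD>b</TD><TD>c</TD>';   # hypothetical sample row

# List context (imposed by the array assignment) with /g: all captures at once.
my @all = $tbl =~ m/<TD>(.*?)<\/TD>/ig;

# List context without /g: only the captures of the first match.
my @first = $tbl =~ m/<TD>(.*?)<\/TD>/i;

# Scalar context (the while condition) with /g: one match per evaluation,
# each resuming where the previous one ended.
my @stepwise;
while ($tbl =~ m/<TD>(.*?)<\/TD>/ig) {
    push @stepwise, $1;
}

print "@all | @first | @stepwise\n";   # prints a b c | a | a b c
```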


> m is for multi-line matching, not sure if s is necessary when m is
> present.



They do different things, so the presence of one has nothing
to do with the other.

If you want dot to match a newline use "s".

If you want ^ and $ to match "lines" rather than "strings", use "m".
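
The two modifiers can be told apart with a short sketch; the two-line $text is invented for the example:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical two-line string

# Without /s, dot stops at the newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;

# Without /m, ^ anchors only at the start of the string; with /m, at each line.
my @no_m   = $text =~ /^(\w+)/g;
my @with_m = $text =~ /^(\w+)/mg;

print "no /s: $no_s\n";                                                 # first line
print "with /s spans both lines: ", ($with_s eq $text ? "yes" : "no"), "\n";  # yes
print "no /m: @no_m\n";                                                 # first
print "with /m: @with_m\n";                                             # first second
```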


--
Tad McClellan
SGML consulting, Perl programming
Fort Worth, Texas
 