Velocity Reviews

Howard Best 06-14-2006 07:51 PM

<p>(.*)</p> Doesn't Work
 
When trying to match HTML paragraphs using Perl:

1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
paragraph is on more than one line?

2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
the first <p and the last </p> in the file.

2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
there's a <b>...</b>, etc. within the paragraph?

What is the solution?


Brian Wakem 06-14-2006 07:54 PM

Re: <p>(.*)</p> Doesn't Work
 
Howard Best wrote:

> When trying to match HTML paragraphs using Perl:
>
> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?
>
> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.



Put a ? after the *


> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?



You can't have 2 number 2's!


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png

Paul Lalli 06-14-2006 08:06 PM

Re: <p>(.*)</p> Doesn't Work
 

Howard Best wrote:
> When trying to match HTML paragraphs using Perl:


... you should be using a module specifically designed for HTML
parsing, like, for example, HTML::Parser.

Regular expressions are simply not up to the task.

>
> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?


Then you'd have to either put *all* the text into $buffer, or set up
markers as you're going through all the lines - one to find the opening
<p>, one to find the closing </p>.

Btw, what do you think the above is doing? You're saying to find the
text between <p and </p>, and to put it back between <p> and </p>.
Since $1 begins with the > that closed the original tag, that would
produce:
<p>>This is my paragraph</p>
What is the point of such a thing?

> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.


No, it matches the first <p, and the () captures EVERYTHING that it can
while still allowing the pattern to succeed, because you told the
pattern to be greedy. If you want it to be non-greedy, add a ? after
the *.
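
A minimal sketch of the greedy versus non-greedy behaviour described above. The sample string in $html is made up for illustration and is not from the original post:

```perl
use strict;
use warnings;

# Hypothetical sample input: two paragraphs, the first spanning two lines.
my $html = "<p>first\nparagraph</p>\n<p>second <b>bold</b> one</p>\n";

# Greedy: .* grabs as much as it can, so the capture runs up to the LAST </p>.
my ($greedy) = $html =~ m{<p>(.*)</p>}s;

# Non-greedy: .*? stops at the first </p> that lets the match succeed.
my @lazy = $html =~ m{<p>(.*?)</p>}sg;

print "greedy: [$greedy]\n";
print "lazy:   [$_]\n" for @lazy;
```

The greedy capture swallows both paragraphs and the tags between them, while the non-greedy version returns each paragraph body separately. Using m{...} as the delimiter avoids having to escape the / in </p>.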

> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?


Yup. Don't do that.

>
> What is the solution?


To use a module that is made for parsing HTML, like HTML::Parser.

Paul Lalli
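
For reference, a rough sketch of what the HTML::Parser approach could look like. HTML::Parser is a CPAN module and has to be installed separately; the sample input and the variable names ($in_p, @paragraphs) are assumptions made for illustration, not code from this thread:

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical sample input.
my $html = "<p>first\nparagraph</p>\n<p class=\"x\">second <b>bold</b> one</p>\n";

my @paragraphs;     # collected text of each <p> element
my $in_p = 0;       # true while the parser is inside a <p>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tag: open a new paragraph buffer when we see <p ...>.
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'p') { $in_p = 1; push @paragraphs, ''; }
    }, 'tagname' ],
    # End tag: leave paragraph mode on </p>.
    end_h => [ sub {
        my ($tag) = @_;
        $in_p = 0 if $tag eq 'p';
    }, 'tagname' ],
    # Text: append decoded text to the current paragraph, if any.
    text_h => [ sub {
        my ($text) = @_;
        $paragraphs[-1] .= $text if $in_p;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

print "[$_]\n" for @paragraphs;
```

Attributes on the <p> tag and nested tags such as <b> need no special handling here, which is where the regex attempts in the original question ran into trouble. This sketch keeps only the text; it does not try to cope with paragraphs whose closing tag is missing.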


nsb_tsd@eml.cc 06-14-2006 08:33 PM

Re: <p>(.*)</p> Doesn't Work
 

> When trying to match HTML paragraphs using Perl:


I was just doing the same thing..

Note: I'm using the output of Win32::IE::Mechanize, and it reorders the
original HTML, so I'd suggest always printing the variable before you
=~ it (thanks for the tips, Bart & Gleixner!)


> 1. $buffer=~s/<p(.*)</p>/<p>$1</p>/g; doesn't work because what if the
> paragraph is on more than one line?
>


Use the match modifier s.

> 2. $buffer=~s/<p(.*)</p>/<p>$1</p>/sg doesn't work because it matches
> the first <p and the last </p> in the file.


.* matching is greedy by default. Afaik you can ungreedify it by
putting a ? after the quantifier (.*?).


> 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> there's a <b>...</b>, etc. within the paragraph?


To get just the para you could try other things such as
HTML::TreeBuilder. Works well, but memory hungry.

> What is the solution?


42, of course ;-)

Here's what I used in a similar situation:

print "\n ==\n\tContent of VV page: $content\n\n";
$content =~ m/navbar(.*)<\/TABLE><BR>/ism;
print "I think tbl is approx:\n $1\n";
$tbl=$1;
my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;
$infostr = join "\n", @info_to_keep[1 .. $#info_to_keep];
print "Found Valid Values:\n$infostr \nSkipped Value: $info_to_keep[0]\n\n";


In the code above, rather than find the 'exact' html table, I opted for
'pseudo-semantic' (ie, unique) strings to cut the search space down.

I am looking for rows within an html table. So first I =~ out an
approximate chunk of text containing the table (without bothering about
precise start and end tags).

s is for matching .* across \n's -- note that by default it doesn't.
g matches multiple times, and the result is returned in list context.

m is for multi-line matching, not sure if s is necessary when m is
present.


Howard Best 06-14-2006 08:38 PM

Re: <p>(.*)</p> Doesn't Work
 
Brian Wakem wrote:

> Put a ? after the *


Thanks, Brian. That did it! Here's a portion of the code that I used to
test it:

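open(IN, '<', $filename) or die "Can't open \"$filename\": $!.\n";
@buffer=<IN>;
close(IN);
$buffer=join('',@buffer);   # whole file as one string
# Each pass removes the first remaining paragraph: .*? stops at the
# nearest </p>, and /s lets the dot cross newlines.
while($buffer=~s/(<p.*?<\/p>)//s)
{
    # OUT is opened in a part of the script not shown in this excerpt
    print OUT "\n*****************\n$1\n*****************\n";
}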

> > 2. $buffer=~s/<p([^<]*)</p>/<p>$1</p>/sg doesn't work because what if
> > there's a <b>...</b>, etc. within the paragraph?

>
>
> You can't have 2 number 2's!


Sorry about that. It's that ol' senility kicking in!


Howard Best 06-14-2006 08:45 PM

Re: <p>(.*)</p> Doesn't Work
 
Paul Lalli wrote:
> ... you should be using a module specifically designed for HTML
> parsing, like, for example, HTML::Parser.


Thanks, Paul. I'll check it out.

Howard


Tad McClellan 06-14-2006 10:25 PM

Re: <p>(.*)</p> Doesn't Work
 
nsb_tsd@eml.cc <nsb_tsd@eml.cc> wrote:
>
>> When trying to match HTML paragraphs using Perl:

>
> I was just doing the same thing..




> $content =~ m/navbar(.*)<\/TABLE><BR>/ism;



m//m affects the meaning of ^ and $; it is useless when
your pattern does not use those anchors.


> my @info_to_keep = $tbl =~ m/<TD>(.*?)<\/TD>/img;



There is a module specifically for prying the data out of HTML tables:

use HTML::TableExtract;


> s is for matching .* across \n's



Actually, m//s makes dot match a newline (whether the dot is asterisked or not).

> g matches multiple times, and the result is returned in list context.



The "g" modifier has absolutely no connection with the context that
the m// operator is in!

It is the assignment (=) that puts the m// in list context, not
the "g" modifier.
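
A small sketch of that distinction, using a made-up table row in $row (an assumption for illustration, not data from the thread):

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $row = "<TD>a</TD><TD>b</TD><TD>c</TD>";

# List context (array assignment) with /g: every match's capture at once.
my @cells = $row =~ m{<TD>(.*?)</TD>}g;

# List context without /g: still a list, but only the first match.
my ($first) = $row =~ m{<TD>(.*?)</TD>};

# Scalar context (while condition) with /g: one match per evaluation,
# resuming where the previous match ended.
my @seen;
while ($row =~ m{<TD>(.*?)</TD>}g) {
    push @seen, $1;
}

print "cells: @cells\n";
print "first: $first\n";
print "seen:  @seen\n";
```

The context comes from how the result is used (array assignment versus a while condition). The g modifier only changes how many matches are attempted in that context.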


> m is for multi-line matching, not sure if s is necessary when m is
> present.



They do different things, so the presence of one has nothing
to do with the other.

If you want dot to match a newline use "s".

If you want ^ and $ to match "lines" rather than "strings", use "m".
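
To make the difference concrete, here is a short sketch with a made-up three-line string ($text is an assumption for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample input: three "lines" in one string.
my $text = "one\ntwo\nthree";

# Without /s the dot stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(o.*)/;
my ($with_s) = $text =~ /(o.*)/s;

# Without /m, ^ matches only at the start of the string;
# with /m it also matches just after each embedded newline.
my @no_m   = $text =~ /^(\w+)/g;
my @with_m = $text =~ /^(\w+)/mg;

print "no /s:   [$no_s]\n";
print "with /s: [$with_s]\n";
print "no /m:   [@no_m]\n";
print "with /m: [@with_m]\n";
```

The two modifiers are independent, so a pattern that needs both behaviours can simply use /ms.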


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

