How it works?(about while loop and regex as condition)

 
 
havel.zhang
      10-06-2008
Dear Perl gurus,
I don't understand how this program works. Can you please give me a
further explanation?

the program is very simple:
+++++++++++++program++++++++++++++++++++++
open (O,"<z.html");
@l = <O>;
close(O);

foreach(@l){
if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
$html=$_;
while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}
}
};
++++++++z.html content+++++++++++++++++++++
The content of z.html is:
<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>
+++++++and output is:++++++++++++++++++++++++++++
HREF="http://10.123.111.11"
>link1

HREF="text.txt"
>text.txt

HREF="fes.iso"
>fes.iso

++++++++end+++++++++++++++++++++++++++++++++

I want to use this program to pick out the hrefs and labels such as
"link1", "text.txt", "fes.iso".
This program works well, but I can't understand the while loop with the
regex:
"$html =~ m{a\b([^>]+)(.*?)</a>}ig"
^^^^^^^^^^^^^^^^^^^^^^^
It works fine, and amazingly, each time through the loop it picks out the next "<a
href=...></a>" pattern and gets the right result. But HOW does it work? I thought it
would always pick out the first match.

Can any Perl guru give me an answer?

Thank you

Havel


 
Jürgen Exner
      10-06-2008
"havel.zhang" <(E-Mail Removed)> wrote:
[...]
>This program works well, but I can't understand the while loop with the
>regex:
> "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
> ^^^^^^^^^^^^^^^^^^^^^^^
>It works fine, and amazingly, each time through the loop it picks out the next "<a
>href=...></a>" pattern and gets the right result. But HOW does it work? I thought it
>would always pick out the first match.
>
>Can any Perl guru give me an answer?


The documentation can. See 'perldoc perlop', section 'Regexp Quote-Like
Operators', the two paragraphs beginning with
"The "/g" modifier specifies global pattern matching--that is, ..."

However, it is not surprising that you didn't find it. The whole perlop
man page is about 2000 lines long. That is way too long and complex. It
is almost impossible to find anything there or to point people to a
specific part of it. Is someone already working on breaking it down into
more manageable chunks?
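
To illustrate what those perlop paragraphs describe, here is a minimal sketch (not taken from the documentation; the variable names are made up for this example). In scalar context a /g match remembers where it stopped, and the next evaluation resumes from that offset, which pos() reports. When no further match is found, the match returns false and the position is reset, which is what ends the while loop.

```perl
use strict;
use warnings;

# Sample data based on the z.html content from the original post.
my $html = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A>';

my @seen;    # one [label, offset-after-match] pair per iteration

# In scalar context (here: the while condition), each evaluation of a /g
# match resumes at pos($html), the offset where the previous match ended.
while ($html =~ m{<a\b[^>]*>(.*?)</a>}ig) {
    push @seen, [ $1, pos($html) ];
    print "matched '$1', next search starts at offset ", pos($html), "\n";
}

# After the final, failed attempt the position is reset to undef.
print "pos after loop: ", (defined pos($html) ? pos($html) : 'undef'), "\n";
```

Without the /g modifier the position is never stored, so the same loop would match the first link over and over.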

jue
 
sln@netherlands.com
      10-06-2008
On Mon, 6 Oct 2008 02:41:30 -0700 (PDT), "havel.zhang" <(E-Mail Removed)> wrote:

>Dear Perl gurus,
>I don't understand how this program works. Can you please give me a
>further explanation?
>
>the program is very simple:
>+++++++++++++program++++++++++++++++++++++
>open (O,"<z.html");
>@l = <O>;
>close(O);
>
>foreach(@l){
> if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){

^^ might need a while here

> $html=$_;
> while($html =~ m{a\b([^>]+)(.*?)</a>}ig){

This does the same thing as the 'if' above; you could even add the '<':
m{<a\b([^>]+)(.*?)</a>}ig
The 'if ($_ =~ /...' is not needed.

> my $Guts = $1;
> my $Link = $2;
> print "$Guts\n$Link\n";
> }
> }
>};
>++++++++z.html content+++++++++++++++++++++
>The content of z.html is:
> <A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>
>+++++++and output is:++++++++++++++++++++++++++++
> HREF="http://10.123.111.11"
>>link1

> HREF="text.txt"
>>text.txt

> HREF="fes.iso"
>>fes.iso

>++++++++end+++++++++++++++++++++++++++++++++
>
>I want to use this program to pick out the hrefs and labels such as
>"link1", "text.txt", "fes.iso".
>This program works well, but I can't understand the while loop with the
>regex:
> "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
> ^^^^^^^^^^^^^^^^^^^^^^^

In scalar context, the 'g' modifier makes each match continue from where the
previous one ended, until the end of the string is reached and the match fails.

The problem is that the first 'if' regex will only match the first occurrence.
It does the same as the inner match, except only once. So why do you need the
outer 'if'?
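
A small sketch of why the inner loop still starts at the first link even though the outer 'if' already ran a /g match (the sample string and variable names are invented for illustration): the match position belongs to the variable that was matched, and assigning the string to another variable copies only the text, so the copy is searched from the beginning.

```perl
use strict;
use warnings;

my $line = '<A HREF="a.txt">a</A><A HREF="b.txt">b</A>';

# A scalar-context /g match, like the outer 'if' in the quoted program.
my $outer_matched = $line =~ m{<a\b[^>]*>.*?</a>}ig;
my $pos_line = pos($line);      # offset just past the first link

# Assignment copies the string but not its match position.
my $html = $line;
my $pos_html = pos($html);      # undef: no match has run on $html yet

# So a /g loop over the copy starts from the beginning and sees every link.
my @labels;
while ($html =~ m{<a\b[^>]*>(.*?)</a>}ig) {
    push @labels, $1;
}
print "labels: @labels\n";
```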

>It works fine, and amazingly, each time through the loop it picks out the next "<a
>href=...></a>" pattern and gets the right result. But HOW does it work? I thought it
>would always pick out the first match.
>
>Can any Perl guru give me an answer?
>
>Thank you
>
>Havel
>

use strict;
use warnings;

my $str = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A><A HREF="fes.iso">fes.iso</A>';

print "Output from 'if \$str':\n---------------\n";
if ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
print "found: '$1'\n\n";
my $html = $1;
while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
{
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}
}

pos ($str) = 0;

print "\n\nOutput from 'while \$str':\n---------------\n";
while ($str =~ /(<a\b([^>]+)(.*?)<\/a>)/ig)
{
print "found: '$1'\n\n";
my $html = $1;
while ($html =~ m{a\b([^>]+)(.*?)</a>}ig)
{
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}
}

pos ($str) = 0;

print "\n\nOutput from just 'while \$html':\n---------------\n";
while ($str =~ m{<a\s*([^>]+)(.*?)</a\s*>}ig)
{
my $Guts = $1;
my $Link = $2;
print "$Guts\n$Link\n";
}

__END__


Output from 'if $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

HREF="http://10.123.111.11"
>link1



Output from 'while $str':
---------------
found: '<A HREF="http://10.123.111.11">link1</A>'

HREF="http://10.123.111.11"
>link1

found: '<A HREF="text.txt">text.txt</A>'

HREF="text.txt"
>text.txt

found: '<A HREF="fes.iso">fes.iso</A>'

HREF="fes.iso"
>fes.iso



Output from just 'while $html':
---------------
HREF="http://10.123.111.11"
>link1

HREF="text.txt"
>text.txt

HREF="fes.iso"
>fes.iso



In general it doesn't work reliably. You can run into problems if the phrase you're
looking for spans lines. Another problem is that your regex does not account for
legal whitespace.

A better regex would be: "while ( m{<a\s*([^>]+)(.*?)</a\s*>}ig ) {}"
(If other tags starting with 'a' can occur, keep the \b after the 'a', since \s*
also matches zero whitespace.)

It's always good to have delimiters surrounding what you are trying to match.
In your case that is '<a ...></a>', with the 'a' tags being the delimiters.

This will also grab non-'a' tags inside the link; nested 'a' tags, however, will not work.
Because of nesting, HTML/XML in general can't be parsed this way, by seeking the end delimiter.
But in your case it should be OK.
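
One possible way to handle a link that spans lines, as a sketch only (it keeps the regex approach, so the nesting caveat still applies): read the whole input into one string and add the /s modifier so that '.' also matches newlines. The in-memory string below is a made-up stand-in for z.html; a real file would be opened and read the same way.

```perl
use strict;
use warnings;

# Stand-in for z.html, with one link broken across lines.
my $data = qq{<A HREF="text.txt">text.txt</A><A HREF=\n"fes.iso">the fes\nimage</A>\n};

# Read from an in-memory filehandle; a real file would be opened the same way.
open my $fh, '<', \$data or die "can't open in-memory file: $!";
my $content = do { local $/; <$fh> };   # undefined $/ means: slurp everything
close $fh;

my @links;
# /s lets '.' match newlines; \s+ requires whitespace after the tag name.
while ($content =~ m{<a\s+([^>]+)>(.*?)</a\s*>}gis) {
    my ($attrs, $label) = ($1, $2);
    s/\s+/ /g for $attrs, $label;       # collapse embedded line breaks
    push @links, [ $attrs, $label ];
    print "$attrs => $label\n";
}
```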

In general, should you need to do specific parsing, you should get a parser that
captures groups of phrases, which you can then parse reliably.


==================================================
use strict;
use warnings;

use RXParse; # VERSION 2

my $p = new RXParse();
$p->setMode( 'html' => 1, 'resume_onerror'=> 1 );
my %oldh = $p->setHandlers('start' => \&starth, 'end' => \&endh);

sub starth
{
my ($obj, $el, $term, @attr) = @_;
my $buffer = lc($el);
$obj->CaptureOn( $buffer ) if ($buffer eq 'a');
}
sub endh
{
my ($obj, $el, $term) = @_;
my $buffer = lc($el);
$obj->CaptureOff( $buffer, 1 ) if ($buffer eq 'a');
}

open my $fh, '<', 'c:\temp\z.html' or die "can't open z.html: $!";
$p->parse($fh);
close $fh;

# get and parse capture buffer 'a'
# ....

# display 'a'
$p->DumpCaptureBuffs();


__END__


BUFFER: a
=====================================
index seqence
----- --------
[0] 1 <A HREF="http://10.123.111.11">link1</A>
[1] 2 <A HREF="text.txt">text.txt</A>
[2] 3 <A HREF="fes.iso">fes.iso</A>


 
Dr.Ruud
      10-06-2008
Jürgen Exner wrote:

> The whole
> perlop man page is about 2000 lines long. That is way too long and
> complex. It is almost impossible to find anything there or to point
> people to a specific part of it. Is someone already working on breaking
> it down into more manageable chunks?



You could generate something like

-------------------------
=head2 TABLE OF CONTENTS

=over 2

=item L</Operator Precedence and Associativity>

=item L</Terms and List Operators (Leftward)>

=item L</The Arrow Operator>

=item etc. etc.

=back

-------------------------

before the "=head1 DESCRIPTION" line,

and use

perldoc -oHtml perlop | lynx -stdin

to have a viewer that is easier to navigate.

Something like "info" would also be nicer than the default man view.

Or use http://perldoc.perl.org/perlop.html
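
A rough sketch of how such a table of contents could be generated (an illustration only: the subroutine name pod_toc and the sample POD text are invented here, and it assumes the sections of interest are =head2 entries, as in the example above):

```perl
use strict;
use warnings;

# Build a POD table of contents from the =head2 lines of a POD document.
# pod_toc() is a hypothetical helper written for this illustration.
sub pod_toc {
    my ($pod) = @_;
    # In list context a /g match returns every captured title at once.
    my @titles = $pod =~ /^=head2[ \t]+(.+?)[ \t]*$/mg;
    return join "\n\n",
        '=head2 TABLE OF CONTENTS',
        '=over 2',
        (map "=item L</$_>", @titles),
        '=back',
        '';
}

# Tiny stand-in for the real perlop POD source.
my $sample = join "\n",
    '=head1 DESCRIPTION',
    '',
    '=head2 Operator Precedence and Associativity',
    '',
    'Some text.',
    '',
    '=head2 The Arrow Operator',
    '',
    'More text.',
    '';

print pod_toc($sample);
```

The generated block could then be pasted in before the "=head1 DESCRIPTION" line.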

--
Affijn, Ruud

"Gewoon is een tijger."

 
havel.zhang
      10-07-2008
On Oct 6, 9:34 pm, Jürgen Exner <(E-Mail Removed)> wrote:
> "havel.zhang" <(E-Mail Removed)> wrote:
>
> [...]
>
> >This program works well, but I can't understand the while loop with the
> >regex:
> > "$html =~ m{a\b([^>]+)(.*?)</a>}ig"
> > ^^^^^^^^^^^^^^^^^^^^^^^
> >It works fine, and amazingly, each time through the loop it picks out the next "<a
> >href=...></a>" pattern and gets the right result. But HOW does it work? I thought it
> >would always pick out the first match.
> >
> >Can any Perl guru give me an answer?
>
> The documentation can. See 'perldoc perlop', section 'Regexp Quote-Like
> Operators', the two paragraphs beginning with
> "The "/g" modifier specifies global pattern matching--that is, ..."
>
> However, it is not surprising that you didn't find it. The whole perlop
> man page is about 2000 lines long. That is way too long and complex. It
> is almost impossible to find anything there or to point people to a
> specific part of it. Is someone already working on breaking it down into
> more manageable chunks?
>
> jue


Thank you, jue.
After I posted my question to the newsgroup, I found the answer in a Perl
book. That book explains how a regex with the /g modifier behaves when it is used
as the condition of a while loop, just as you pointed out above. It's so easy and
amazing.
Thank you again.

Havel
 
sln@netherlands.com
      10-07-2008
On Mon, 6 Oct 2008 02:41:30 -0700 (PDT), "havel.zhang" <(E-Mail Removed)> wrote:

>Dear Perl gurus,
>I don't understand how this program works. Can you please give me a
>further explanation?
>

I tried but you didn't listen.
The function does not work well for what you are doing.
Not at all, never will.

<snip>

>I want to use this program to pick out the hrefs and labels such as
>"link1", "text.txt", "fes.iso".


No you don't, this is not how to do it. It fails easily.

>Can any Perl guru give me an answer?


The answer was given in detail, and at great time expense.
Next time there will be no answer.

sln

 
Tim Greer
      10-07-2008
havel.zhang wrote:

> foreach(@l){
> if ($_ =~ /<a\b([^>]+)(.*?)<\/a>/ig){
> $html=$_;
> while($html =~ m{a\b([^>]+)(.*?)</a>}ig){
> my $Guts = $1;
> my $Link = $2;
> print "$Guts\n$Link\n";
> }
> }


It steps through the @l array, and for each element it checks
$_ (the default loop variable of for/foreach, so you
don't actually need to declare one).

It then checks whether $_ contains an opening HTML tag that starts with
"a", which is most likely an anchor (hyperlink), with a word
boundary \b to ensure it's not some other tag that starts with "a",
such as <applet> (just an example). It takes anything that's not the
closing bracket of the tag (>) and captures it into $1. Then it captures anything
else between that last match and the closing anchor tag (</a>, written as
<\/a>) into $2. It does this check globally and
case-insensitively. Of course, that regex doesn't make sense, and
neither does the check, to be honest, but no matter.

After the above check, which I assume is there to see whether the line
contains an anchor tag at all, it assigns $_ to the
$html variable and runs a while loop that, case-insensitively
and globally, checks for essentially the same thing it just did
above. Each time through, it assigns the first and second captures
($1 and $2, respectively) to the $Guts and $Link variables and
prints them out. The above code really isn't very good and doesn't make
much sense: it repeats things that can be done in one check, it captures
values it's never going to use, etc. It should instead use just the
one regex, and even that one is not correct. It should be

m{a\b([^>]+)>(.*?)</a>}ig

Notice the addition of ">" between ([^>]+) and (.*?). Otherwise $2 will
always start with ">" (is that what you want?). The original would also match
invalid content when checking the anchor tag, which doesn't seem like
it would do any good. If it works, great, but there is some wasted
processing and there are bugs, so you should expect the unexpected if you run it
against many HTML files.
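
A short sketch of the difference the extra ">" makes (the sample string is based on the original post; the "<" in front of the "a" is an addition of this example, as suggested earlier in the thread):

```perl
use strict;
use warnings;

my $html = '<A HREF="http://10.123.111.11">link1</A><A HREF="text.txt">text.txt</A>';

# Original pattern: nothing consumes the '>' that ends the opening tag,
# so it becomes the first character of $2.
my @before;
while ($html =~ m{<a\b([^>]+)(.*?)</a>}ig) {
    push @before, $2;
}

# With '>' between the two groups, $2 holds only the link text.
my @after;
while ($html =~ m{<a\b([^>]+)>(.*?)</a>}ig) {
    push @after, $2;
}

print "before: @before\n";   # labels still carry the leading '>'
print "after:  @after\n";    # labels only
```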
--
Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
 