can some one please explain this regex?!

 
 
Geoff Cox
      12-07-2003
Hello,

this comes from my posting re how to match more than 1 line (from
Gunnar) but would appreciate any one just explaining what is matched
as the code does not work for me. If I could learn from this I could
probably sort it out for myself ..

Thanks

Geoff

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
 
Matt Garrish
      12-07-2003

"Geoff Cox" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Hello,
>
> this comes from my posting re how to match more than 1 line (from
> Gunnar) but would appreciate any one just explaining what is matched
> as the code does not work for me. If I could learn from this I could
> probably sort it out for myself ..
>
>


To break it down piece by piece:

/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is

matches "head" (you have the /i switch on, so it will match any case)
followed by one or more whitespace characters, followed by "teacher",
followed by one or more characters (newlines included, because of the /s
switch) up to an opening <td. You then have a negated character class, so it
will match all text up to the next closing >, and then another negated
character class will match and capture anything up to the next opening <.

I imagine this might be where your problem is. None of your match patterns
allow for zero occurrences, which means that there has to be at least one
character between the <td and closing >. In other words, your pattern would
never match <td>, but only something like <td class="foo">.

Moving on, you then have two non-greedy matches (.+?). The first will match
anything up to "address" and the second will match anything up to the next
<td. The regex then repeats itself with the two negated classes: one looking
for the end of the <td> and the other capturing everything up to the next
opening <. And once again, your pattern will fail unless there is at least
one character between the <td and >.

(I removed the /x from your original posting because it just allows
whitespace and comments in your regex, which didn't help the readability of
it, in my opinion of course.)
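
For anyone who does want to keep the /x, here is a rough sketch of the same breakdown written as comments inside the pattern. The pattern and the sample HTML are the ones from this thread; the variable names $html, $name and $address are only illustrative.

```perl
use strict;
use warnings;

# Sample HTML copied from the thread.
my $html = <<'END_HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
END_HTML

my ($name, $address);
if ( $html =~ /
        Head \s+ Teacher    # "head", whitespace, "teacher" (any case, because of i)
        .+?                 # as little as possible, newlines included (because of s)
        <TD [^>]+ >         # a TD tag with at least one character before its ">"
        ( [^<]+ )           # capture 1: text up to the next "<"
        .+?                 # skip ahead, again as little as possible
        Address             # the literal label
        .+?                 # skip ahead to the next TD tag
        <TD [^>]+ >         # same TD sub-pattern as above
        ( [^<]+ )           # capture 2: text up to the next "<"
    /isx ) {
    ($name, $address) = ($1, $2);
    print "Name: $name\nAddress: $address\n";
}
```

Under /x the spaces and the comments are ignored, so this is the same pattern as the one-line version above. Note that the comments must not contain the regex delimiter.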

Matt


 
Geoff Cox
      12-07-2003
On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox
<(E-Mail Removed)> wrote:

I should have made things a bit clearer - so here is the whole code
and a sample of html which it is to work on .. can any one see why it
doesn't get the name and address info?!

Cheers

Geoff


My code is as follows but it does not work!

---------------------------
use strict;

print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(IN, "$namehtml");
open(OUT, ">>$newhtml");

my $line = <IN>;

while (defined($line=<IN>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

}

close (IN);
close (OUT);

-----------------------------

which is working on for example


<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>


Cheers

Geoff

 
Bob Walton
      12-07-2003
Geoff Cox wrote:

> On Sun, 07 Dec 2003 18:02:07 GMT, Geoff Cox
> <(E-Mail Removed)> wrote:
>
> I should have made things a bit clearer - so here is the whole code
> and a sample of html which it is to work on .. can any one see why it
> doesn't get the name and address info?!
>
> Cheers
>
> Geoff
>
>
> My code is as follows but it does not work!


-------------------------------^^^^^^^^^^^^^
A much more specific description of what your code does/doesn't do is
called for in a newsgroup posting. Please state exactly what it does
that it shouldn't do, or what it doesn't do that it should do. "Doesn't
work" is next to meaningless -- we can't read your mind.


>
> ---------------------------
> use strict;


use warnings;


>
> print ("name of html file?\n");
> my $namehtml = <STDIN>;
>
> print ("name of email list file?\n");
> my $newhtml = <STDIN>;
>
>
> open(IN, "$namehtml");
> open(OUT, ">>$newhtml");
>
> my $line = <IN>;


Since you didn't modify $/, this will read only one line. I think
that's your fundamental problem. Try:

my $line;
{local $/;$line=<IN>} #slurp the input

and see if that works better.
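
A minimal sketch of the difference, assuming a perl that supports in-memory filehandles (5.8 or later). The three-line input and the names $text, $first and $all are made up for the illustration:

```perl
use strict;
use warnings;

# Hypothetical three-line input, read through an in-memory filehandle
# so that the sketch needs no file on disk.
my $text = "line one\nline two\nline three\n";

# Default behaviour: $/ is "\n", so one scalar read returns one line.
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my $first = <$fh>;
close $fh;

# Slurp: inside the block $/ is undef, so one read returns everything.
open $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my $all;
{ local $/; $all = <$fh> }
close $fh;

print "one read without slurp: ", length($first), " characters\n";
print "one read with slurp:    ", length($all),   " characters\n";
```

Because of the local, $/ gets its old value back as soon as the block ends, so later reads in the program behave normally.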


>
> while (defined($line=<IN>)) {


Here you are reading the rest of the lines of filehandle IN, but one at
a time. You will have skipped the first line (which was read above).
If you slurp the input, you should get rid of the while loop.
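
To see the effect on the match itself, here is a small sketch that tries the pattern once per line and once against the whole text. The HTML is the sample from this thread; the in-memory filehandle and the counters $line_matches and $whole_matches are only there to keep the example self-contained.

```perl
use strict;
use warnings;

# Sample HTML copied from the thread.
my $html = <<'END_HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
END_HTML

my $pattern = qr/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is;

# Line at a time: the pattern is tried against each line separately.
my $line_matches = 0;
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
while ( defined( my $line = <$fh> ) ) {
    $line_matches++ if $line =~ $pattern;
}
close $fh;

# Whole input at once: the pattern can span the line breaks.
my $whole_matches = ( $html =~ $pattern ) ? 1 : 0;

print "lines that match on their own: $line_matches\n";
print "whole text matches: $whole_matches\n";
```

No single line holds both "Head Teacher" and "Address", so the per-line count stays at zero.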


> # if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
> # print OUT ("$1 \n");
> # }
>
> if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
> .+?
> Address.+?<TD[^>]+>([^<]+)
> /isx ) {
> print OUT ("Name: $1\nAddress: $2\n");
> }
>
> }
>
> close (IN);
> close (OUT);
>
> -----------------------------
>
> which is working on for example
>
>
> <TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
> <TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
> <TR>
> <TD align=left width="20%" colSpan=2><B>Address</B></TD>
> <TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
> London N88 5XX</TD></TR>

....


> Geoff


Yes: you read the first line of your file, and throw it away. That was
the line with Teacher etc in it. But even if you didn't do that, the
remainder of the lines are read one at a time, and no one line contains
enough stuff to match your pattern. Slurp it all, and your pattern
might match. Here is a slightly modified standalone copy/paste/execute
style copy of your program that looks like it might "work":

use strict;
use warnings;
#print ("name of html file?\n");
#my $namehtml = <STDIN>;

#print ("name of email list file?\n");
#my $newhtml = <STDIN>;


#open(IN, "$namehtml");
#open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print ("Name: $1\nAddress: $2\n");
}

#}

#close (IN);
#close (OUT);

__END__
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>

HTH.
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

 
Gunnar Hjalmarsson
      12-07-2003
Geoff Cox wrote:
> here is the whole code and a sample of html which it is to work on


And, as I suspected, the problem has nothing to do with the regex...
Read Bob's explanation carefully!

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
Gunnar Hjalmarsson
      12-07-2003
Matt Garrish wrote:
> Geoff Cox wrote:
>> this comes from my posting re how to match more than 1 line (from
>> Gunnar) but would appreciate any one just explaining what is
>> matched as the code does not work for me. If I could learn from
>> this I could probably sort it out for myself ..

>
> To break it down piece by piece:
>
> /Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is


<snip>

> I imagine this might be where your problem is. None of your match
> patterns allow for zero occurrences, which means that there has to
> be at least one character between the <td and closing >. In other
> words, your pattern would never match <td>, but only something like
> <td class="foo">.


Yeah, you are right, of course. Both the occurrences of

<TD[^>]+>

would be better written as

<TD[^>]*>

(But, as explained in other posts, that limitation was not the reason
why OP's code didn't "work".)
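
A quick way to check this is to run both versions against HTML whose cells are bare <TD> tags. The sketch below assumes such a variant of the sample (the original sample has attributes on every <TD>); the names $with_plus, $with_star, @plus and @star are illustrative.

```perl
use strict;
use warnings;

# Hypothetical variant of the sample HTML in which every cell
# is a bare <TD> tag without attributes.
my $html = <<'END_HTML';
<TD><B>Head Teacher</B></TD>
<TD>Fred Green</TD></TR>
<TR>
<TD><B>Address</B></TD>
<TD>Park Road, Northgate,
London N88 5XX</TD></TR>
END_HTML

my $with_plus = qr/Head\s+Teacher.+?<TD[^>]+>([^<]+).+?Address.+?<TD[^>]+>([^<]+)/is;
my $with_star = qr/Head\s+Teacher.+?<TD[^>]*>([^<]+).+?Address.+?<TD[^>]*>([^<]+)/is;

# In list context a match returns its captures, or an empty list on failure.
my @plus = $html =~ $with_plus;
my @star = $html =~ $with_star;

print "with +: ", ( @plus ? "matched" : "no match" ), "\n";
print "with *: ", ( @star ? "matched" : "no match" ), "\n";
print "Name: $star[0]\nAddress: $star[1]\n" if @star;
```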

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
Geoff Cox
      12-07-2003
On Sun, 07 Dec 2003 19:53:03 GMT, Bob Walton
<(E-Mail Removed)> wrote:

Bob,

many thanks for your thoughts - the following code gets the first set
of name/address data but stops at that point - 'afraid I haven't used
your form of slurp before and do not see how to move through the rest
of the file containing the name/address data?

Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (DATA);
close (OUT);






 
Geoff Cox
      12-07-2003
On Sun, 07 Dec 2003 21:01:48 +0100, Gunnar Hjalmarsson
<(E-Mail Removed)> wrote:

>Geoff Cox wrote:
>> here is the whole code and a sample of html which it is to work on

>
>And, as I suspected, the problem has nothing to do with the regex...
>Read Bob's explanation carefully!


Gunnar

must be almost there - I have posted my version based on Bob's code
.... but it only gets the first name/address info - not clear how I
move through the rest of the file?

by the way - your code seems to work fine minus my suggestion re the
additional < ?!

Cheers

Geoff

 
Geoff Cox
      12-07-2003
On Sun, 7 Dec 2003 14:24:23 -0500, "Matt Garrish"
<(E-Mail Removed)> wrote:

>
>"Geoff Cox" <(E-Mail Removed)> wrote in message
>news:(E-Mail Removed)...
>> Hello,
>>
>> this comes from my posting re how to match more than 1 line (from
>> Gunnar) but would appreciate any one just explaining what is matched
>> as the code does not work for me. If I could learn from this I could
>> probably sort it out for myself ..
>>
>>

>
>To break it down piece by piece:


Matt,

many thanks - will read in a minute - but you might like to look at the
following code - this works OK except that it only gets the first set
of name/address data - I do not see at the moment how to move along
the slurped input to get the other sets of name/address info ..? any
ideas?! Cheers Geoff

use strict;
use warnings;
print ("name of html file?\n");
my $namehtml = <STDIN>;

print ("name of email list file?\n");
my $newhtml = <STDIN>;


open(DATA, "$namehtml");
open(OUT, ">>$newhtml");

my $line;
{local $/;$line = <DATA>} #slurp the file

#while (defined($line=<DATA>)) {
# if ($line =~ /&nbsp;&nbsp;(.*?)<\/H6>/i) {
# print OUT ("$1 \n");
# }
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print OUT ("Name: $1\nAddress: $2\n");
}

#}

close (DATA);
close (OUT);






 
Gunnar Hjalmarsson
      12-07-2003
Geoff Cox wrote:
> Bob,
>
> many thanks for your thoughts - the following code gets the first
> set of name/address data but stops at that point - 'afraid I
> haven't used your form of slurp before and do not see how to move
> through the rest of the file containing the name/address data?


Well, you haven't told us before that there is more than one
name/address pair.

> if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)


Try to change that to

while ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
----^^^^^

> /isx ) {


and that to

/gisx ) {
-------------------^
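
Put together, the two changes might look like the sketch below. With /g in scalar context each pass of the while loop resumes where the previous match ended. The second record in the sample data is invented for illustration, and the @records array is just one way of collecting the results.

```perl
use strict;
use warnings;

# Hypothetical input with two records; the first is the sample from
# the thread, the second is invented for illustration.
my $html = <<'END_HTML';
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Green</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Park Road, Northgate,
London N88 5XX</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Jane Brown</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>1 Example Street, Exampleton</TD></TR>
END_HTML

my @records;
# Each successful match moves the search position past the record just found.
while ( $html =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
                  .+?
                  Address.+?<TD[^>]+>([^<]+)
                 /gisx ) {
    push @records, { name => $1, address => $2 };
    print "Name: $1\nAddress: $2\n";
}
```

The loop ends by itself when no further match is found.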

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 