capturing multiple patterns per line

 
 
ccc31807
Guest
Posts: n/a
 
      02-05-2010
This is a newbie question, I admit, but I don't know the answer.

Suppose I am parsing a file line by line, and I want to push to an
array all substrings on that line that match a pattern. For example,
consider the listing below. @urls SHOULD contain this:
@urls = (http://google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
Instead, it contains only the last value. Using the g modifier doesn't
help.

I know why @urls contains only the last value, but I don't know how to
get all the values.

Thanks, CC.

-------listing---------------
use strict;
use warnings;

my @urls;
while (<DATA>)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }
}

print @urls;
exit(0);

__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a href="http://google.com">Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
 
 
 
 
 
Jürgen Exner
Guest
Posts: n/a
 
      02-05-2010
ccc31807 <> wrote:
>This is a newbie question, I admit, but I don't know the answer.
>
>Suppose I am parsing a file line by line, and I want to push to an
>array all substrings on that line that match a pattern. For example,
>consider the listing below. @urls SHOULD contain this: @urls = (http://
>google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
>Instead, it contains only the last value. Using the g modifier doesn't
>help.
>
>I know why @urls contains only the last value, but I don't know how to
>get all the values.


Cannot repro your problem. The code you posted adds all three URLs to
the array and prints them in one contiguous line.

C:\tmp>t.pl
http://google.comhttp://amazon.comhttp://ebay.com

jue
 
 
 
 
 
ccc31807
Guest
Posts: n/a
 
      02-05-2010
On Feb 5, 11:30 am, Jürgen Exner <jurge...@hotmail.com> wrote:
> Cannot repro your problem. The code you posted adds all three URLs to
> the array and prints them in one contiguous line.
>
> C:\tmp>t.pl
> http://google.comhttp://amazon.comhttp://ebay.com


This is a mystery. I've run the script on both a Windows and Linux
machine with the same results. Besides, your output should also
include Yahoo, which it doesn't.

I was able to do what I wanted with the following hack. I'm not real
happy about it, but it works. Still, I'd rather know how to do it with
a RE.

CC.

---------hack---------------
while (<DATA>)
{
my @line = split /<a/;
foreach my $url (@line)
{
if (/<a.*href="([^"]+)/) { push @urls, $1; }
}
}
 
 
Willem
Guest
Posts: n/a
 
      02-05-2010
ccc31807 wrote:
) This is a newbie question, I admit, but I don't know the answer.
)
) Suppose I am parsing a file line by line, and I want to push to an
) array all substrings on that line that match a pattern. For example,
) consider the listing below. @urls SHOULD contain this: @urls = (http://
) google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
) Instead, it contains only the last value. Using the g modifier doesn't
) help.
)
) I know why @urls contains only the last value, but I don't know how to
) get all the values.

I think you don't actually know why it only contains the last value,
because there are two separate issues with your code.

) Thanks, CC.
)
) -------listing---------------
) use strict;
) use warnings;
)
) my @urls;
) while (<DATA>)
) {
) if (/<a.*href="([^"]+)/) { push @urls, $1; }
) }

First of all, the .* in there is greedy: it matches as much as it can, so
in this case it matches everything from the first <a to the last
href="..." on the line.

Second, with the /g modifier, the results will not all be put in $1; in
list context the match returns all the captures as a list.

And third, obviously, this is a lot easier in perl if you realise that it
can do a lot of set processing:

while (<DATA>)
{
push @urls, /<a.*?href="(.*?)"/gi;
}

Or even:

@urls = map { /<a.*?href="(.*?)"/gi } <DATA>;

Although that is a lot more memory hungry.
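
To make the greedy versus non-greedy difference concrete, here is a minimal sketch on a single line. The sample string and the variable names ($line, @greedy, @lazy) are made up for illustration and are not from the original listing.

```perl
use strict;
use warnings;

# One logical line holding four links (hypothetical sample data).
my $line = 'See <a href="http://google.com">Google</a>, '
         . '<a href="http://yahoo.com">Yahoo</a>, '
         . '<a href="http://amazon.com">Amazon</a>, and '
         . '<a href="http://ebay.com">Ebay</a>.';

# Greedy: .* runs to the end of the line and backtracks only as far as
# the last href=", so the single capture is the last URL.
my @greedy = $line =~ /<a.*href="([^"]+)/;

# Non-greedy with /g in list context: every capture is returned.
my @lazy = $line =~ /<a.*?href="(.*?)"/gi;

print "greedy: @greedy\n";
print "lazy:   $_\n" for @lazy;
```

The greedy form should yield only the ebay.com URL, while the non-greedy /g form should yield all four in order.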


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
 
ccc31807
Guest
Posts: n/a
 
      02-05-2010
On Feb 5, 11:58 am, Willem <wil...@stack.nl> wrote:
> while (<DATA>)
> {
>     push @urls, /<a.*?href="(.*?)"/gi;
> }


Yes, yes, yes, you are entirely right. I thought that the non-greedy
quantifier might do the trick, but (1) I didn't realize that the greedy
version would skip all the way to the last one to the detriment of my
search, and (2) I didn't carefully think through exactly where I
should use the non-greedy quantifier.

Thanks, CC.
 
 
John W. Krahn
Guest
Posts: n/a
 
      02-05-2010
ccc31807 wrote:
> On Feb 5, 11:30 am, Jürgen Exner <jurge...@hotmail.com> wrote:
>> Cannot repro your problem. The code you posted adds all three URLs to
>> the array and prints them in one contiguous line.
>>
>> C:\tmp>t.pl
>> http://google.comhttp://amazon.comhttp://ebay.com

>
> This is a mystery. I've run the script on both a Windows and Linux
> machine with the same results. Besides, your output should also
> include Yahoo, which it doesn't.
>
> I was able to do what I wanted with the following hack. I'm not real
> happy about it, but it works. Still, I'd rather know how to do it with
> a RE.
>
> ---------hack---------------
> while (<DATA>)
> {
> my @line = split /<a/;
> foreach my $url (@line)
> {
> if (/<a.*href="([^"]+)/) { push @urls, $1; }


That is short for:

if ($_ =~ /<a.*href="([^"]+)/)

So you are not using the results from split() at all and the foreach
loop is superfluous. But if you changed that to:

if ($url =~ /<a.*href="([^"]+)/)

Then it wouldn't work because "split /<a/" removes the string '<a' from
all input and the regular expression requires a match with '<a'.

> }
> }
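
As a rough sketch of what a working split-based version could look like, assuming the loop matches against the loop variable and the pattern no longer asks for '<a' (the sample line and the names $line, @pieces, $piece are hypothetical):

```perl
use strict;
use warnings;

my $line = 'x <a href="g">G</a>, <a href="y">Y</a> x';  # hypothetical sample

# split consumes the separator, so no piece contains '<a' any more.
my @pieces = split /<a/, $line;

my @urls;
for my $piece (@pieces) {
    # Match against $piece explicitly, and do not require '<a' again.
    push @urls, $1 if $piece =~ /^\s+href="([^"]+)/;
}

print join(',', @urls), "\n";
```

The piece before the first '<a' has no href and is skipped, so only the two link targets end up in @urls.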




John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway
 
 
Jürgen Exner
Guest
Posts: n/a
 
      02-05-2010
ccc31807 <> wrote:
>On Feb 5, 11:30*am, Jürgen Exner <jurge...@hotmail.com> wrote:
>> Cannot repro your problem. The code you posted adds all three URLs to
>> the array and prints them in one contiguous line.
>>
>> C:\tmp>t.pl
>> http://google.comhttp://amazon.comhttp://ebay.com

>
>This is a mystery. I've run the script on both a Windows and Linux
>machine with the same results. Besides, your output should also
>include Yahoo, which it doesn't.


After reading the other responses I realize that I was looking at the
wrong problem. You wrote "Instead, it contains only the last value. "
Running your code I saw three distinct values. Three is more than "only
the last", so obviously your claim was wrong.
You never mentioned that you were talking about the RE not
extracting/capturing all the elements from a _SINGLE(!!!)_ line.

Thank you very much for throwing red herrings around.

jue
 
 
RedGrittyBrick
Guest
Posts: n/a
 
      02-05-2010
On 05/02/2010 16:56, ccc31807 wrote:
> On Feb 5, 11:30 am, Jürgen Exner<jurge...@hotmail.com> wrote:
>> Cannot repro your problem. The code you posted adds all three URLs to
>> the array and prints them in one contiguous line.
>>
>> C:\tmp>t.pl
>> http://google.comhttp://amazon.comhttp://ebay.com

>
> This is a mystery. I've run the script on both a Windows and Linux
> machine with the same results. Besides, your output should also
> include Yahoo, which it doesn't.


That's because your DATA lines have been reformatted and split onto
several lines!
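
A small sketch of that effect, assuming the text arrives as the three wrapped physical lines seen in the posted listing. The array names (@lines, @per_line, @all) are hypothetical, and joining the lines before matching is just one possible way around the wrapping.

```perl
use strict;
use warnings;

# Hypothetical sample: the text as the news client wrapped it,
# three physical lines.
my @lines = (
    qq{My favorite sites are <a href="http://google.com">Google</a>, <a\n},
    qq{href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</\n},
    qq{a>, and <a href="http://ebay.com">Ebay</a>.\n},
);

# Line by line with the original pattern: at most one capture per
# physical line, and a tag split across two lines is never seen whole.
my @per_line;
for (@lines) {
    push @per_line, $1 if /<a.*href="([^"]+)/;
}

# Whole text at once; \s+ lets the tag span a line break.
my $text = join '', @lines;
my @all  = $text =~ /<a\s+href="([^"]+)/g;

print "per line: @per_line\n";
print "joined:   @all\n";
```

Read line by line, the Yahoo link is lost because its '<a' sits at the end of one physical line and its href at the start of the next.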

>
> I was able to do what I wanted with the following hack. I'm not real
> happy about it, but it works. Still, I'd rather know how to do it with
> a RE.


Not every job should be done with an RE.

>
> ---------hack---------------
> while (<DATA>)
> {
> my @line = split /<a/;
> foreach my $url (@line)
> {
> if (/<a.*href="([^"]+)/) { push @urls, $1; }
> }
> }


-------------8<-------------
#!/usr/bin/perl
use strict;
use warnings;
my @urls;
while (<DATA>)
{
push @urls, /<a href="([^"]+)/g;
}
print join(',',@urls), "\n";
__DATA__
xxx
x <a href="g">G</a><a href="y">Y</a> x
x <a href="a">A</a><a href="e">E</a> x
xxx
-------------8<-------------
g,y,a,e
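
For comparison, a sketch of why adding /g to the original if statement did not help. In scalar context a /g match finds one match per evaluation, so an if tests it once while a while loop keeps going until it fails. The sample line and the names @once and @each are hypothetical.

```perl
use strict;
use warnings;

my $line = 'x <a href="g">G</a><a href="y">Y</a> x';  # hypothetical sample

# if: the match is evaluated once, so only the first capture is pushed.
my @once;
if ($line =~ /<a href="([^"]+)/g) { push @once, $1; }

# Reset the match position left behind by the /g match above.
pos($line) = 0;

# while: each iteration resumes where the previous match ended.
my @each;
while ($line =~ /<a href="([^"]+)/g) { push @each, $1; }

print "if:    @once\n";
print "while: @each\n";
```

The list-context form used in the script above does the same job as the while loop in one statement.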
 
 
ccc31807
Guest
Posts: n/a
 
      02-05-2010
On Feb 5, 1:50 pm, RedGrittyBrick <RedGrittyBr...@spamweary.invalid>
wrote:
> Thats because your DATA lines have been reformatted and split onto
> several lines!


Yeah, I saw that before I posted, which is why I use '\n' to mark the
ends of the 'real' lines.


> Not every job should be done with an RE


No, but in accord with TIMTOWTDI, I wanted to see how it could be done
with an RE.

> while (<DATA>)
> {
>     push @urls, /<a href="([^"]+)/g;
> }
>
> print join(',',@urls), "\n";


I'm having fun playing with the suggestions offered, and am actually
learning in the process.

Thanks, CC.
 
 
sln@netherlands.com
Guest
Posts: n/a
 
      02-06-2010
On Fri, 5 Feb 2010 08:17:05 -0800 (PST), ccc31807 <> wrote:

>This is a newbie question, I admit, but I don't know the answer.
>
>Suppose I am parsing a file line by line, and I want to push to an
>array all substrings on that line that match a pattern. For example,
>consider the listing below. @urls SHOULD contain this: @urls = (http://
>google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
>Instead, it contains only the last value. Using the g modifier doesn't
>help.
>
>I know why @urls contains only the last value, but I don't know how to
>get all the values.
>
>Thanks, CC.
>
>-------listing---------------
>use strict;
>use warnings;
>
>my @urls;
>while (<DATA>)
>{
> if (/<a.*href="([^"]+)/) { push @urls, $1; }
>}
>
>print @urls;
>exit(0);
>
>__DATA__
><html>\n
><body>\n
><h1>My Favorite Sites</h1>\n
><p>\n
>My favorite sites are <a href="http://google.com">Google</a>, <a
>href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
>a>, and <a href="http://ebay.com">Ebay</a>.\n
></p>\n
></body>\n
></html>\n


If you want to parse with a little bit more conformity,
something like this (albeit deficient) might work better
when you come across possible gotchas.

-sln

use strict;
use warnings;

my @urls;
{
local $/;   # slurp mode: the next read returns all of DATA as one string
@urls = <DATA> =~
/<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg;

# Or, if you want to be more precise and don't mind the quotes:
#/<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg
}

print $_,"\n" for @urls;
exit(0);

__DATA__
<html>\n
<body>\n
<h1>My Favorite Sites</h1>\n
<p>\n
My favorite sites are <a asdfhref=http://google.com" href='http://gg.com' >Google</a>, <a
href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
a>, and <a href="http://ebay.com">Ebay</a>.\n
</p>\n
</body>\n
</html>\n
---------
http://gg.com
http://yahoo.com
http://amazon.com
http://ebay.com

 