Q on regex of LWP::Simple data

 
 
Len Philpot
 
      03-02-2007
I've read the FAQs (unless proven otherwise!) and examples, etc. but
don't know why this doesn't work...


#!perl # use your shebang of choice, this was on Windows

use warnings;
use strict;
use LWP::Simple;

# unwrap this line
my @cachepage = \
get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');

# line in question (in @cachepage) looks like :
# <p><span id="ShortDescription">Should be quick and easy.</span></p>

foreach my $line (@cachepage)
{
if($line =~ /Should be quick/)
{
print("$line");
}
}


Instead of printing only the line that contains "Should be quick", it
prints every line. Breaking it down to a minimum, I tried :

#!perl

use warnings;
use strict;

my @a = qw(one two three four five fiver);

foreach my $line (@a)
{
if($line =~ /five/)
{
print("$line\n");
}
}

Which, of course, prints :

five
fiver

.... as expected. What's different except maybe the input data? Are the
tags throwing a wrench in things?

My apologies in advance if this is a FAQ or simple logical error. I'm
very much in learning mode with Perl these days.

Thanks!
--

---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
 
Len Philpot
      03-02-2007
On Fri, 02 Mar 2007 16:02:12 +1100, Iain Chalmers wrote:

> In article <18p2wnf2hexfv$(E-Mail Removed)>,
> Len Philpot <(E-Mail Removed)> wrote:
>
>> my @cachepage = \
>> get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
>>

>
> I don't think @cachepage contains what you think it contains...
>
> try adding:
>
> use Data::Dumper;
> print Dumper \@cachepage;
>
> after that line.


So, it's one long string now... scalar(@cachepage) == 1
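
A minimal sketch of why the array ends up with a single element. `fake_get` is a hypothetical stand-in for LWP::Simple's `get`, assumed here only to return the whole document as one scalar, so the example runs without network access:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for LWP::Simple::get: returns the whole
# document as ONE string, newlines included.
sub fake_get {
    return "<html>\n<p>Should be quick and easy.</p>\n</html>\n";
}

# Assigning a single scalar to an array gives a one-element array.
my @cachepage = fake_get();

print Dumper(\@cachepage);
print "elements: ", scalar(@cachepage), "\n";   # number of elements
print "last index: ", $#cachepage, "\n";        # index of the last element
```

Because the only element contains the search text, a foreach over the array matches once and prints the entire page.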

What's the best way to break it back up again? Maybe a pointer in the
right direction?

The get() example used a scalar instead of an array, but I wanted to
iterate through it to find a number of specific strings. Maybe I need to
come up with a regex to simply extract what I need all at once without
iterating.

Or am I looking at this wrong? My final objective, more or less, is to
retrieve a file from a website and extract two or three specific strings
from it, located via a couple of specific HTML tags and subsequently
extracted using back references, but I'm not there yet.

Perhaps I'm being dense... After all, it /has/ been a very long
DST-fix-infested day.

Thanks.
--

---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
 
Len Philpot
      03-02-2007
On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:

> In article <epdxww5gfd0l.5psw3jcl6it4$(E-Mail Removed)>,
> Len Philpot <(E-Mail Removed)> wrote:
>> Or am I looking at this wrong?

>
> Yep. LWP::Simple::get doesn't return an array of lines no matter _how_
> much you want it to.
>
> Either split the scalar you get into an array of lines yourself
>
> @cachepage=split(/\n/,$scalar_version_of_cachepage);
>
> or throw the whole scalar at an appropriate regex.


That's what I thought about after posting.
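
A minimal sketch of the split approach, assuming a short sample string in place of the real page (the variable `$page` and its content are made up for illustration):

```perl
use strict;
use warnings;

# Stand-in for the single string that get() returns (assumed content).
my $page = "<html>\n"
         . "<p><span id=\"ShortDescription\">Should be quick and easy.</span></p>\n"
         . "</html>\n";

# Break the string into lines; \r?\n also tolerates Windows line endings.
my @cachepage = split /\r?\n/, $page;

# Now a per-line test selects only the matching line.
my @hits = grep { /Should be quick/ } @cachepage;
print "$_\n" for @hits;
```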


> Unless the file you're getting is very well defined, the usual advice is
> to parse HTML using an HTML parser. Regexes are not the right tool to
> deal with arbitrary HTML (though your case might be far enough from
> "arbitrary HTML" that regexes will work for you).


At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C!), so I'll probably take an
incrementally-complex approach to parsing it. This whole exercise is for
my own use and edification, anyway.

Thanks.
--

---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
 
gf
      03-02-2007
On Mar 2, 6:15 am, Len Philpot <(E-Mail Removed)> wrote:

> At this point, I'm very low on the Perl learning cliff (oh, for the
> simplicity and clarity of C!), so I'll probably take an
> incrementally-complex approach to parsing it. This whole exercise is for
> my own use and edification, anyway.


Ok. I think you meant "curve" instead of "cliff"...

And "the simplicity and clarity of C"? Perl and C are similar in that
both let the programmer write terse and cryptic code, or very verbose
code, and still maintain speed. It's the programmer's choice and not
something enforced by the language. That said...

The problem with finding strings or data in HTML pages is the
variability of the format of the pages. HTML is unstructured and relies
on the browser to turn the data into human-readable form. For our
purposes as programmers it makes our job more difficult because we
want to grab the easiest tool to do the job and regex seems to be the
tool to handle finding data in lines that change.

The problem is that HTML allows arbitrary line breaks in the file and
the browser will gobble them then parse the page then format it for
us. Perl doesn't do that. It's doing what you told it to (usually)
and, in this case, what you told it to do is not nearly as complex as
what the browser is doing.

You can get closer to what the browser is doing by stripping all the
line-end characters from the document, then applying your regex
pattern repeatedly to the resulting single line, OR you can tell
the regex engine to ignore line-ends for you. Check out the 'm' and
's' options to regex. Combined with 'g' you should be homing in on the
data you want. Usually.
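
A minimal sketch of what the 's' and 'g' options change, assuming a made-up two-paragraph sample in which one element is wrapped across a line break (the `class="note"` markup is hypothetical):

```perl
use strict;
use warnings;

# Assumed sample: the first element's content is wrapped across two lines.
my $html = "<p><span class=\"note\">Should be quick\nand easy.</span></p>\n"
         . "<p><span class=\"note\">Bring a pen.</span></p>\n";

# Without /s, '.' does not match a newline, so the wrapped span is missed.
my @without_s = $html =~ m{<span class="note">(.*?)</span>}g;

# With /s, '.' also matches newlines; /g in list context returns every capture.
my @with_s = $html =~ m{<span class="note">(.*?)</span>}gs;

print scalar(@without_s), " match(es) without /s\n";
print scalar(@with_s),    " match(es) with /s\n";
```

The non-greedy `.*?` matters here: a greedy `.*` with /s would run from the first opening tag to the last closing tag.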

Sometimes those are still going to fail so you have to dig out the big
guns and parse the document like a browser. There's HTML::Parser and
various derived modules. Of those I like HTML::TreeBuilder. Pass it
HTML using

my $t = HTML::TreeBuilder->new_from_content(get('your url'));

and it will parse it and build a tree. It'll lock the tree and turn it
into an HTML::Element object which you can search and extract info
using the methods of that object. Of those I like the 'look_down()'
method because it's so flexible. Give it the right parameters and
it'll let you loop through the page and find whatever you want. Of
course, as always you have to tell it correctly, and that can be a
tough thing to determine, but that's a different subject for a
different time and probably a different group.
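
A minimal sketch of the look_down() approach, assuming HTML::TreeBuilder is installed from CPAN and using a made-up one-line document modelled on the span quoted at the start of the thread:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

# Assumed sample markup; the real page is far larger.
my $html = '<html><body><p><span id="ShortDescription">'
         . 'Should be quick and easy.</span></p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first element matching all criteria.
my $span = $tree->look_down( _tag => 'span', id => 'ShortDescription' );
my $text = $span ? $span->as_text : '';

print "$text\n";

$tree->delete;   # release the tree when finished
```

In list context look_down returns every matching element, which is how it can be used to loop through a page.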

Another way to attack the same problem is to use the various xpath
implementations for HTML in Perl. Search on CPAN and you'll find some.
xpath is a cool way of looking at HTML but, at least for me, it's not
as intuitive as how TreeBuilder and the parsers do it.

 
Len Philpot
      03-02-2007
On 2 Mar 2007 10:16:32 -0800, gf wrote:

> On Mar 2, 6:15 am, Len Philpot <(E-Mail Removed)> wrote:
>
>> At this point, I'm very low on the Perl learning cliff (oh, for the
>> simplicity and clarity of C!), so I'll probably take an
>> incrementally-complex approach to parsing it. This whole exercise is for
>> my own use and edification, anyway.

>
> Ok. I think you meant "curve" instead of "cliff"...
>
> And "the simplicity and clarity of C"? Perl and C are so similar as
> far as their allowing the programmer to write terse and cryptic code,
> or very verbose code, and still maintain speed. It's the programmer's
> choice and not something enforced by the language. That said...


Actually, 'cliff' was intentional, as was the C reference - A weak
attempt at humor, I guess. I'm just trying to come to terms with the
looseness that Perl allows (although doesn't require). It's purely my
preference : I like algorithmic flexibility, but with a tighter
syntactic regimen, i.e., for me TIMTOWTDI gets in the way of learning
"the best/right way to do X". However, I'm sure it's very different for
others (as is obviously the case). I really like the way C is not as
abstracted - "the machine prints through" - but once again that's my
preference. Lots of very knowledgeable people feel differently.


> The problem with finding strings or data in HTML pages is the
> variability of the format of the pages. HTML is unstructured and relies
> on the browser to turn the data into human-readable form. For our
> purposes as programmers it makes our job more difficult because we
> want to grab the easiest tool to do the job and regex seems to be the
> tool to handle finding data in lines that change.


Fortunately in this case, what I'm looking for is (AFAICT) uniquely
labeled and fairly contained. However, newlines do occur and I'll have
to deal with that.


> Sometimes those are still going to fail so you have to dig out the big
> guns and parse the document like a browser. There's HTML::Parser and
> various derived modules. Of those I like HTML::TreeBuilder. Pass it
> HTML using
>
> my $t = HTML::TreeBuilder->new_from_content(get('your url'));


Thanks for the suggestions - I'll take a look at them.
--

---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
 
Mirco Wahab
      03-02-2007
Len Philpot wrote:

> # unwrap this line
> my @cachepage = \
> get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
> # line in question (in @cachepage) looks like :
> # <p><span id="ShortDescription">Should be quick and easy.</span></p>
> foreach my $line (@cachepage)
> {
> if($line =~ /Should be quick/)
> {
> print("$line");
> }
> }
>
>
> Instead of printing only the line that contains "Should be quick", it
> prints every line.


After reading all the really good advice
given to you by others here, I'd like
to point you in the direction mentioned
by Iain.

The minimum working solution for your
question "w/appropriate regex" would
therefore be:


...
my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
my $searchstr = 'Should be quick';

if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
print "$1\n"
}
...


I read you are/have been a C programmer (as I am), so
I'd like to stress that you should *really* try
to get into the "regex metalanguage", because
knowing it would have enabled you to spit out a solution
right after learning what "LWP::Simple::get" returns.

The regex modifier /m (http://www.perl.com/doc/manual/html/pod/perlre.html)
does exactly what you need here: it lets ^ and $ match at the start and
end of each line, so the parenthesized expression (.*?$searchstr.*?) is
anchored to a single line.

The content of the (first and only) parentheses will then
be available in the pattern match variable $1.
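
A minimal sketch of the same idea that runs offline, assuming a three-line sample string in place of the fetched page. The \Q...\E quoting is an addition not in the snippet above; it keeps any regex metacharacters in the search string literal:

```perl
use strict;
use warnings;

# Assumed three-line sample standing in for the fetched page.
my $cachepage = "first line\n<p>Should be quick and easy.</p>\nlast line\n";
my $searchstr = 'Should be quick';

# /m lets ^ and $ match at every line boundary, not only at the ends of
# the string. In list context the match returns the captured text.
my ($line) = $cachepage =~ /^(.*?\Q$searchstr\E.*?)$/m;

print "$line\n" if defined $line;
```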

Regards

Mirco
 
Len Philpot
      03-02-2007
On Fri, 02 Mar 2007 20:52:15 +0100, Mirco Wahab wrote:

> The minimum working solution for your
> question "w/appropriate regex" would
> therefore be:
>
> ...
> my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
> my $searchstr = 'Should be quick';
>
> if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
> print "$1\n"
> }
> ...
>
> I read you are/have been a C programmer (as I am),


Let me clarify - I find C fascinating and have played with it off and on
over the years. I hesitate to call myself a programmer in any language,
much less C (and it's been a while since I spent any serious time with
it), but I do find it very interesting. I'm not a programmer by
profession... although in the strictest sense of the term, I /have/ been
technically paid to write a couple of programs.


> I'd like to stress the idea you should *really* try
> to get somehow into the "regex metalanguage" because


Absolutely. I'm a Solaris admin by day, so I use them here and again,
although I need to make an effort to learn it beyond just what I use on
the job.


> The content of the (first and only) parentheses will then
> be available in the pattern match variable $1.


That's what I had in mind (and have done, temporarily): to use a back
reference to grab what I need. The string I used above was a test case.
Actually I look for a specific set of tags followed by a specific HTML
ID value, which are hardwired in the regex, followed by the back
referenced payload.
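
A minimal sketch of that kind of extraction, assuming made-up sample markup. The `ShortDescription` id comes from the line quoted earlier; `LongDescription` and both payloads are hypothetical:

```perl
use strict;
use warnings;

# Assumed sample; the real page layout may differ.
my $page = qq{<p><span id="ShortDescription">Should be quick\nand easy.</span></p>\n}
         . qq{<p><span id="LongDescription">Park at the trailhead.</span></p>\n};

# Hard-wire the tag and id, capture the payload. /s lets '.' cross newlines.
my %wanted;
for my $id (qw(ShortDescription LongDescription)) {
    if ( $page =~ m{<span\s+id="\Q$id\E">(.*?)</span>}s ) {
        ( $wanted{$id} = $1 ) =~ s/\s+/ /g;   # collapse whitespace runs
    }
}

print "$_: $wanted{$_}\n" for sort keys %wanted;
```

This only holds up while the markup stays exactly this regular; extra attributes or nested tags inside the span would defeat the pattern.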

Thanks.
--

---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
 
anno4000@radom.zrz.tu-berlin.de
      03-03-2007
Len Philpot <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
>
> > In article <epdxww5gfd0l.5psw3jcl6it4$(E-Mail Removed)>,
> > Len Philpot <(E-Mail Removed)> wrote:


> At this point, I'm very low on the Perl learning cliff (oh, for the
> simplicity and clarity of C!),


As in chasing macros and typedefs through header files? As in
Duff's device?

Nah, C is a fine programming language. It is *smaller* than Perl,
in that Perl has more constructs and concepts to learn, but taken
individually, Perl's constructs and concepts are no more difficult
than C's.

Anno
 
Tad McClellan
      03-04-2007
(E-Mail Removed)-berlin.de <(E-Mail Removed)-berlin.de> wrote:
> Len Philpot <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>> On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
>>
>> > In article <epdxww5gfd0l.5psw3jcl6it4$(E-Mail Removed)>,
>> > Len Philpot <(E-Mail Removed)> wrote:

>
>> At this point, I'm very low on the Perl learning cliff (oh, for the
>> simplicity and clarity of C!),

>
> As in chasing macros and typedefs through header files? As in
> Duff's device?
>
> Nah, C is a fine programming language. It is *smaller* than Perl,
> in that Perl has more constructs and concepts to learn, but taken
> individually, Perl's constructs and concepts are no more difficult
> than C's.



Except for the concept of scalar and list context.

Did Larry borrow that concept from somewhere, or did it first
show up in Perl?


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
anno4000@radom.zrz.tu-berlin.de
      03-04-2007
Tad McClellan <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> (E-Mail Removed)-berlin.de <(E-Mail Removed)-berlin.de> wrote:
> > Len Philpot <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> >> On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
> >>
> >> > In article <epdxww5gfd0l.5psw3jcl6it4$(E-Mail Removed)>,
> >> > Len Philpot <(E-Mail Removed)> wrote:

> >
> >> At this point, I'm very low on the Perl learning cliff (oh, for the
> >> simplicity and clarity of C!),

> >
> > As in chasing macros and typedefs through header files? As in
> > Duff's device?
> >
> > Nah, C is a fine programming language. It is *smaller* than Perl,
> > in that Perl has more constructs and concepts to learn, but taken
> > individually, Perl's constructs and concepts are no more difficult
> > than C's.

>
>
> Except for the concept of scalar and list context.
>
> Did Larry borrow that concept from somewhere, or did it first
> show up in Perl?


I'm pretty sure Perl is the first major language to implement anything
similar. It's one of the few features that are original with Perl.

If anything, interpretation and propagation of context is Perl's answer
to the inflexible typing systems of other languages, but it goes far
beyond that.
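
A minimal sketch of scalar versus list context, using only core Perl; the sub name `context` is made up for illustration:

```perl
use strict;
use warnings;

my @words = qw(one two three);

my @copy    = @words;   # list context: the elements are copied
my $count   = @words;   # scalar context: the number of elements
my ($first) = @words;   # parentheses on the left force list context

# A sub can ask which context it was called in via wantarray.
sub context { return wantarray ? 'list' : 'scalar' }

my @in_list   = context();
my $in_scalar = context();

print "count=$count first=$first\n";
print "called in: $in_list[0], then $in_scalar\n";
```

The same expression, `@words`, yields different values depending on what the surrounding code expects.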

Anno
 