Velocity Reviews


Zebee Johnstone 08-26-2004 05:00 AM

HTML::Parser
 
Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst: I have no idea what they are doing or why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

So I look at the code in the examples dir, and hanchors appears to be
the closest to what I want to do - which is get a set of links and their
associated text. But it appears to be recursing, and the subroutines are
being passed things that appear to be hashes but are passed as
strings....

I want to understand it, to work through it, so I can make my own or
modify it but can't work out what it's doing. I don't get the program
flow. I think that's because I don't see how it reads the files or works
out $attr->{href} (or why that's a bare word), or whether start_handler
is being called once or many times. Or really what's happening at all!





#!/usr/bin/perl -w

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
    start_h     => [\&a_start_handler, "self,tagname,attr"],
    report_tags => [qw(a img)],
);


$p->parse_file(shift || die) || die $!;

sub a_start_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "a";
    return unless exists $attr->{href};
    print "A $attr->{href}\n";

    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&img_handler);
    $self->handler(end => \&a_end_handler, "self,tagname");
}

sub img_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";

    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}


Zebee

--
Zebee Johnstone (zebee@zip.com.au), proud holder of
aus.motorcycles Poser Permit #1.
"Motorcycles are like peanuts... who can stop at just one?"

Tassilo v. Parseval 08-26-2004 05:46 AM

Re: HTML::Parser
 
Also sprach Zebee Johnstone:

> Are there any tutorials or explanations of HTML::Parser?
>
> I've read the perldoc and I don't understand it. It's gibberish to me.
>
> I've looked at the examples, but using them is cargo cult programming at
> its worst: I have no idea what they are doing or why.
>
> I understand I create an object. I understand I can then use this to do
> things, but as soon as it talks about handlers, it loses me.


One problem with HTML::Parser appears to be its two available
interfaces. The description of the provided methods in the perldocs
isn't always quite clear about which API version a method relates to.

Maybe

<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

will help you. It deals with the old interface (subclassing) which I
find more convenient and easier to use.

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval

Zebee Johnstone 08-27-2004 01:38 AM

Re: HTML::Parser
 
In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote:
>
>
> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>
>
> will help you. It deals with the old interface (subclassing) which I
> find more convenient and easier to use.


Thanks!


Zebee

Zebee Johnstone 08-27-2004 02:47 AM

Re: HTML::Parser
 
In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote:
>
> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>


I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
    my($self, $origtext, $is_cdata) = @_;
    print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

Is there a way to associate the tag text with the tag, and only
use that?

So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

Ideally, I'd like to call the text subroutine from the start subroutine,
pass it a hash to put the text value in, and have it return that
hash.

It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine. In the examples, it seems to use
(text => [], '@{dtext}' ) as args to the text handler, but I've no
idea where those come from at all, or what they are, or how to use them.
I have the "$self" object, which I can pass to a subroutine but no idea
how to get the things I need from it.

Zebee

Eric Bohlman 08-27-2004 04:06 AM

Re: HTML::Parser
 
Zebee Johnstone <zebee@zip.com.au> wrote in
news:slrncit7rb.2nn.zebee@zeus.zipworld.com.au:

> Is there a way to associate the tag text with the tag, and only
> use that?


You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.
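
For the link-and-text task in this thread, such a loop might look roughly like the following. It is only a minimal sketch, assuming the HTML is already in a string; the sample markup and the @links variable are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input, invented for this sketch.
my $html = <<'EOHTML';
<html><body>
<a href="http://www.first.com" target="bla">First link</a>
<p>Some unrelated text.</p>
<a href="http://www.second.com"> Second link </a>
</body></html>
EOHTML

# Passing a scalar reference makes HTML::TokeParser parse the string
# itself rather than treat it as a file name.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @links;    # [url, text] pairs, in document order
while (my $tag = $p->get_tag('a')) {
    # For a start tag, get_tag returns [tagname, attr-hash, ...].
    my $href = $tag->[1]{href};
    next unless defined $href;      # skip anchors without an href

    # Read the text up to the closing </a>, with whitespace tidied.
    my $text = $p->get_trimmed_text('/a');
    push @links, [$href, $text];
}

print "$_->[0] => $_->[1]\n" for @links;
```

The unrelated paragraph never shows up, because you only ask for text right after you have found an <a> tag.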

Zebee Johnstone 08-27-2004 05:27 AM

Re: HTML::Parser
 
In comp.lang.perl.misc on 27 Aug 2004 04:06:58 GMT
Eric Bohlman <ebohlman@omsdev.com> wrote:
> You might want to try HTML::TokeParser instead (it's included with the
> HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
> rather than it calling your code in response to tags and text, you call it
> to get the next "token" which can be a start tag, text, end tag, etc. and
> then decide what to do with it. Using it is similar to reading through a
> file in a loop.



Bingo! Much easier to use and understand. Thanks.

Zebee


Tassilo v. Parseval 08-27-2004 05:41 AM

Re: HTML::Parser
 
Also sprach Zebee Johnstone:

> In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
> Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote:
>>
>> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

>
> I understand more now about it, but your tutorial doesn't cover the
> text, which I need.
>
> If I print out all the text elements:
>
> sub text {
> my($self, $origtext, $is_cdata) = @_;
> print "text [$origtext] \n";
> }
>
> then I get the text associated with the tags I'm after, but I get a lot
> of other text as well.


More specifically, you get all the plain text elements of the HTML file.

> Is there a way to associate the tag text with the tag, and only
> use that?


Yes, by keeping track of which tag the parser is currently in.

> So a bit of HTML
> <a href="http://www.google.com"> Google </a> would have "Google"
> associated with "http://www.google.com"?
>
> ideally, I'd like to call the text subroutine from the start subroutine,
> and pass it a hash to put the text value in. And have it return that
> hash.


Those are handlers and they can't have such a return value. But you have
an object (the HTML::Parser object) in which you can store the data:

#!/usr/bin/perl -w

package MyParser;

use strict;
use base qw/HTML::Parser/;

sub start {
    my ($self, $tagname, $attr) = @_;
    if ($tagname eq 'a') {
        # store the URL as key of a new hash-ref
        # associated text not yet known, therefore undef
        push @{ $self->{a} }, { $attr->{href} => undef };
        $self->{in_a} = $attr->{ href };
    }
}

sub end {
    my ($self, $tagname) = @_;
    delete $self->{in_a} if $tagname eq 'a';
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        # text is between <a> and </a>
        $self->{a}->[-1]->{ $self->{in_a} } = $text;
    }
}

package main;

use Data::Dumper;
my $html = <<EOHTML;
<html>
<body>
<a href="http://www.first.com" target="bla">First link</a>
<a href="http://www.second.com">Second link</a>
</body>
</html>
EOHTML

my $p = MyParser->new;
$p->parse($html);
$p->eof;    # signal end of input so buffered text is flushed
print Dumper $p->{a};
__END__
$VAR1 = [
          {
            'http://www.first.com' => 'First link'
          },
          {
            'http://www.second.com' => 'Second link'
          }
        ];

> It isn't clear to me what items the start subroutine knows about that
> it can pass to the text subroutine.


Handlers don't call each other. It's HTML::Parser's parse-routines that
call the handlers whenever they encounter a start or end tag, a text
block or a comment. Each handler is called as soon as its event happens.
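
To see that order of calls for yourself, something like the following can help. It is only a sketch: it uses the version 3 handler interface rather than subclassing, and the @events array and the sample HTML are invented for the demonstration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;   # hypothetical log of everything the parser reports

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,   # deliver each run of text in one piece
    # Each handler is [code-ref, argspec]; the argspec names what is passed.
    start_h => [ sub { push @events, "start <$_[0]>"  }, 'tagname' ],
    text_h  => [ sub { push @events, "text  '$_[0]'"  }, 'dtext'   ],
    end_h   => [ sub { push @events, "end   </$_[0]>" }, 'tagname' ],
);

# parse() drives everything: it calls the handlers as it meets each piece.
$p->parse('<a href="http://www.google.com"> Google </a>');
$p->eof;      # signal end of input so any buffered text is flushed

print "$_\n" for @events;
```

None of the three subs knows about the others. The only thing they share is whatever you decide to store somewhere they can all reach, here the @events array.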

> In the examples, it seems to use (text => [], '@{dtext}' ) as args to
> the text handler, but I've no idea where those come from at all, or
> what they are, or how to use them. I have the "$self" object, which I
> can pass to a subroutine but no idea how to get the things I need from
> it.


This $self object is the object you create with 'HTML::Parser->new'. By
default it holds only the parser's own state, nothing that is useful to
you directly. But, as shown above, you can abuse it as a cheap way of
keeping your own state. All I did was inject two new member variables
into the object. The first is $self->{in_a}, which holds the URL while
the parser is inside an <a> tag; otherwise this field does not exist. It
is deleted in the end-handler when $tagname is 'a'.

The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded as a hash-ref and
pushed onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.
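
As for the (text => [], '@{dtext}') arguments in the hanchors example: as far as I understand the version 3 API, when the handler is an array reference instead of a code reference, the parser calls nothing and simply pushes the requested values onto that array. Wrapping the argspec in @{...} pushes the values directly rather than as one array-ref per event, and calling handler("text") with no further arguments hands the array back. A small sketch, with @chunks and the sample HTML made up for illustration:

```perl
use strict;
use warnings;
use HTML::Parser;

my @chunks;   # hypothetical accumulator for decoded text
my $p = HTML::Parser->new(api_version => 3);

# An array-ref "handler": no sub is called, the parser just pushes the
# decoded text ("dtext") of every text event onto @chunks.
$p->handler(text => \@chunks, '@{dtext}');

$p->parse('<p>one <b>two</b> three &amp; four</p>');
$p->eof;      # signal end of input so buffered text is flushed

# handler("text") returns a reference to the accumulator array.
my $collected = $p->handler('text');
my $text = join '', @$collected;
print "$text\n";
```

That is what a_end_handler in hanchors relies on when it joins @{$self->handler("text")}.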

Tassilo

Zebee Johnstone 08-27-2004 06:54 AM

Re: HTML::Parser
 

Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote:
> my ($self, $tagname, $attr) = @_;
> if ($tagname eq 'a') {
> # store the URL as key of a new hash-ref
> # associated text not yet known, therefore undef
> push @{ $self->{a} }, { $attr->{href} => undef };


OK, given your explanation below, I think I get this.

> sub text {
> my ($self, $text) = @_;
> if (exists $self->{in_a}) {
> # text is between <a> and </a>
> $self->{a}->[-1]->{ $self->{in_a} } = $text;


Why -1? I don't understand this line at all...
>
> The second one is $self->{a}. This one is an array-ref of
> hash-references. Each new URL/text pair is recorded in there and pushed
> onto this array.
>
> When '$p->parse' returns you look at '$p->{a}' and there you have the
> data you want to extract.
>


Zebee

Tassilo v. Parseval 08-27-2004 07:39 AM

Re: HTML::Parser
 
Also sprach Zebee Johnstone:

> Bear with me please, I'm still getting to grips with a lot of notation
> and ideas...
>
> If that means I need to go read something to understand, please point me
> at it!


Your question is mostly about the data-structure that is used here. So
that would make it a perldsc/perlreftut/perlref-question.

> In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
> Tassilo v. Parseval <tassilo.von.parseval@rwth-aachen.de> wrote:
>> my ($self, $tagname, $attr) = @_;
>> if ($tagname eq 'a') {
>> # store the URL as key of a new hash-ref
>> # associated text not yet known, therefore undef
>> push @{ $self->{a} }, { $attr->{href} => undef };

>
> OK, given your explanation below, I think I get this.
>
>> sub text {
>> my ($self, $text) = @_;
>> if (exists $self->{in_a}) {
>> # text is between <a> and </a>
>> $self->{a}->[-1]->{ $self->{in_a} } = $text;

>
> Why -1? I don't understand this line at all...


Previously I did this:

push @{ $self->{a} }, { $attr->{href} => undef };

This means: $self->{a} is an array-reference. The hash-reference

{ $attr->{href} => undef }

is pushed onto this array-ref which means it is now the last element.

However, the hash-ref is incomplete. The value associated with the key
$attr->{href} is undef because we can't yet know the text enclosed in
<a> and </a>. But later we will (namely in the text() handler).

Once text() is called, it checks whether we are between <a> and </a>. If
we are, we finally have the text portion we wanted. We know that the
incomplete hash-reference is the last element in @{ $self->{a} }, so we
can reach it with index -1:

$self->{a}->[-1]

which is our previously created hash-reference. Only the value is
updated. The key was stored in $self->{in_a}:

$self->{a}->[-1]->{ $self->{in_a} } = $text;
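
The same steps, stripped of the parser, can be tried on their own. This is just an illustration: the plain hash-ref in $self stands in for the parser object, and the URL and text are made-up values.

```perl
use strict;
use warnings;

my $self = {};   # stand-in for the parser object (assumption for this demo)

# What start() does: push an incomplete hash-ref and remember the key.
my $href = 'http://www.google.com';
push @{ $self->{a} }, { $href => undef };
$self->{in_a} = $href;

# Index -1 always means "last element", however long the array is.
my $last = $self->{a}->[-1];

# What text() does: fill in the value inside that last hash-ref.
$self->{a}->[-1]->{ $self->{in_a} } = 'Google';

# $last and $self->{a}->[-1] refer to the same hash, so both see the update.
print "$_ => $last->{$_}\n" for keys %$last;
```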

I admit that the data-structure I used is not ideal. If you are sure
that the URLs defined in <a> tags are unique, you can do away with the
array-ref altogether:

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'a') {
        $self->{in_a} = $attr->{href};
    }
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        $self->{a}->{ $self->{in_a} } = $text;
        delete $self->{in_a};
    }
}

As I just realized, we don't need the end-handler: we can delete
$self->{in_a} in text() instead.
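
Put together as a complete program, the simplified version might look like this. It is a sketch under the same assumption of unique URLs; the package name LinkParser and the sample HTML are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package LinkParser;   # hypothetical name for this sketch
use base qw/HTML::Parser/;

sub start {
    my ($self, $tag, $attr) = @_;
    # Remember the URL only for <a> tags that actually carry an href.
    $self->{in_a} = $attr->{href} if $tag eq 'a' && defined $attr->{href};
}

sub text {
    my ($self, $text) = @_;
    return unless exists $self->{in_a};
    # First text seen after <a href=...>: store it under the URL,
    # then forget the URL so later text is ignored.
    my $url = delete $self->{in_a};
    $self->{a}->{$url} = $text;
}

package main;

my $p = LinkParser->new;   # no arguments: the old subclassing interface
$p->parse(<<'EOHTML');
<a href="http://www.first.com">First link</a>
<a href="http://www.second.com">Second link</a>
EOHTML
$p->eof;                   # signal end of input

for my $url (sort keys %{ $p->{a} }) {
    print "$url => $p->{a}{$url}\n";
}
```

One thing to keep in mind with this shortcut: only the first piece of text after the tag is kept, so an anchor like <a href="x"><b>bold</b> rest</a> would record just "bold", and an anchor with no text at all would pick up whatever text follows it.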

Tassilo

Bart Lateur 08-27-2004 11:48 AM

Re: HTML::Parser
 
Zebee Johnstone wrote:

>Are there any tutorials or explanations of HTML::Parser?
>
>I've read the perldoc and I don't understand it. It's gibberish to me.


The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.

--
Bart.


All times are GMT.