HTML::Parser

 
 
Zebee Johnstone
      08-26-2004
Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst, I have no idea what they are doing and why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

So I look at the code in the examples dir, and hanchors appears to be
the closest to what I want to do - which is get a set of links and their
associated text. But it appears to possibly be recursing, it's getting
things passed that appear to be hashes to the subroutines, but are
passed as strings....

I want to understand it, to work through it, so I can make my own or
modify it but can't work out what it's doing. I don't get the program
flow. I think because I don't see how it reads the files or works
out $attr->{href} (or why that's a bare word), or if start_handler's
being called once or many times. Or really what's happening at all!





#!/usr/bin/perl -w

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
    start_h     => [\&a_start_handler, "self,tagname,attr"],
    report_tags => [qw(a img)],
);


$p->parse_file(shift || die) || die $!;

sub a_start_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "a";
    return unless exists $attr->{href};
    print "A $attr->{href}\n";

    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&img_handler);
    $self->handler(end => \&a_end_handler, "self,tagname");
}

sub img_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";

    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}


Zebee

--
Zebee Johnstone ((E-Mail Removed)), proud holder of
aus.motorcycles Poser Permit #1.
"Motorcycles are like peanuts... who can stop at just one?"
 
Tassilo v. Parseval
      08-26-2004
Also sprach Zebee Johnstone:

> Are there any tutorials or explanations of HTML::Parser?
>
> I've read the perldoc and I don't understand it. It's gibberish to me.
>
> I've looked at the examples, but using them is cargo cult programming at
> its worst, I have no idea what they are doing and why.
>
> I understand I create an object. I understand I can then use this to do
> things, but as soon as it talks about handlers, it loses me.


One problem with HTML::Parser appears to be its two available
interfaces. The description of the provided methods in the perldocs
isn't always quite clear about which API version a method relates to.

Maybe

<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

will help you. It deals with the old interface (subclassing) which I
find more convenient and easier to use.

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
Zebee Johnstone
      08-27-2004
In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval <(E-Mail Removed)> wrote:
>
>
> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>
>
> will help you. It deals with the old interface (subclassing) which I
> find more convenient and easier to use.


Thanks!


Zebee
 
Zebee Johnstone
      08-27-2004
In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval <(E-Mail Removed)> wrote:
>
> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>


I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
    my($self, $origtext, $is_cdata) = @_;
    print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

Is there a way to associate the tag text with the tag, and only
use that?

So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

ideally, I'd like to call the text subroutine from the start subroutine,
and pass it a hash to put the text value in. And have it return that
hash.

It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine. In the examples, it seems to use
(text => [], '@{dtext}' ) as args to the text handler, but I've no
idea where those come from at all, or what they are, or how to use them.
I have the "$self" object, which I can pass to a subroutine but no idea
how to get the things I need from it.

Zebee
 
Eric Bohlman
      08-27-2004
Zebee Johnstone <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> Is there a way to associate the tag text with the tag, and only
> use that?


You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.
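
A minimal sketch of that pull style, assuming the token layout described in the HTML::TokeParser documentation ("S" for a start tag, "T" for text, "E" for an end tag). The sample HTML and the names @links and $current are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input; the link is only an illustration.
my $html = '<p>Try <a href="http://www.google.com"> Google </a> first.</p>';

# A reference to a scalar is treated as the literal document to parse.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @links;      # collected [url, text] pairs
my $current;    # the pair being filled while inside <a>...</a>

# Pull tokens one at a time, much like reading lines from a file.
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' and $token->[1] eq 'a' and defined $token->[2]{href}) {
        # start tag: ["S", $tag, $attr, $attrseq, $text]
        $current = [ $token->[2]{href}, '' ];
    }
    elsif ($type eq 'T' and $current) {
        # text token: ["T", $text, $is_data]
        $current->[1] .= $token->[1];
    }
    elsif ($type eq 'E' and $token->[1] eq 'a' and $current) {
        # end tag: ["E", $tag, $text]
        $current->[1] =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @links, $current;
        undef $current;
    }
}

print "$_->[0] => $_->[1]\n" for @links;
```

Because your own loop decides what happens next, the "am I inside a link?" state lives in an ordinary variable instead of inside the parser object.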
 
Zebee Johnstone
      08-27-2004
In comp.lang.perl.misc on 27 Aug 2004 04:06:58 GMT
Eric Bohlman <(E-Mail Removed)> wrote:
> You might want to try HTML::TokeParser instead (it's included with the
> HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
> rather than it calling your code in response to tags and text, you call it
> to get the next "token" which can be a start tag, text, end tag, etc. and
> then decide what to do with it. Using it is similar to reading through a
> file in a loop.



Bingo! Much easier to use and understand. Thanks.

Zebee

--
Zebee Johnstone ((E-Mail Removed)), proud holder of
aus.motorcycles Poser Permit #1.
"Motorcycles are like peanuts... who can stop at just one?"
 
Tassilo v. Parseval
      08-27-2004
Also sprach Zebee Johnstone:

> In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
> Tassilo v. Parseval <(E-Mail Removed)> wrote:
>>
>> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

>
> I understand more now about it, but your tutorial doesn't cover the
> text, which I need.
>
> If I print out all the text elements:
>
> sub text {
> my($self, $origtext, $is_cdata) = @_;
> print "text [$origtext] \n";
> }
>
> then I get the text associated with the tags I'm after, but I get a lot
> of other text as well.


More specifically, you get all the plain text elements of the HTML file.

> Is there a way to associate the tag text with the tag, and only
> use that?


Yes, by keeping track of which tag the parser is currently in.

> So a bit of HTML
> <a href="http://www.google.com"> Google </a> would have "Google"
> associated with "http://www.google.com"?
>
> ideally, I'd like to call the text subroutine from the start subroutine,
> and pass it a hash to put the text value in. And have it return that
> hash.


Those are handlers and they can't have such a return value. But you have
an object (the HTML::Parser object) in which you can store the data:

#!/usr/bin/perl -w

package MyParser;

use strict;
use base qw/HTML::Parser/;

sub start {
    my ($self, $tagname, $attr) = @_;
    if ($tagname eq 'a') {
        # store the URL as key of a new hash-ref
        # associated text not yet known, therefore undef
        push @{ $self->{a} }, { $attr->{href} => undef };
        $self->{in_a} = $attr->{href};
    }
}

sub end {
    my ($self, $tagname) = @_;
    delete $self->{in_a} if $tagname eq 'a';
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        # text is between <a> and </a>
        $self->{a}->[-1]->{ $self->{in_a} } = $text;
    }
}

package main;

use Data::Dumper;
my $html = <<EOHTML;
<html>
<body>
<a href="http://www.first.com" target="bla">First link</a>
<a href="http://www.second.com">Second link</a>
</body>
</html>
EOHTML

my $p = MyParser->new;
$p->parse($html);
$p->eof;    # signal end of document so buffered text is flushed
print Dumper $p->{a};
__END__
$VAR1 = [
    {
        'http://www.first.com' => 'First link'
    },
    {
        'http://www.second.com' => 'Second link'
    }
];

> It isn't clear to me what items the start subroutine knows about that
> it can pass to the text subroutine.


Handlers don't call each other. It's HTML::Parser's parse-routines that
call the handlers whenever they encounter a start or end tag, a text
block or a comment. Handlers are called as-soon-as-event-happens.
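
To make the calling order visible, here is a small sketch using the version 3 interface, where each handler is registered as a [code, argspec] pair. The sample HTML and the @events log are made up for illustration; the argspec names (tagname, dtext) are the documented ones:

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<p>See <a href="http://www.first.com">First link</a>.</p>';

my @events;    # hypothetical log, filled in the order the parser calls us

# The argspec string tells the parser which values to pass to the code
# whenever that kind of event occurs.
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start $_[0]"  }, 'tagname' ],
    text_h  => [ sub { push @events, "text [$_[0]]" }, 'dtext'   ],
    end_h   => [ sub { push @events, "end $_[0]"    }, 'tagname' ],
);

$p->parse($html);    # the handlers run during this call, one per event
$p->eof;             # flush anything still buffered

print "$_\n" for @events;
```

None of the three subroutines knows about the others; the only thing they share is whatever they write into common storage, here the @events array.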

> In the examples, it seems to use (text => [], '@{dtext}' ) as args to
> the text handler, but I've no idea where those come from at all, or
> what they are, or how to use them. I have the "$self" object, which I
> can pass to a subroutine but no idea how to get the things I need from
> it.


This $self object is the object you create with 'HTML::Parser->new'. By
default it doesn't contain useful information. It holds the state of the
parser. But, as shown above, you can abuse it as a cheap way of keeping
your own state. All I did was inject two new member variables into
the object: $self->{in_a}, which holds the URL while the parser is inside
an <a> tag; otherwise this field does not exist. It is deleted in the
end-handler when $tagname is 'a'.

The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded in there and pushed
onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.
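
On the (text => [], '@{dtext}') arguments from the hanchors example: as far as the HTML::Parser documentation describes it, a version 3 handler may be an array reference instead of a subroutine, and the parser then pushes the values named in the argspec string onto that array. A minimal sketch of that behaviour; the sample HTML and the name @pieces are made up for illustration:

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<a href="http://www.first.com">First <b>link</b></a>';

my $p = HTML::Parser->new(api_version => 3);

# The handler target is an array instead of a subroutine: for every
# text event the parser pushes the requested values onto it.
# 'dtext' asks for the entity-decoded text; wrapping it in @{...}
# pushes the strings themselves rather than one array ref per event.
my @pieces;
$p->handler(text => \@pieces, '@{dtext}');

$p->parse($html);
$p->eof;

my $text = join '', @pieces;
print "$text\n";
```

hanchors does the same with an anonymous array ([]) and fetches it back later with $self->handler("text"), which is why its end handler can join the collected pieces.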

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
Zebee Johnstone
      08-27-2004

Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
Tassilo v. Parseval <(E-Mail Removed)> wrote:
> my ($self, $tagname, $attr) = @_;
> if ($tagname eq 'a') {
> # store the URL as key of a new hash-ref
> # associated text not yet known, therefore undef
> push @{ $self->{a} }, { $attr->{href} => undef };


OK, given your explanation below, I think I get this.

> sub text {
> my ($self, $text) = @_;
> if (exists $self->{in_a}) {
> # text is between <a> and </a>
> $self->{a}->[-1]->{ $self->{in_a} } = $text;


Why -1? I don't understand this line at all...
>
> The second one is $self->{a}. This one is an array-ref of
> hash-references. Each new URL/text pair is recorded in there and pushed
> onto this array.
>
> When '$p->parse' returns you look at '$p->{a}' and there you have the
> data you want to extract.
>


Zebee
 
Tassilo v. Parseval
      08-27-2004
Also sprach Zebee Johnstone:

> Bear with me please, I'm still getting to grips with a lot of notation
> and ideas...
>
> If that means I need to go read something to understand, please point me
> at it!


Your question is mostly about the data-structure that is used here. So
that would make it a perldsc/perlreftut/perlref-question.

> In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
> Tassilo v. Parseval <(E-Mail Removed)> wrote:
>> my ($self, $tagname, $attr) = @_;
>> if ($tagname eq 'a') {
>> # store the URL as key of a new hash-ref
>> # associated text not yet known, therefore undef
>> push @{ $self->{a} }, { $attr->{href} => undef };

>
> OK, given your explanation below, I think I get this.
>
>> sub text {
>> my ($self, $text) = @_;
>> if (exists $self->{in_a}) {
>> # text is between <a> and </a>
>> $self->{a}->[-1]->{ $self->{in_a} } = $text;

>
> Why -1? I don't understand this line at all...


Previously I did this:

push @{ $self->{a} }, { $attr->{href} => undef };

This means: $self->{a} is an array-reference. The hash-reference

{ $attr->{href} => undef }

is pushed onto this array-ref which means it is now the last element.

However, the hash-ref is incomplete. The value associated with the key
$attr->{href} is undef because we can't yet know the text enclosed in
<a> and </a>. But later we will (namely in the text() handler).

Once text() is called, it checks whether we are between <a> and </a>. If
we are, we finally have the text portion we wanted. We know that the
incomplete hash-reference is the last element in @{ $self->{a} }, so we
can reach it with:

$self->{a}->[-1]

which is our previously created hash-reference. Only the value is
updated. The key was stored in $self->{in_a}:

$self->{a}->[-1]->{ $self->{in_a} } = $text;
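
The same two steps can be tried without any parser at all. This is only a sketch in plain Perl: $self here is an ordinary hash standing in for the parser object, and the URL and text are sample values:

```perl
use strict;
use warnings;

# Stand-in for the parser object; only the data structure matters here.
my $self = {};

# What start() does on <a href="http://www.first.com">:
my $href = 'http://www.first.com';
push @{ $self->{a} }, { $href => undef };    # new, incomplete record
$self->{in_a} = $href;                       # remember the key for later

# What text() does when "First link" arrives:
my $last = $self->{a}->[-1];                 # index -1 is the last element
$last->{ $self->{in_a} } = 'First link';     # fill in the missing value

print scalar(@{ $self->{a} }), " record(s)\n";
print "$href => $self->{a}->[0]->{$href}\n";
```

Since only one record has been pushed, index 0 and index -1 refer to the same hash; after a second push, -1 would refer to the newer one.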

I admit that the data-structure I used is not ideal. If you are sure
that the URLs defined in <a> tags are unique, you can do away with the
array-ref altogether:

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'a') {
        $self->{in_a} = $attr->{href};
    }
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        $self->{a}->{ $self->{in_a} } = $text;
        delete $self->{in_a};
    }
}

As I just realized, we don't need the end-handler: we can delete
$self->{in_a} in text() instead.
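
One caveat: deleting the flag in text() assumes each link produces exactly one text event. With nested markup such as <a>First <b>link</b></a> the text arrives in pieces, so a variant that keeps the end-handler and appends may be safer. A sketch under that assumption; the package name MyLinkParser and the sample HTML are hypothetical:

```perl
use strict;
use warnings;

package MyLinkParser;    # hypothetical name, not from the thread
use base qw/HTML::Parser/;

sub start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'a' and defined $attr->{href};
    $self->{in_a} = $attr->{href};
    $self->{a}->{ $attr->{href} } = '';    # start with empty text
}

sub text {
    my ($self, $text) = @_;
    # append, because one link may produce several text events
    $self->{a}->{ $self->{in_a} } .= $text if exists $self->{in_a};
}

sub end {
    my ($self, $tag) = @_;
    delete $self->{in_a} if $tag eq 'a';
}

package main;

my $p = MyLinkParser->new;
$p->parse('<a href="http://www.first.com">First <b>link</b></a> and more');
$p->eof;

my %links = %{ $p->{a} };
print "$_ => $links{$_}\n" for sort keys %links;
```

The trailing " and more" is ignored because the flag is gone by the time that text event fires.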

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
Bart Lateur
      08-27-2004
Zebee Johnstone wrote:

>Are there any tutorials or explanations of HTML::Parser?
>
>I've read the perldoc and I don't understand it. It's gibberish to me.


The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.
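
For the links-and-text task from this thread, plain HTML::TokeParser also offers two convenience methods, get_tag and get_trimmed_text, that shorten the loop considerably. A minimal sketch (it does not use HTML::TokeParser::Simple; the sample HTML and the @links array are made up for illustration):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOHTML';
<a href="http://www.first.com" target="bla">First link</a>
<a href="http://www.second.com">Second link</a>
EOHTML

my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @links;
# get_tag("a") skips ahead to the next <a ...> start tag and
# returns undef once the document is exhausted.
while (my $tag = $p->get_tag('a')) {
    my $attr = $tag->[1];                     # hash ref of attributes
    next unless defined $attr->{href};
    my $text = $p->get_trimmed_text('/a');    # text up to the closing </a>
    push @links, [ $attr->{href}, $text ];
}

print "$_->[0] => $_->[1]\n" for @links;
```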

--
Bart.
 