HTML::TokeParser

 
 
DVH
      10-16-2005
Hi,

I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
{"

The HTML looks like this:

=======================================

<td colspan="2">&nbsp;</td>

<td align="left" colspan="3">

<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">

My link text here

</a>

</td>

</tr>

---------------------------------------------

My script looks like this:

#!/usr/bin/perl -w

use strict;

use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

my $content = get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" ) or die $!;

my $stream = HTML::TokeParser->new( \$content ) or die $!;

my ($tag, $headline, $url);

while ( $tag = $stream->get_tag("a") ) {

    # this is line 31 in the original script
    if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {

        $url = $tag->[2]{href} || "--";

        $headline = $stream->get_trimmed_text('/a');

        print $url;

        print $headline;
    }
}

-----------------------------------------------------------

I think the problem lies in the ordering of tags, but that's as far as I've
got with working out what's wrong.


 
Stephen Hildrey
      10-16-2005
DVH wrote:
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
>
> The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> at eurofeed.pl line 31"
>
> Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')


You probably want ->[1] rather than ->[2]

Regards,
Steve
--
Stephen Hildrey
 
it_says_BALLS_on_your forehead
      10-16-2005

DVH wrote:
> Hi,
>
> I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> URL followed by the link text.
>
> The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> at eurofeed.pl line 31"
>
> Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
> {"


after searching on CPAN for HTML::TokeParser, and looking at the
$p->get_tag( @tags ) method,
it looks like:

The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:

[$tag, $attr, $attrseq, $text]
The tagname of end tags are prefixed with "/", i.e. end tag is returned
like this:

["/$tag", $text]

...so you get an array reference back. Why are you adding {class} into
your code?

 
it_says_BALLS_on_your forehead
      10-16-2005

it_says_BALLS_on_your forehead wrote:
> after searching on CPAN for HTML::TokeParser, and looking at the
> $p->get_tag( @tags ) method,
> it looks like:
>
> The tag information is returned as an array reference in the same form
> as for $p->get_token above, but the type code (first element) is
> missing. A start tag will be returned like this:
>
> [$tag, $attr, $attrseq, $text]
> The tagname of end tags are prefixed with "/", i.e. end tag is returned
> like this:
>
> ["/$tag", $text]
>
> ...so you get an array reference back. why are you adding {class} into
> your code?


ahh, my mistake...

use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift||"index.html");

while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}

...yeah, you need to look at index 1 (the attribute hash), not index 2
(the array of attribute names).
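
To make the index question concrete, here is a minimal sketch that builds, by hand, an array reference shaped like the documented start-tag return value [$tag, $attr, $attrseq, $text] and reports what sits in each slot. The values are made up from the HTML sample in the question and the helper name slot_kind is hypothetical. No parser is involved, so this only illustrates the layout, not HTML::TokeParser itself.

```perl
use strict;
use warnings;

# Hand-built stand-in for what get_tag("a") is documented to return for a
# start tag: [$tag, $attr, $attrseq, $text]. Values are illustrative only.
my $tag = [
    'a',
    { title => '', class => 'docSel-titleLink',
      href  => 'pressReleasesAction.do?reference=EPSO/05/06' },
    [ 'title', 'class', 'href' ],
    '<a title="" class="docSel-titleLink" href="pressReleasesAction.do?reference=EPSO/05/06">',
];

# Hypothetical helper: report what kind of value lives in a given slot.
sub slot_kind {
    my ($token, $index) = @_;
    return ref( $token->[$index] ) || 'plain string';
}

print "slot $_: ", slot_kind($tag, $_), "\n" for 0 .. 3;

# Index 1 is the attribute hash, so this is the lookup that works.
my $class = $tag->[1]{class};
my $href  = $tag->[1]{href};
print "class=$class href=$href\n";
```

Slot 1 is reported as HASH and slot 2 as ARRAY, which is why a hash lookup such as ->[2]{class} fails while ->[1]{class} succeeds.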

 
DVH
      10-16-2005

Stephen Hildrey wrote:
> DVH wrote:
> > I'm trying to get tokeparser to fetch a series of hyperlinks and print the
> > URL followed by the link text.
> >
> > The following script ("eurofeed.pl") gives me "Can't coerce array into hash
> > at eurofeed.pl line 31"
> >
> > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
>
> You probably want ->[1] rather than ->[2]


I did. I had thought it would be tag[2] because I was looking for the third
tag within those brackets, but obviously not.

Thank you, that now works. I have a couple more questions (ah they always
do...)

Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
there a reasonably simple way of getting rid of that? The site is at
http://europa.eu.int/rapid/recentPre...guage=en&hits=10 if you need to see it.
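
One possible approach to the whitespace question, as a sketch: whitespace inside a URL carries no meaning, so it can be removed with a substitution once the attribute value has been extracted. The helper name clean_href and the sample value below are hypothetical. The sample only imitates an href broken across lines and is not taken from the actual page.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every run of whitespace (spaces, tabs,
# newlines) from an href value. An undefined value is returned unchanged.
sub clean_href {
    my ($href) = @_;
    return $href unless defined $href;
    (my $clean = $href) =~ s/\s+//g;
    return $clean;
}

# Made-up sample imitating an href split across lines in the page source.
my $raw = "pressReleasesAction.do?\n    reference=EPSO/05/06\n  &guiLanguage=en";
print clean_href($raw), "\n";
```

In the loop from this thread it would be applied to the extracted attribute, for example clean_href($tag->[1]{href}). Spaces that are already encoded as %20 are left alone, since only literal whitespace characters match.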

Secondly, I'm working towards following those hrefs and then parsing
the text I find there. Would I be better off using WWW::Mechanize to do
this?

Thanks again for your help.


 
DVH
      10-16-2005

it_says_BALLS_on_your forehead wrote:
> ahh, my mistake...
> use HTML::TokeParser;
> $p = HTML::TokeParser->new(shift||"index.html");
>
> while (my $token = $p->get_tag("a")) {
> my $url = $token->[1]{href} || "-";
> my $text = $p->get_trimmed_text("/a");
> print "$url\t$text\n";
> }
>
> ...yeah, you need to look at index 1, not index 2.
>


Thanks. It works with [1].


 
A. Sinan Unur
      10-16-2005
"DVH" wrote:
>
> I did. I had thought it would be tag[2] because I was looking for the
> third tag within those brackets, but obviously not.
>
> Thank you, that now works. I have a couple more questions (ah they
> always do...)
>
> Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.


ITYM "the HTML contains".


> Is there a reasonably simple way of getting rid of that? The site is
> at
> http://europa.eu.int/rapid/recentPre...asesAction.do?guiLanguage=en&hits=10
> if you need to see it.
>
> Secondly, I'm working towards getting following those hrefs and then
> parsing the text I find there. Would I be better off using
> WWW::Mechanize to do this?


#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use HTML::LinkExtractor;
use LWP::Simple;

my $url = q{http://europa.eu.int/rapid/recentPre...asesAction.do?guiLanguage=en};
my $html = get $url;

die "Cannot get <$url>\n" unless $html;

my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);

# Dump every extracted link whose class is 'docSel-formatLink'
for my $link ( @{ $lx->links } ) {
    # Not every link has a class attribute, so check before comparing
    if (defined $link->{class} and $link->{class} eq 'docSel-formatLink') {
        print Dumper $link;
    }
}


__END__

--
A. Sinan Unur

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
DVH
      10-19-2005


Sorry for getting back to you three days late, but thanks to both of you.


 
A. Sinan Unur
      10-19-2005
"DVH" wrote:

...
> Sorry for getting back to you three days late, but thanks to both
> of you.


You are welcome. Hope it helped.

Sinan


 