Erroneous Text Extraction using HTML::Parser

 
 
Himanshu Garg
      01-27-2004
Hello,
I am using HTML::Parser to extract text from HTML pages from
http://bbc.co.uk/urdu/

However, the encoding of the input text seems to change to some
unknown encoding in the output.

The program is given below. The HTML is in a string to keep the
example simple. The same problem appears with HTML in a file.

#################################################################
use HTML::Parser;

# set standard output to utf8
binmode(STDOUT, ":utf8");

# Create parser object; the text handler receives each text segment
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ \&text, "text" ],
);

# parse UTF-8 encoded arabic text
$p->parse( "<html> <body>
پاکستان </body> </html>");
$p->eof;    # signal end of input so any buffered text is flushed

# text handler: print each text segment as it arrives
sub text
{
    my ($txt) = @_;
    print $txt;
}
#################################################################

Also, I am unable to pinpoint the problem by looking at the
parser source code, because HTML/Parser.pm doesn't seem to contain any
code that does the real parsing work.

Thank You
Himanshu.
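
A likely cause, offered as a sketch rather than a confirmed diagnosis: without `use utf8;`, Perl treats the UTF-8 bytes of the string literal as individual Latin-1 characters, and the `:utf8` layer on STDOUT then encodes each of those bytes a second time, so the output ends up double-encoded. The version below assumes Perl 5.8 or later and an HTML::Parser release with Unicode support; the helper name `extract_text` is made up for illustration and is not part of the original program.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8, so the
             # literal below is read as characters, not raw bytes
use HTML::Parser;

# Encode characters as UTF-8 exactly once, on output.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: collect all text segments of an HTML string.
sub extract_text {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # The handler may be called several times; append each segment.
        text_h      => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any text still buffered by the parser
    return $out;
}

print extract_text("<html> <body>\nپاکستان </body> </html>");
```

For HTML read from a file or fetched over HTTP, the equivalent step would be to decode the bytes into characters first, for example with `Encode::decode('UTF-8', $bytes)`, before passing them to `parse`. The other consistent option is to keep everything as bytes and drop the `binmode` call; mixing undecoded input with an encoding output layer is what typically produces the garbled result.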
 