Velocity Reviews - Computer Hardware Reviews

HTML::TableExtract punctuation parsing

 
 
Maqo
 
      05-22-2005
Is there any way to prevent HTML::TableExtract from mangling punctuation
in parsed text? For example, the below code is parsing “don’t come”
in the target URL as “don’t come”. Is it something about the
document encoding, or a limitation of the module?

Many thanks!

------------------------------------------------------------

use LWP::Simple;
use HTML::TableExtract;

$URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";

$content = get($URL);
my $te = new HTML::TableExtract( depth=>1, count=>4, gridmap=>0,
    keep_html=>1);

$te->parse($content);
foreach $ts ($te->table_states)
{
    foreach $row ($ts->rows)
    {
        print $$row[0];
    }
}
 
 
Bob Walton
 
      05-23-2005
Maqo wrote:

> Is there any way to prevent HTML::TableExtract from mangling punctuation
> in parsed text? For example, the below code is parsing “don’t come” in
> the target URL as “don’t come”. Is it something about the
> document encoding, or a limitation of the module?

....

> $URL =
> "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";


My browser says that web page is Unicode with UTF-8 encoding. If
you process it as Unicode with UTF-8 encoding, you'll probably be
fine. Otherwise, as you noted, you'll get gibberish. If you
view the results of your print() with a Unicode with UTF-8
viewer, you should be OK, as you are doing nothing that should
alter the non-ASCII characters.

Your web browser is probably a good candidate for such a viewer,
providing you set the right character code/encoding.
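
To make the point above concrete, here is a minimal sketch (not part of the original posts) of how the same bytes read correctly or as gibberish depending on the encoding the reader assumes. The byte string is a hand-written stand-in for what a UTF-8 page would send for the curly-quoted phrase; nothing is fetched from the real URL, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: the UTF-8 bytes for the phrase with curly quotes
# (U+201C, U+2019, U+201D). Hand-written here, not fetched from the page.
my $bytes = "\xE2\x80\x9Cdon\xE2\x80\x99t come\xE2\x80\x9D";

# Interpreting the bytes as UTF-8 gives one character per punctuation mark.
my $as_utf8 = decode('UTF-8', $bytes);

# Interpreting the same bytes as ISO-8859-1 gives one character per byte,
# which is the gibberish a viewer set to a Western encoding would display.
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "bytes: %d, as UTF-8: %d chars, as Latin-1: %d chars\n",
    length($bytes), length($as_utf8), length($as_latin1);
```

The data is identical in both cases; only the interpretation differs, which is why changing the viewer's encoding is enough to make the punctuation appear correctly.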

....
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
 
 
Maqo
 
      05-24-2005
Bob Walton wrote:

> My browser says that web page is Unicode with UTF-8 encoding. If you
> process it as Unicode with UTF-8 encoding, you'll probably be fine.
> Otherwise, as you noted, you'll get gibberish. If you view the results
> of your print() with a Unicode with UTF-8 viewer, you should be OK, as
> you are doing nothing that should alter the non-ASCII characters.


Thanks Bob, that's what I had suspected as well, which is why I can't
for the life of me understand why this is still giving me gibberish (I
must be missing something with respect to proper decoding of UTF-8):

use LWP::UserAgent;
use HTML::Encoding 'encoding_from_http_message';
use Encode;

my $URL =
"http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";

my $content = LWP::UserAgent->new->get($URL, 'Accept-Charset'=>'UTF-8');
my $enco = encoding_from_http_message($content);
my $utf8 = decode($enco => $content->content());
open (OUT, ">:encoding(utf8)", "out.html");
print OUT $utf8;
close (OUT);
 
 
Bob Walton
 
      05-25-2005
Maqo wrote:

> Bob Walton wrote:
>
>> My browser says that web page is Unicode with UTF-8 encoding. If you
>> process it as Unicode with UTF-8 encoding, you'll probably be fine.
>> Otherwise, as you noted, you'll get gibberish. If you view the
>> results of your print() with a Unicode with UTF-8 viewer, you should
>> be OK, as you are doing nothing that should alter the non-ASCII
>> characters.

>
>
> Thanks Bob, that's what I had suspected as well, which is why I can't
> for the life of me understand why this is still giving me gibberish (I
> must be missing something with respect to proper decoding of UTF-8):
>
> use LWP::UserAgent;
> use HTML::Encoding 'encoding_from_http_message';
> use Encode;
>
> my $URL =
> "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";
>
>
> my $content = LWP::UserAgent->new->get($URL, 'Accept-Charset'=>'UTF-8');
> my $enco = encoding_from_http_message($content);
> my $utf8 = decode($enco => $content->content());
> open (OUT, ">:encoding(utf8)", "out.html");
> print OUT $utf8;
> close (OUT);


Well, I'm certainly no expert at all these encodings, but I note
that when running your program above verbatim, one still ends up
with "out.html" containing UTF-8 encoded Unicode. In fact,
out.html is character-for-character identical with the file
generated from:

use LWP::Simple;
open OUT,">out1.html" or die "Oops, $!";
print OUT get('http://www.p...');
#[trailing portion of long URL elided]
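
A likely reason the two files match is that decoding on the way in and encoding on the way out cancel each other. The sketch below is an assumption-laden illustration, not the original program: it uses a short hand-written byte string instead of the fetched page, and calls encode() directly to stand in for roughly what the output encoding layer does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: UTF-8 bytes for "don't come" with a curly apostrophe (U+2019).
my $bytes = "don\xE2\x80\x99t come";

# Step 1: bytes from the network become a Perl character string.
my $unicode = decode('UTF-8', $bytes);

# Step 2: writing through a UTF-8 output layer turns the characters
# back into bytes; encode() is used here as a stand-in for that layer.
my $written = encode('UTF-8', $unicode);

print $written eq $bytes ? "identical\n" : "different\n";
```

So the decode step is not wasted, but on its own it cannot change what ends up in the file; any conversion has to happen on the character string between the two steps.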

It seems that what you really want to do is convert the "weird"
quote and apostrophe characters and the em-dash from Unicode to
their nearest ASCII equivalents. There is certainly no
general-purpose converter to take Unicode and make "best guess"
ASCII out of it (what would it do with Chinese characters, for
example?). Perl can convert the UTF-8 encoding to true Unicode
in Perl strings (which apparently is happening with your $utf8
variable), and one could then use the tr/// operator to convert
the unwanted codes to the ASCII characters you want to use as
their approximation.

For example, try adding this line to your above program just
after your "my $utf8..." line and before the open():

$utf8=~tr/\x{2019}\x{201c}\x{201d}\x{2013}/'""-/;

and see if that will suffice. It appears that the call to
decode() from the Encode module is needed to convert the UTF-8
encoding from the web page into a true Unicode string. It may thus
be misleading to call the variable $utf8; perhaps $unicode would
be more descriptive?
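
As a self-contained sketch of the same idea, the snippet below decodes a hand-written sample and then applies tr///. The sample string and the variable names are assumptions for illustration only. Note that U+2013 is the en dash and U+2014 is the em dash, so this version maps both, along with the left single quote U+2018.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample input: UTF-8 bytes for curly quotes, an apostrophe,
# an en dash (U+2013) and an em dash (U+2014). Not taken from the page.
my $bytes = "\xE2\x80\x9Cdon\xE2\x80\x99t come\xE2\x80\x9D "
          . "\xE2\x80\x93 or \xE2\x80\x94 do";
my $unicode = decode('UTF-8', $bytes);

my $ascii = $unicode;
# Single and double curly quotes to their straight ASCII forms.
$ascii =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
# En dash and em dash both become a plain hyphen; a replacement list
# shorter than the search list repeats its last character.
$ascii =~ tr/\x{2013}\x{2014}/-/;

print "$ascii\n";
```

Any character not listed in the tr/// search lists passes through unchanged, so this only covers the punctuation named here.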

BTW, you should test your open to ensure it executed
successfully. The typical paradigm is:

open(...) or die "Your error message, $!";
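
A slightly fuller sketch of that paradigm, combined with the UTF-8 output layer from the program above, might look like this. The temporary file from File::Temp and the sample string are assumptions made so the snippet runs on its own; the original program writes to out.html instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed scratch file so the example does not touch out.html.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# Assumed sample: a character string containing a curly apostrophe.
my $unicode = "don\x{2019}t come";

# Three-argument open with a lexical filehandle, checked for failure.
open(my $out, '>:encoding(UTF-8)', $path)
    or die "Cannot open $path for writing: $!";
print {$out} $unicode;
close($out) or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what the layer wrote.
open(my $in, '<:raw', $path)
    or die "Cannot open $path for reading: $!";
my $bytes = do { local $/; <$in> };
close($in);

printf "%d characters in, %d bytes on disk\n",
    length($unicode), length($bytes);
```

Including the file name and $! in the message makes a failed open much easier to diagnose than a silent empty output file.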

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
 