Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > encoding in lxml

Reply
Thread Tools

encoding in lxml

 
 
jasiu85
Guest
Posts: n/a
 
      11-03-2008
Hey,

I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc('/html/body/something')

Now I'm guaranteed that the xpath_nodes list contains only one
element. So I read it's content:

xpath_nodes[0].text

And I get exception here. The exception is coming from the text
property of an Element object. The problem is that the text contains a
non-utf8 character. LXML seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?

Regards,

Mike
 
Reply With Quote
 
 
 
 
pjacobi.de@googlemail.com
Guest
Posts: n/a
 
      11-03-2008
Hi Mike,

> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.


There will be host of more lightweight solutions, but you can opt
to sanizite incominhg HTML with HTML Tidy (python binding available).

It will replace invalid UTF-8 bytes with U+FFFD. It will not
guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only
the
occasional wrong byte, even decoding (with fallback) and encoding
using
the standard codec package will do.

Regards,
Peter
 
Reply With Quote
 
 
 
 
Stefan Behnel
Guest
Posts: n/a
 
      11-03-2008
jasiu85 wrote:
> I have a problem with character encoding in LXML. Here's how it goes:
>
> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.


You can instantiate your own HTML parser and pass encoding="utf-8". That way,
when it's not UTF-8, you will get an exception at parse time, which allows you
to reparse the document with another encoding (say, ISO-8859-1) to get the
correct content.

Stefan
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Reading Text File Encoding and converting to Perls internal UTF-8 encoding sln@netherlands.com Perl Misc 2 04-17-2009 11:22 PM
[ANN] lxml 0.9 is out! Stefan Behnel Python 0 03-20-2006 08:17 PM
ANN: MathDOM 0.5.2 - MathML in Python - now featuring lxml API! Stefan Behnel Python 0 10-17-2005 09:30 AM
changing JVM encoding; setting -Dfile.encoding doesn't work pasmol@plusnet.pl Java 1 10-08-2004 09:50 PM
Encoding.Default and Encoding.UTF8 Hardy Wang ASP .Net 5 06-09-2004 04:04 PM



Advertisments