Re: lxml can't output right unicode result

 
 
MRAB
      09-07-2012
On 07/09/2012 01:21, contro opinion wrote:
> I edited a file and saved it in GBK encoding, named test. My system is
> Debian, locale en.UTF-8; Python 2.6, locale UTF-8.
>
> <html>
> <p>*</p>
> </html>
>
> In the terminal I run:
>
> xxd test
>
> 0000000: 3c68 746d 6c3e 0a3c 703e c4e3 3c2f 703e <html>.<p>..</p>
> 0000010: 0a3c 2f68 746d 6c3e 0a .</html>.
>
> * stands for the Chinese character 你 ("you" in English).
> "\xc4\xe3" is its GBK encoding.
> "\xe4\xbd\xa0" is its UTF-8 encoding.
> u"\u4f60" is its Unicode code point.
> Now I parse it with lxml.
>
> >>> "*"
> '\xe4\xbd\xa0'
> >>> "*".decode("utf-8")
> u'\u4f60'
> >>> "*".decode("utf-8").encode("gbk")
> '\xc4\xe3'
>
> code1:
>
> >>> import lxml.html
> >>> root=lxml.html.parse("test")
> >>> d=root.xpath("//p")
> >>> d[0].text_content()
> u'\xc4\xe3'
>
> According to the material I read, lxml parses a file and outputs Unicode.
> Why doesn't d[0].text_content() output u'\u4f60'?
>
> code2:
>
> import codecs
> import lxml.html
> f = codecs.open('test', 'r', 'gbk')
> root=lxml.html.parse(f)
> d=root.xpath("//p")
> d[0].text_content()
> u'\xe4\xbd\xa0'
>
> Why doesn't d[0].text_content() output u'\u4f60'?
>
> I have been confused by this problem for two days.
>

You can't just put some text into a file and expect it to know
"magically" what the encoding is. You have to specify that the encoding
is GBK, something like this (in a file actually encoded as GBK, of
course):

<html>
<meta http-equiv="content-type" content="text/html; charset=gbk">
<p>*</p>
</html>
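
To see why the declaration matters, here is a minimal sketch (Python 3 syntax, unlike the Python 2.6 session quoted above) of what happens to those bytes when a reader guesses the wrong encoding. It assumes, as the u'\xc4\xe3' and u'\xe4\xbd\xa0' results suggest, that the parser fell back to a one-byte-per-character (Latin-1 style) decoding because no charset was declared. The variable names are only illustrative.

```python
# The two bytes stored in the file for the character inside <p>...</p>.
raw = b"\xc4\xe3"

# Decoded with the encoding the file was actually saved in.
as_gbk = raw.decode("gbk")

# Decoded one byte per character, which is what code1 appears to show.
as_latin1 = raw.decode("latin-1")

# code2 appears to show the same mistake applied to the UTF-8 bytes:
# the text was decoded correctly, re-encoded as UTF-8 on the way into
# the parser, and those three bytes were then read one per character.
utf8_misread = as_gbk.encode("utf-8").decode("latin-1")

print(ascii(as_gbk))        # '\u4f60'
print(ascii(as_latin1))     # '\xc4\xe3'
print(ascii(utf8_misread))  # '\xe4\xbd\xa0'
```

If the file itself can't be edited, lxml also lets you name the encoding on the parser, for example by passing parser=lxml.html.HTMLParser(encoding="gbk") to lxml.html.parse.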

I hope there's a good reason why you're using that encoding and not
UTF-8.
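
If nothing forces you to keep GBK, converting the file once is simple. This is a rough sketch in Python 3 syntax, not a finished tool: the paths are hypothetical, and the script recreates the sample file from the question in a temporary directory so that it runs on its own.

```python
import os
import tempfile

# Hypothetical paths; a temporary directory keeps the sketch self-contained.
workdir = tempfile.mkdtemp()
gbk_path = os.path.join(workdir, "test")
utf8_path = os.path.join(workdir, "test.utf8.html")

# Recreate the sample file from the question, saved as GBK.
with open(gbk_path, "wb") as f:
    f.write(b"<html>\n<p>\xc4\xe3</p>\n</html>\n")

# Read it with the encoding it was saved in, then write it back as UTF-8.
with open(gbk_path, encoding="gbk") as src:
    text = src.read()
with open(utf8_path, "w", encoding="utf-8", newline="") as dst:
    dst.write(text)

# Inspect the raw bytes of the converted file.
with open(utf8_path, "rb") as f:
    converted = f.read()
print(converted)
```

The converted file holds the three UTF-8 bytes \xe4\xbd\xa0 in place of the two GBK bytes. It is probably still worth declaring the charset in a meta tag, now naming utf-8.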
 