ElementTree cannot parse UTF-8 Unicode?

 
 
Erik Bethke
      01-19-2005
Hello All,

I am getting a "not well-formed" error at the beginning of the Korean
text in the second example. Am I doing something wrong with how I am
encoding my Korean? Do I need more of a wrapper around it than simple
quotes? Is there some sort of XML syntax for indicating a Unicode
string, or does the ElementTree library just not support reading
Unicode?

here is my test snippet:

from elementtree import ElementTree
vocabXML = ElementTree.parse('test2.xml').getroot()

where I have two data files:

this one works:
<?xml version="1.0" encoding="UTF-8"?>
<Vocab>
<Word L1='Hahha'></Word>
</Vocab>

this one fails:
<?xml version="1.0" encoding="UTF-8"?>
<Vocab>
<Word L1="어녕하세요!"></Word>
</Vocab>

 
Fredrik Lundh
      01-19-2005
Erik Bethke wrote:

> I am getting a "not well-formed" error at the beginning of the Korean
> text in the second example. Am I doing something wrong with how I am
> encoding my Korean? Do I need more of a wrapper around it than simple
> quotes? Is there some sort of XML syntax for indicating a Unicode
> string, or does the ElementTree library just not support reading
> Unicode?


XML is Unicode, and ElementTree supports all common encodings just
fine (including UTF-8).

> this one fails:
> <?xml version="1.0" encoding="UTF-8"?>
> <Vocab>
> <Word L1="어녕하세요!"></Word>
> </Vocab>


this works just fine on my machine.

what's the exact error message?

what does

print repr(open("test2.xml").read())

print on your machine?

what happens if you attempt to parse

<Vocab>
<Word L1="&#xc5b4;&#xb155;&#xd558;&#xc138;&#xc694;!" />
</Vocab>

?

</F>
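
A minimal sketch of the kind of test suggested above, assuming Python 3 and the standard-library xml.etree.ElementTree (the successor to the standalone elementtree package used in this thread). It parses the attribute once as literal UTF-8 bytes and once as numeric character references; both should give the same Unicode string. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

word = "\uc5b4\ub155\ud558\uc138\uc694!"  # the Korean text from the failing file

# Case 1: the characters are stored literally, encoded as UTF-8 bytes.
literal_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Vocab>\n"
    '<Word L1="%s"></Word>\n'
    "</Vocab>" % word
).encode("utf-8")

# Case 2: the non-ASCII characters are written as numeric character
# references, which keeps the file pure ASCII.
refs = "".join(ch if ord(ch) < 128 else "&#x%x;" % ord(ch) for ch in word)
ascii_doc = ('<Vocab>\n<Word L1="%s" />\n</Vocab>' % refs).encode("ascii")

literal_value = ET.fromstring(literal_doc).find("Word").get("L1")
ascii_value = ET.fromstring(ascii_doc).find("Word").get("L1")
```

If both values compare equal to the original string, the parser itself is handling the characters, and any remaining problem is more likely in the bytes of the file being parsed.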



 
Erik Bethke
      01-20-2005
Hello Fredrik,

1) The exact error is in line 1160 of self._parser.Parse(data, 0 ):
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3,
column 16

2) You are right in that the print of the file read works just fine.

3) You are also right in that the digitally encoded unicode also works
fine. However, this solution has two new problems:

1) The xml file is now not human readable
2) After ElementTree gets done parsing it, I am feeding the text to a
wx.TextCtrl via .SetValue() but that is now giving me an error message
of being unable to convert that style of string

So it seems to me, that ElementTree is just not expecting to run into
the Korean characters for it is at column 16 that these begin. Am I
formatting the XML properly?

Thank you,
-Erik
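
One way to get this kind of error, offered as an assumption rather than a diagnosis of the file in question: the declaration says UTF-8, but the bytes on disk are in some other encoding. The sketch below uses Python 3's xml.etree.ElementTree and picks EUC-KR purely as an illustrative legacy Korean encoding.

```python
import xml.etree.ElementTree as ET

word = "\uc5b4\ub155\ud558\uc138\uc694!"
text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Vocab>\n"
    '<Word L1="%s"></Word>\n'
    "</Vocab>" % word
)

good_bytes = text.encode("utf-8")   # matches the declaration
bad_bytes = text.encode("euc_kr")   # hypothetical mismatch: declared UTF-8, stored as EUC-KR

parsed = ET.fromstring(good_bytes).find("Word").get("L1")

try:
    ET.fromstring(bad_bytes)
    error_message = None
except ET.ParseError as exc:
    # The parser stops at the first byte sequence that is not valid UTF-8.
    error_message = str(exc)
```

Comparing `repr()` of the raw file contents against the expected UTF-8 bytes is a quick way to check whether a mismatch like this is in play.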

 
Jeremy Bowers
      01-20-2005
On Wed, 19 Jan 2005 16:35:23 -0800, Erik Bethke wrote:
> So it seems to me, that ElementTree is just not expecting to run into the
> Korean characters for it is at column 16 that these begin. Am I
> formatting the XML properly?


You should post the file somewhere on the web. (I wouldn't expect Usenet
to transmit it properly.)

(Just jumping in to possibly save you a reply cycle.)

 
Fredrik Lundh
      01-20-2005
Erik Bethke wrote:

> 2) You are right in that the print of the file read works just fine.


but what does it look like? I saved a raw copy of your original mail,
fixed the quoted-printable encoding, and got a UTF-8 encoded file
that works just fine. the thing you've been parsing, and that you've
cut and pasted into your mail, must be different, in some way.

> 3) You are also right in that the digitally encoded unicode also works
> fine. However, this solution has two new problems:


that was just a test to make sure that your version of elementtree could
handle Unicode characters on your platform.

> 1) The xml file is now not human readable
> 2) After ElementTree gets done parsing it, I am feeding the text to a
> wx.TextCtrl via .SetValue() but that is now giving me an error message
> of being unable to convert that style of string


on my machine, the L1 attribute contains a Unicode string:

>>> print repr(root.find("Word").get("L1"))
u'\uc5b4\ub155\ud558\uc138\uc694!'

what does it give you on your machine? (looks like wxPython cannot handle
Unicode strings, but can that really be true?)

> So it seems to me, that ElementTree is just not expecting to run into
> the Korean characters for it is at column 16 that these begin. Am I
> formatting the XML properly?


nobody knows...

</F>



 
Do Re Mi chel La Si Do
      01-20-2005
Hi !

>>> ...Usenet to transmit it properly


newsgroups (NNTP) : yes, it does it
usenet : perhaps (that depends on the newsgroups)
clp : no





Michel Claveau


 
Jorge Luiz Godoy Filho
      01-20-2005
Fredrik Lundh, Thursday 20 January 2005 05:17, wrote:

> what does it give you on your machine? (looks like wxPython cannot handle
> Unicode strings, but can that really be true?)


It does support Unicode if it was built to do so...

--
Godoy.

 
Fredrik Lundh
      01-20-2005
Jorge Luiz Godoy Filho wrote:

>> what does it give you on your machine? (looks like wxPython cannot handle
>> Unicode strings, but can that really be true?)

>
> It does support Unicode if it was built to do so...


Python has supported Unicode in releases 1.6, 2.0, 2.1, 2.2, 2.3 and 2.4, so
you might think that Unicode should be enabled by default in a UI toolkit for
Python...

</F>



 
Erik Bethke
      01-20-2005
There is something wrong with the physical file... I downloaded a trial
version of XML Spy Home Edition, built an equivalent of the Korean
test file, and tried it. It got past the ElementTree error, and now
I am stuck with the wxEditCtrl error.

To build the xml file in the first place I had code that looked like
this:

d = wxFileDialog(self, message="Choose a file",
                 defaultDir=os.getcwd(), defaultFile="", wildcard="*.xml",
                 style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
if d.ShowModal() == wx.ID_OK:
    # This returns a Python list of files that were selected.
    paths = d.GetPaths()
    layout = '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n'
    L1Word = self.t1.GetValue()
    L2Word = 'undefined'

    layout += '<Vocab>\n'
    layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
    layout += '</Vocab>'
    open(paths[0], 'w').write(layout)
d.Destroy()

So apparently there is something wrong with physically constructing the
file in this manner?

Thank you,
-Erik
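
As an alternative to concatenating strings, here is a minimal sketch of letting ElementTree serialise and encode the document itself, assuming Python 3 and the standard-library xml.etree.ElementTree. `save_vocab` and its arguments are hypothetical names, the GUI code is left out, and `l1_word` stands in for whatever the text control returns.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_vocab(path, l1_word):
    """Write a one-word vocabulary file as UTF-8 (hypothetical helper)."""
    root = ET.Element("Vocab")
    # ElementTree escapes attribute values, so quotes or '&' in the word
    # cannot break the markup.
    ET.SubElement(root, "Word", L1=l1_word)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "test2.xml")
save_vocab(path, "\uc5b4\ub155\ud558\uc138\uc694!")
round_trip = ET.parse(path).getroot().find("Word").get("L1")
```

The point of the design is that the encoding named in the declaration and the encoding of the bytes on disk come from the same call, so they cannot drift apart.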

 
Fredrik Lundh
      01-20-2005
Erik Bethke wrote:

> layout += '<Vocab>\n'
> layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'


what does "print repr(L1Word)" print (that is, what does wxPython return?).
it should be a Unicode string, but that would give you an error when you write
it out:

>>> f = open("file.txt", "w")
>>> f.write(u'\uc5b4\ub155\ud558\uc138\uc694!')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters
in position 0-4: ordinal not in range(128)

have you hacked the default encoding in site/sitecustomize?

what happens if you replace the L1Word term with L1Word.encode("utf-8")?

can you post the repr() (either of what's in your file or of the thing, whatever
it is, that wxPython returns)?

</F>
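
A small sketch of the point about encoding before writing, assuming Python 3, where the split between text and bytes is explicit. Writing the string through an ASCII-only text file fails much like the traceback above, while encoding to UTF-8 first and writing bytes succeeds. The file name and variable names are illustrative.

```python
import os
import tempfile

word = "\uc5b4\ub155\ud558\uc138\uc694!"
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# An ASCII-only text file cannot represent the Korean characters.
try:
    with open(path, "w", encoding="ascii") as f:
        f.write(word)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly produces bytes that can be written as-is.
with open(path, "wb") as f:
    f.write(word.encode("utf-8"))

with open(path, "rb") as f:
    stored = f.read()
```

Each Korean syllable here takes three bytes in UTF-8, so the stored file is longer than the string has characters.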



 