cElementTree encoding woes

 
 
Diez B. Roggisch
      02-20-2006
Hi,

I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: The guys creating that chunk of XML - well, let's just say they
are "encodingly challenged", so they don't produce utf-8, but only cp1252
instead, together with some weird name (Windows-1252) for that. That is not
part of the standard codecs module. cp1252 is, of course.

But that won't work for iterparse. So currently, I manually change the
encoding given to utf-8, and use a stream-recoder.

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.

Any suggestions?

Diez
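
For readers following along, here is a minimal sketch of what "registering cp1252 under another name" can look like on the Python side, assuming a current Python 3. The alias x-legacy-1252 and the function name alias_search are made up for this illustration and do not come from the thread. Note that this only teaches Python's codec registry a new name; as the post says, the old cET did not pick such a registration up.

```python
import codecs

# Hypothetical alias used only for this sketch (not a name from the thread).
HYPOTHETICAL_ALIAS = "x_legacy_1252"


def alias_search(name):
    """Codec search function that maps the hypothetical alias to cp1252."""
    # Lookup names arrive lower-cased; newer Pythons also turn hyphens into
    # underscores, so normalise here to cover both behaviours.
    if name.lower().replace("-", "_") == HYPOTHETICAL_ALIAS:
        return codecs.lookup("cp1252")
    return None  # let the other search functions handle everything else


codecs.register(alias_search)

# The euro sign is byte 0x80 in cp1252, so it makes a convenient check.
euro = "\u20ac".encode("X-Legacy-1252")
print(euro)
```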
 
 
 
 
 
Peter Otten
      02-20-2006
Diez B. Roggisch wrote:

> I've got to deal with a pretty huge XML-document, and to do so I use the
> cElementTree.iterparse functionality. Working great.
>
> Only trouble: The guys creating that chunk of XML - well, let's just say
> they are "encodingly challenged", so they don't produce utf-8, but only
> cp1252 instead, together with some weird name (Windows-1252) for that.
> That is not part of the standard codecs module. cp1252 is, of course.
>
> But that won't work for iterparse. So currently, I manually change the
> encoding given to utf-8, and use a stream-recoder.
>
> However, I was wondering if I could teach cElementTree about that encoding
> name. I tried to register cp1252 under the name Windows-1252, but had no
> luck - cET won't buy it.
>
> Any suggestions?


Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":

>>> import codecs
>>> codecs.open("windows.xml", encoding="windows-1252")
<open file 'windows.xml', mode 'rb' at 0x403737e0>

Maybe the problem lies in the python installation rather than cElementTree?
Just guessing, though.

Peter
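
A slightly more direct check than opening a file is to ask the codec registry what a name resolves to. This is a small sketch assuming a current Python 3; the variable names are illustrative only.

```python
import codecs

# Ask the registry which codec the label resolves to.
info = codecs.lookup("Windows-1252")
print(info.name)

# Different spellings of the same label should resolve to the same codec.
spellings = ["Windows-1252", "windows-1252", "windows_1252", "cp1252"]
resolved = {codecs.lookup(label).name for label in spellings}
print(resolved)
```

If the lookup raises LookupError instead, the name really is unknown to that installation.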

 
 
 
 
 
Diez B. Roggisch
      02-20-2006
> Both my python2.3 and python2.4 interpreters seem to know "Windows-1252":
>
> >>> import codecs
> >>> codecs.open("windows.xml", encoding="windows-1252")
> <open file 'windows.xml', mode 'rb' at 0x403737e0>
>
> Maybe the problem lies in the python installation rather than
> cElementTree? Just guessing, though.


Hm. No idea why I was under the impression they weren't there - but still,
it doesn't work. This script

import sys
import cElementTree

inf = file(sys.argv[1])
#inf = codecs.StreamRecoder(inf, encoder, decoder, reader, writer)

# just walk the whole document; the loop body is not the point here
for event, elem in cElementTree.iterparse(inf):
    pass

pukes on me with

Traceback (most recent call last):
  File "./splitter.py", line 31, in ?
    for event, elem in cElementTree.iterparse(inf):
  File "<string>", line 61, in __iter__
SyntaxError: not well-formed (invalid token): line 35, column 34

That is the first French (non-ASCII) character encountered.

"""<title>Introduction aux Probabilités</title>"""


So the problem is not that the codec is being ignored; it simply is not
working.

Regards,

Diez
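
The error message itself can be reproduced in isolation. The sketch below assumes a current Python 3 with the standard xml.etree.ElementTree (not the 2006 cElementTree from the post) and uses the title line from above. It shows that the cp1252 byte for é is rejected as "not well-formed (invalid token)" when the parser treats the input as UTF-8, and is accepted once the declared encoding is honoured. Whether this was the exact failure path in the original setup is an assumption.

```python
import xml.etree.ElementTree as ET

title = "<title>Introduction aux Probabilit\u00e9s</title>"
raw = title.encode("cp1252")  # the accented e becomes the single byte 0xE9

# Without an encoding declaration the parser assumes UTF-8, and a lone 0xE9
# followed by an ASCII letter is not valid UTF-8.
try:
    ET.fromstring(raw)
    undeclared_ok = True
except ET.ParseError as exc:
    undeclared_ok = False
    print("undeclared:", exc)

# With the declaration in place, a current ElementTree decodes the byte.
declared = b'<?xml version="1.0" encoding="Windows-1252"?>' + raw
elem = ET.fromstring(declared)
print("declared:", elem.text)
```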
 
 
Fredrik Lundh
      02-20-2006
Diez B. Roggisch wrote:

> I've got to deal with a pretty huge XML-document, and to do so I use the
> cElementTree.iterparse functionality. Working great.
>
> Only trouble: The guys creating that chunk of XML - well, let's just say they
> are "encodingly challenged", so they don't produce utf-8, but only cp1252
> instead, together with some weird name (Windows-1252) for that. That is not
> part of the standard codecs module. cp1252 is, of course.
>
> But that won't work for iterparse. So currently, I manually change the
> encoding given to utf-8, and use a stream-recoder.
>
> However, I was wondering if I could teach cElementTree about that encoding
> name. I tried to register cp1252 under the name Windows-1252, but had no
> luck - cET won't buy it.


you need cET 1.0.5 or later for this to work. for earlier versions, you have to use
stream recoding:

http://effbot.org/zone/celementtree-encoding.htm

</F>
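
As a rough modern counterpart to the stream-recoding idea (not the recipe from the linked page), the bytes can be decoded on the Python side before the parser sees them. The sketch below assumes a current Python 3, where iterparse accepts a text-mode file and the parser then works from the decoded text instead of the encoding named in the XML declaration. The helper name iter_titles, the sample document and the temporary file are illustrative only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_titles(path, encoding="cp1252"):
    """Yield the text of every <title>, decoding the file in Python first."""
    # Text mode: Python decodes cp1252, so the parser is handed str chunks.
    with open(path, "r", encoding=encoding) as source:
        for event, elem in ET.iterparse(source):
            if elem.tag == "title":
                yield elem.text
            elem.clear()  # drop finished elements to keep memory flat


# Build a small cp1252 file that still declares itself as Windows-1252.
sample = (
    '<?xml version="1.0" encoding="Windows-1252"?>\n'
    "<books><title>Introduction aux Probabilit\u00e9s</title></books>"
)
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("cp1252"))

try:
    titles = list(iter_titles(path))
finally:
    os.remove(path)

print(titles)
```

Because the decoding happens in open(), the name used in the XML declaration no longer has to be one the parser understands.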



 