xml processing and sys.setdefaultencoding

 
 
christof hoeke
 
      07-20-2003
hi,
i wrote a small application which extracts javadoc-like documentation
for xslt stylesheets using python, xslt and pyana.
non-ascii characters were a problem, so i set the default encoding to
UTF-8 and now everything works (at least it seems so; i need to do more
testing though).

it may not be the most elegant solution (according to python in a nutshell),
but it almost seems that setting the default encoding is mandatory when
doing xml processing. xml processing should work almost exclusively with
unicode strings, and this seems the easiest solution.

any comments on this? are there better ways to work?

thanks
chris


 
Alan Kennedy
 
      07-20-2003
christof hoeke wrote:

> i wrote a small application which extracts a javadoc similar
> documentation
> for xslt stylesheets using python, xslt and pyana.
> using non-ascii characters was a problem.


That's odd. Did your stylesheets contain non-ascii characters? If yes,
did you declare the character encoding at the beginning of the
document, e.g.

<?xml version="1.0" encoding="iso-8859-1"?>

> so i set the [python] defaultending to
> UTF-8 and now everything works (at least it seems so, need to do more
> testing though).


If you don't put an encoding declaration in your XML documents
(including XSLT style/transform sheets), then an XML parser would by
default treat the document content as UTF-(8|16), as the XML standard
mandates.

Are you working from XML documents which are stored as strings inside
a python module? In that case, your special characters will actually
be encoded in whatever encoding your python module is stored in. So you
might need to put an encoding declaration on your python module:

http://www.python.org/peps/pep-0263.html
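
As a rough illustration of what such a declaration looks like (a sketch only: the constant name and its contents are made up, and it assumes the file really is saved as UTF-8; current Python 3 already assumes UTF-8 source when no declaration is given):

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration. It must be on line 1 or 2
# and must match the encoding the editor actually used to save the file.

# Hypothetical fragment of a stylesheet kept as a string in the module.
XSLT_COMMENT = "<!-- Übersicht für alle Vorlagen -->"

# Decoded correctly, each umlaut is one character,
# even though it takes two bytes in UTF-8.
umlaut_word_length = len("für")
comment_bytes = XSLT_COMMENT.encode("utf-8")
```

If the declaration and the real file encoding disagree, the literal is decoded with the wrong table and the umlauts come out wrong before any XML tool sees them.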

> it may not be the most elegant solution (according to python in a
> nutshell)
> but it almost seems when doing xml processing it is mandatory to set the
> default encoding. xml processing should almost only work with unicode
> strings and this seems the easiest solution.


It is always recommended to explicitly state the encoding on your XML
documents. If you don't, then the parser assumes UTF-(8|16). If your
documents aren't really UTF-(8|16), then you will get seemingly random
mapping of characters to other characters.
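
A small sketch of both failure modes, using the standard library's xml.etree.ElementTree rather than Pyana (whose behaviour is not shown here); the sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

text = "<doc>Grüße</doc>"

# The same characters as two different byte sequences.
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# With a declaration, the parser knows how to decode the Latin-1 bytes.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_bytes
root = ET.fromstring(declared)

# Without a declaration the parser assumes UTF-8, and these Latin-1
# bytes are not valid UTF-8, so parsing is rejected.
try:
    ET.fromstring(latin1_bytes)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# The reverse mistake does not raise: UTF-8 bytes read as Latin-1
# silently turn each umlaut into two unrelated characters.
mojibake = utf8_bytes.decode("iso-8859-1")
```

Here the undeclared Latin-1 input fails loudly, while the mis-decoded UTF-8 input is the quiet kind of character mix-up described above.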

> any comments on this? better ways to work


If you're not dealing specifically with ASCII, then declare your
encodings, in both your python modules and your xml documents. Find
out what is the default character set used by your text editor. Find
out how to change which character set is in use.

If you create, sell or maintain text editing or processing software,
make it easy for your users to find out what character encodings are
in effect.

HTH,

--
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/mailto/alan
 
Martin v. Löwis
 
      07-20-2003
"christof hoeke" <(E-Mail Removed)> writes:

> using non-ascii characters was a problem. so i set the defaultending to
> UTF-8 and now everything works (at least it seems so, need to do more
> testing though).


Can you please be more precise as to what problem exactly you have
observed?

Regards,
Martin

 
christof hoeke
 
      07-20-2003
hi,
first, thanks for the info. i need to try the encoding declaration in the
python module.

some more details about the problem i had (regarding the posts by Alan and
Martin):

the original problem with the app was that the Pyana transformation
complained about the string "xml" when it was passed in as unicode. so i used
str(xml), but that gave the usual "ordinal not in range" error when the xslt
contained e.g. german umlauts. i had not tried that before...
setting the default encoding to utf-8 fixed that. the reason is not entirely
clear to me yet, though.

- the xslt stylesheets used should have been in utf-8, as i did not state an
encoding explicitly
- xslt with latin-1 (iso-8859-1) encoding should work too, though
- the xslt contains german umlauts etc.
- yes, i did extract parts of the xslt into python strings

i read the other threads about unicode and also about PEP 0263. i have not
tried to set the encoding of the python file yet, but it sounds promising.
i am wondering, though: if i set the python file encoding to e.g. utf-8 and
then use a stylesheet with, let's say, latin-1 encoding, i still have a
mismatch, don't i?

if you are interested in the code, download it from
http://cthedot.de/pyxsldoc/
it is my first "bigger" python project, so the code is not the best, i guess,
and the version which does not work is still online. i need to upload the
version with the changed default encoding.

chris






 
Martin v. Löwis
 
      07-21-2003
"christof hoeke" <(E-Mail Removed)> writes:

> the original problem with the app was that the Pyana transformation
> complained about the string "xml" when it came over as unicode. so i used
> str(xml) but that gave the usual "ordinal not in range" error when the xslt
> contained e.g. german umlauts.


At that point, you should have done

xml = xml.encode("utf-8")

where you might need to make sure that the string "utf-8" matches the
encoding= given in the xml header.
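
One possible way to keep the two in step, sketched with a made-up helper (encode_like_declaration is hypothetical, not part of Pyana or the standard library). It assumes the declaration, if present, sits at the very start of the text and falls back to UTF-8 otherwise:

```python
import re

# Hypothetical helper: picks the encoding named in the XML declaration.
_DECL_ENCODING = re.compile(
    r'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']'
)

def encode_like_declaration(xml_text, default="utf-8"):
    """Encode a text string to bytes using the encoding its declaration names."""
    match = _DECL_ENCODING.match(xml_text)
    encoding = match.group(1) if match else default
    return xml_text.encode(encoding)

# Sample inputs, invented for illustration.
latin1_doc = '<?xml version="1.0" encoding="iso-8859-1"?><t>ä</t>'
plain_doc = "<t>ä</t>"

latin1_result = encode_like_declaration(latin1_doc)  # one byte for the umlaut
utf8_result = encode_like_declaration(plain_doc)     # two bytes for the umlaut
```

The bytes handed to the transformer then always agree with what the header claims, whichever encoding the stylesheet author chose.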

> i did not tried that before... setting the default encoding to
> utf-8 fixed that. the reason is not entirely clear to me yet though.


For any Unicode object X, str(X) is equivalent to
X.encode(sys.getdefaultencoding()). Since that defaults to "ascii",
str(X) is normally the same as X.encode("ascii"), which fails if you
have non-ASCII in your string.
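
The same failure can be reproduced by doing the ASCII encode explicitly. This sketch is written for current Python 3, where str() no longer encodes and the default encoding is no longer "ascii", so the implicit step from the thread is spelled out by hand; the sample string is invented:

```python
import sys

text = "Müller"

# Name of the interpreter-wide default encoding
# ("ascii" on the Python 2 interpreters discussed in this thread).
default_encoding = sys.getdefaultencoding()

# Encoding with ASCII is what str(X) did implicitly back then;
# it fails as soon as the text contains an umlaut.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the encoding explicitly works whatever the default is.
utf8_bytes = text.encode("utf-8")
```

Calling encode with an explicit encoding keeps the program independent of whatever default a given installation happens to use.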

> it is my first "bigger" python project, so the code is not the best
> i guess and the version which does not work is still online. i need
> to put on the version with the changed default encoding.


I advise that you get rid of the need to set the default
encoding. Many users will have set this to a value different from
"utf-8".

Regards,
Martin

 