umlauts

 
 
Arian Kuschki
      10-17-2009
Hi all

this has been bugging me for a long time and I do not seem to be able to
understand what to do. I always have problems when dealing with input text that
contains umlauts. Consider the following:

In [1]: import urllib

In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")

In [3]: xml = f.read()

In [4]: f.close()

In [5]: print xml
------> print(xml)
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
<forecast_information><city data="Munich, BY"/><postal_code data="Muenchen"/>
<latitude_e6 data=""/><longitude_e6 data=""/><forecast_date data="2009-10-17"/>
<current_date_time data="2009-10-17 14:20:00 +0000"/><unit_system data="SI"/>
</forecast_information>
<current_conditions><condition data="Meistens bew�kt"/><temp_f data="43"/>
<temp_c data="6"/><humidity data="Feuchtigkeit: 87�%"/>
<icon data="/ig/images/weather/mostly_cloudy.gif"/>
<wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/>
</current_conditions>
<forecast_conditions><day_of_week data="Sa."/><low data="1"/><high data="7"/>
<icon data="/ig/images/weather/chance_of_rain.gif"/>
<condition data="Vereinzelt Regen"/></forecast_conditions>
<forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/>
<icon data="/ig/images/weather/chance_of_snow.gif"/>
<condition data="Vereinzelt Schnee"/></forecast_conditions>
<forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/>
<icon data="/ig/images/weather/mostly_sunny.gif"/>
<condition data="Teils sonnig"/></forecast_conditions>
<forecast_conditions><day_of_week data="Di."/><low data="0"/><high data="8"/>
<icon data="/ig/images/weather/sunny.gif"/>
<condition data="Klar"/></forecast_conditions></weather></xml_api_reply>

As you can see the umlauts in the XML are not displayed properly. When I want
to process this text (for example with xml.sax), I get error messages because
the parser can't read this.

I've tried to read up on this and there is a lot of information on the web, but
nothing seems to work for me. For example setting the coding to UTF like this:
# -*- coding: utf-8 -*- or using the decode() string method.

I always have this kind of problem when input contains umlauts, not just in
this case. My locale (on Ubuntu) is en_GB.UTF-8.

Cheers
Arian



 
Diez B. Roggisch
      10-17-2009
Arian Kuschki wrote:
> I've tried to read up on this and there is a lot of information on the web, but
> nothing seems to work for me. For example setting the coding to UTF like this:
> # -*- coding: utf-8 -*- or using the decode() string method.


The encoding of the Python source file has nothing to do with this. It is
only relevant for unicode literals (in Python 2.x, that's u"...").

>
> I always have this kind of problem when input contains umlauts, not just in
> this case. My locale (on Ubuntu) is en_GB.UTF-8.


If we assume the data on the website is correct (it appears to be when I
open it in Firefox), then your problem is most probably your display/terminal.

What does this show you in your interactive interpreter?

>>> print "\xc3\xb6"

ö

For me, it's o-umlaut, ö. This is because the above bytes are the
sequence for ö in utf-8.

If this shows something else, you need to adjust your terminal settings.

Diez
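
The test above can be read as a question about bytes. A minimal sketch, written for Python 3 rather than the Python 2 used in the thread, of how the same character looks in UTF-8 and in ISO-8859-1, and what happens when a Latin-1 byte is read as UTF-8. The variable names are made up for illustration.

```python
# Python 3 sketch; the snippets in the thread itself are Python 2.
o_umlaut = "\u00f6"  # the character o-umlaut

utf8_bytes = o_umlaut.encode("utf-8")          # two bytes
latin1_bytes = o_umlaut.encode("iso-8859-1")   # one byte

print(utf8_bytes)    # b'\xc3\xb6'
print(latin1_bytes)  # b'\xf6'

# A lone 0xF6 byte is not valid UTF-8. Decoding it with errors="replace"
# yields U+FFFD, the replacement character a UTF-8 terminal tends to show.
shown_as = latin1_bytes.decode("utf-8", errors="replace")
print(shown_as == "\ufffd")  # True
```

If the terminal prints the two-byte sequence correctly, as in the test above, the terminal is fine and the incoming data is probably not UTF-8.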
 
StarWing
      10-17-2009
try this?

# vim: set fencoding=utf-8:
import urllib
import xml.sax as sax, xml.sax.handler as handler

f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
xml = f.read()
xml = xml.decode("cp1252")
f.close()

class my_handler(handler.ContentHandler):
    def startElement(self, name, attrs):
        print "begin:", name, attrs

    def endElement(self, name):
        print "end:", name

sax.parseString(xml, my_handler())
 
Diez B. Roggisch
      10-17-2009
This is wrong. XML is a *byte*-based format that explicitly states its
encoding. So decoding a byte string to a unicode object and then
passing it to a parser stops working the moment

- the data contains characters outside your default system encoding (usually ASCII), or
- the system encoding and the declared encoding differ.

Besides, I don't see what the whole SAX stuff is supposed to do that
the direct print and the decode() don't do - smells like
cargo-cult to me.

Diez
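
To illustrate the point that decoding is the parser's job: a small Python 3 sketch, assuming a made-up one-element document whose header declares its encoding. The parser is given raw bytes and does the decoding itself. ElementTree is used here only for brevity; it is not what the thread uses.

```python
# Python 3 sketch; the document below is invented for illustration.
import xml.etree.ElementTree as ET

# Raw bytes as they might arrive from a server: Latin-1 text, with the
# encoding declared in the XML header.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<condition data="bew\u00f6lkt"/>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes on its own.
root = ET.fromstring(raw)
value = root.get("data")
print(value == "bew\u00f6lkt")  # True
```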
 
StarWing
      10-17-2009
Yes, XML is a *byte*-based format, and so are UTF-8 and the code pages
(cp936, cp1252, etc.). So usually XML declares its encoding in the header,
but that didn't work here.

In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
sys.setdefaultencoding(), and f.read() returns a str. So it must be
undecoded, byte-based data (i.e. raw XML), and decoding it with the right
code page is safe (notice the website is google.de).

In Python 3.1, read() returns a bytes object, so we *must* decode it,
or else we can't pass it to a parser.
 
Arian Kuschki
      10-17-2009
Whoa, that was quick! Thanks for all the answers. I'll try to recapitulate.

> What does this show you in your interactive interpreter?
>
> >>> print "\xc3\xb6"
>
> For me, it's o-umlaut, ö. This is because the above bytes are the
> sequence for ö in utf-8.
>
> If this shows something else, you need to adjust your terminal settings.

For me it also prints the correct o-umlaut (ö), so that was not the problem.


All of the below result in xml that shows all umlauts correctly when printed:

xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")

But when I then want to parse the XML, it only works if I
do both the decode and the encode. If I only decode, I get the following error:
SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Do I understand right that since the encoding was not specified in the xml
response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
would not have had the encoding problem in the first place?

Anyway, thanks everybody, this has helped me a lot.

Arian
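
The question about the default encoding can be checked with a short experiment. This is a Python 3 sketch under stated assumptions: the XML snippet is a made-up stand-in for the real response, and parses() is a hypothetical helper, not part of xml.sax.

```python
# Python 3 sketch; the XML below stands in for the server response.
import xml.sax
from xml.sax.handler import ContentHandler

text = '<?xml version="1.0"?><condition data="bew\u00f6lkt"/>'

def parses(data):
    """Return True if xml.sax accepts the given bytes (hypothetical helper)."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# The header declares no encoding, so the parser assumes UTF-8.
print(parses(text.encode("utf-8")))   # True
print(parses(text.encode("cp1252")))  # False
```

So bytes that really are UTF-8 go through, while cp1252 bytes without a declaration are rejected.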


 
Diez B. Roggisch
      10-17-2009
You didn't get my point. An XML parser only *takes* a byte string.
Decoding is its business. So your last sentence above is wrong.

Regardless of the Python version, if you feed the parser a
unicode object, Python will first encode that to a byte string, possibly
raising a UnicodeError (maybe this automatic conversion is gone in Py3K,
but then you get a TypeError instead).

So to make the above work (if one wants to parse the XML), the proper
thing to do would be

xml = xml.decode("cp1252").encode("utf-8")

and then feed that to the parser. Of course the really good thing would be
to fix the webpage, but that's beyond our capabilities I fear...

Diez
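
The decode-then-encode step above can be sketched end to end. This is a Python 3 version under stated assumptions: raw is a made-up stand-in for the bytes returned by f.read(), and DataCollector is a hypothetical handler introduced only for this example.

```python
# Python 3 sketch; raw stands in for the bytes returned by f.read().
import xml.sax
from xml.sax.handler import ContentHandler

class DataCollector(ContentHandler):
    """Hypothetical handler that records every data="..." attribute."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        if "data" in attrs:
            self.values.append(attrs["data"])

raw = '<?xml version="1.0"?><condition data="Meistens bew\u00f6lkt"/>'.encode("cp1252")

# Transcode: cp1252 bytes -> text -> UTF-8 bytes. UTF-8 is what the parser
# assumes when the header declares no encoding.
utf8_xml = raw.decode("cp1252").encode("utf-8")

collector = DataCollector()
xml.sax.parseString(utf8_xml, collector)
print(collector.values == ["Meistens bew\u00f6lkt"])  # True
```

This only works if the guess of cp1252 is right for the data in question.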
 
Diez B. Roggisch
      10-18-2009
Arian Kuschki wrote:
> Do I understand right that since the encoding was not specified in the xml
> response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
> would not have had the encoding problem in the first place?


Yes. XML without explicit encoding is implicitly UTF-8, and the page is
borked using cp* or latin* without saying so.


Diez
 
Diez B. Roggisch
      10-18-2009
Diez B. Roggisch wrote:
> Yes. XML without explicit encoding is implicitly UTF-8, and the page is
> borked using cp* or latin* without saying so.


OK, after reading some other posts in this thread, this assumption seems
not to hold. The HTTP protocol allows other encodings to be given
implicitly, which I think is an atrocity.

Diez
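
When the encoding comes from the HTTP layer rather than the XML header, it can be read from the Content-Type header. A Python 3 sketch under stated assumptions: the header value and the payload are invented, and a plain email.message.Message stands in for real response headers.

```python
# Python 3 sketch; header value and payload are invented for illustration.
from email.message import Message

headers = Message()
headers["Content-Type"] = "text/xml; charset=ISO-8859-1"

# get_content_charset() returns the charset parameter in lower case,
# or the given fallback when the header names none.
charset = headers.get_content_charset("utf-8")
print(charset)  # iso-8859-1

raw = b'<condition data="bew\xf6lkt"/>'
text = raw.decode(charset)
print(text == '<condition data="bew\u00f6lkt"/>')  # True
```

In Python 3 the headers attribute of a urllib.request response is a subclass of Message, so the same call should be usable on a live response.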
 