Velocity Reviews

MRAB 10-17-2009 04:14 PM

Re: umlauts
 
Arian Kuschki wrote:
> Hi all
>
> this has been bugging me for a long time and I do not seem to be able to
> understand what to do. I always have problems when dealing with input text that
> contains umlauts. Consider the following:
>
> In [1]: import urllib
>
> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>
> In [3]: xml = f.read()
>
> In [4]: f.close()
>
> In [5]: print xml
> ------> print(xml)
> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
> <forecast_information><city data="Munich, BY"/><postal_code data="Muenchen"/>
> <latitude_e6 data=""/><longitude_e6 data=""/><forecast_date data="2009-10-17"/>
> <current_date_time data="2009-10-17 14:20:00 +0000"/><unit_system data="SI"/>
> </forecast_information><current_conditions><condition data="Meistens bew�kt"/>
> <temp_f data="43"/><temp_c data="6"/><humidity data="Feuchtigkeit: 87�%"/>
> <icon data="/ig/images/weather/mostly_cloudy.gif"/>
> <wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/>
> </current_conditions><forecast_conditions><day_of_week data="Sa."/><low data="1"/>
> <high data="7"/><icon data="/ig/images/weather/chance_of_rain.gif"/>
> <condition data="Vereinzelt Regen"/></forecast_conditions>
> <forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/>
> <icon data="/ig/images/weather/chance_of_snow.gif"/>
> <condition data="Vereinzelt Schnee"/></forecast_conditions>
> <forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/>
> <icon data="/ig/images/weather/mostly_sunny.gif"/><condition data="Teils sonnig"/>
> </forecast_conditions><forecast_conditions><day_of_week data="Di."/><low data="0"/>
> <high data="8"/><icon data="/ig/images/weather/sunny.gif"/>
> <condition data="Klar"/></forecast_conditions></weather></xml_api_reply>
>
> As you can see the umlauts in the XML are not displayed properly. When I want
> to process this text (for example with xml.sax), I get error messages because
> the parser can't read this.
>
> I've tried to read up on this and there is a lot of information on the web, but
> nothing seems to work for me. For example setting the coding to UTF like this:
> # -*- coding: utf-8 -*- or using the decode() string method.
>
> I always have this kind of problem when input contains umlauts, not just in
> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>

The string you received from the website is a bytestring and you're just
printing it to your console, which is configured for UTF-8. However, the
bytestring isn't valid UTF-8, so the console is replacing the invalid
parts with the funny characters.

You should decode the bytestring to Unicode and then re-encode it to
UTF-8. I don't know what encoding the website is actually using; here
I'm assuming ISO-8859-1:

print xml.decode("iso-8859-1").encode("utf-8")
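
A minimal sketch of the decode-then-re-encode step in Python 3 syntax (the thread itself uses Python 2). The sample bytes are an assumption: they are what a server would send for "Meistens bewölkt" if it encoded the text as ISO-8859-1, not data captured from the site. The helper names are hypothetical.

```python
# Assumed sample: "Meistens bewölkt" encoded as ISO-8859-1 (0xF6 is "ö").
raw = b'data="Meistens bew\xf6lkt"'


def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_text(data, encoding="iso-8859-1"):
    """Decode a bytestring to a Unicode string using the given encoding."""
    return data.decode(encoding)


# The raw bytes are not valid UTF-8, which is why a UTF-8 console
# shows replacement characters when they are printed unchanged.
text = to_text(raw)

# Re-encode for a UTF-8 console or file.
utf8_bytes = text.encode("utf-8")
```

In Python 3, printing `text` directly is usually enough, since `print` encodes the string for the console itself.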

Diez B. Roggisch 10-17-2009 04:54 PM

Re: umlauts
 
MRAB wrote:
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.


This is weird. I looked at the site in Firefox - and it was displayed
correctly, including umlauts. Bringing up the info-dialog claims the
page is UTF-8, the XML itself says so as well (implicit, through the
missing declaration of an encoding) - but it clearly is *not* utf-8.

One would expect google to be better at this...

Diez

StarWing 10-17-2009 04:55 PM

Re: umlauts
 
On Oct 18, 12:14 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.
>
> You should decode the bytestring to Unicode and then re-encode it to
> UTF-8. I don't know what encoding the website is actually using; here
> I'm assuming ISO-8859-1:
>
> print xml.decode("iso-8859-1").encode("utf-8")


In 2.6, str.decode returns unicode, so you can print it directly.
In 3.1, bytes.decode returns str, so you can also print it directly.

So just decode("cp1252"); that's enough.
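
A short sketch of how cp1252 and ISO-8859-1 relate, since both have been suggested in this thread. The sample bytes are assumptions chosen for illustration. For German umlauts the two codecs give the same result; they differ only for bytes in the range 0x80 to 0x9F.

```python
# Assumed sample: "bewölkt" with "ö" stored as the single byte 0xF6.
umlaut_bytes = b"bew\xf6lkt"

# Both codecs map 0xF6 to "ö", so either works for this text.
as_cp1252 = umlaut_bytes.decode("cp1252")
as_latin1 = umlaut_bytes.decode("iso-8859-1")

# Byte 0x80 is where they diverge: cp1252 maps it to the euro sign,
# ISO-8859-1 maps it to the C1 control character U+0080.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("iso-8859-1")
```

Which codec is the better guess depends on the server; when the response carries a charset in its headers, that is the safer source than either guess.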

Arian Kuschki 10-17-2009 05:54 PM

Re: umlauts
 
I just checked and I see the following in the headers:
Content-Type text/xml; charset=UTF-8

Where does it say ISO-8859-1?

On Sat 17, 20:57 +0200, I V wrote:

> On Sat, 17 Oct 2009 18:54:10 +0200, Diez B. Roggisch wrote:
>
> > This is weird. I looked at the site in Firefox - and it was displayed
> > correctly, including umlauts. Bringing up the info-dialog claims the
> > page is UTF-8, the XML itself says so as well (implicit, through the
> > missing declaration of an encoding) - but it clearly is *not* utf-8.

>
> The headers correctly identify it as ISO-8859-1, which overrides the
> implicit specification of UTF-8. I'm not sure why Firefox is reporting it
> as UTF-8 (it does that for me, too); I can see the umlauts, so it's
> clearly processing it as ISO-8859-1.

Arian Kuschki 10-17-2009 07:07 PM

Re: umlauts
 
Hm yes, that is true. In Firefox on the other hand, the response header is
"Content-Type text/xml; charset=UTF-8"

On Sat 17, 13:16 -0700, Mark Tolonen wrote:

>
> "Diez B. Roggisch" <deets@nospam.web.de> wrote in message
> news:7jub5rF37divlU4@mid.uni-berlin.de...
> [snip]
> >This is weird. I looked at the site in Firefox - and it was
> >displayed correctly, including umlauts. Bringing up the
> >info-dialog claims the page is UTF-8, the XML itself says so as
> >well (implicit, through the missing declaration of an encoding) -
> >but it clearly is *not* utf-8.
> >
> >One would expect google to be better at this...
> >
> >Diez

>
> According to the XML 1.0 specification:
>
> "Although an XML processor is required to read only entities in the
> UTF-8 and UTF-16 encodings, it is recognized that other encodings
> are used around the world, and it may be desired for XML processors
> to read entities that use them. In the absence of external character
> encoding information (such as MIME headers), parsed entities which
> are stored in an encoding other than UTF-8 or UTF-16 must begin with
> a text declaration..."
>
> So UTF-8 and UTF-16 are the defaults supported without an xml
> declaration in the absence of external encoding information. But we
> have external character encoding information:
>
> >>>f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>f.headers.dict['content-type']

> 'text/xml; charset=ISO-8859-1'
>
> So the page seems correct.
>
> -Mark

Mark Tolonen 10-17-2009 08:16 PM

Re: umlauts
 

"Diez B. Roggisch" <deets@nospam.web.de> wrote in message
news:7jub5rF37divlU4@mid.uni-berlin.de...
[snip]
> This is weird. I looked at the site in Firefox - and it was displayed
> correctly, including umlauts. Bringing up the info-dialog claims the page
> is UTF-8, the XML itself says so as well (implicit, through the missing
> declaration of an encoding) - but it clearly is *not* utf-8.
>
> One would expect google to be better at this...
>
> Diez


According to the XML 1.0 specification:

"Although an XML processor is required to read only entities in the UTF-8
and UTF-16 encodings, it is recognized that other encodings are used around
the world, and it may be desired for XML processors to read entities that
use them. In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other than
UTF-8 or UTF-16 must begin with a text declaration..."

So UTF-8 and UTF-16 are the defaults supported without an xml declaration in
the absence of external encoding information. But we have external
character encoding information:

>>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>> f.headers.dict['content-type']

'text/xml; charset=ISO-8859-1'

So the page seems correct.

-Mark
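
A possible way to act on the external encoding information, sketched in Python 3: read the charset parameter out of the Content-Type value and decode the body with it. The function names and the UTF-8 fallback are assumptions for illustration; the header string is the one shown above, and the sample body bytes are made up.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode response bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))


# Header value as reported above; the body bytes are an assumed sample.
header = "text/xml; charset=ISO-8859-1"
decoded = decode_body(b"bew\xf6lkt", header)
```

Note that `get_content_charset` returns the charset name in lower case, which Python's codec lookup accepts.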



Neil Hodgson 10-17-2009 09:43 PM

Re: umlauts
 
The server is sniffing the User-Agent header to decide whether to
send UTF-8 or ISO-8859-1. Try this code:

import urllib2
r = urllib2.Request("http://www.google.de/ig/api?weather=Muenchen",
                    None, {"User-Agent": "Mozilla/5.0"})
f = urllib2.urlopen(r)
i = f.info()
print(i)
xml = f.read()
f.close()
print(xml)

Neil
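
For readers on Python 3, a rough equivalent of the snippet above using urllib.request, which replaced urllib2. This is a sketch, not a tested reproduction: the helper names and the UTF-8 fallback are assumptions, and the URL from the thread may no longer respond. The request is only built here; `fetch` performs the network call when invoked.

```python
import urllib.request

# URL quoted from the thread; it may no longer be served.
WEATHER_URL = "http://www.google.de/ig/api?weather=Muenchen"


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", fallback="utf-8"):
    """Fetch a URL and decode the body using the charset the server declares.

    `fallback` (an assumption) is used when the response has no charset.
    """
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)


request = build_request(WEATHER_URL)
```

Decoding with the declared charset means the same code works whichever encoding the server picks for a given User-Agent.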

