Velocity Reviews


Dylan 12-12-2004 01:28 AM

character encoding conversion
 

Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running into:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word

I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.

Things I have tried include encode()/decode(), and replacement lookup
tables (i.e. something like
http://groups-beta.google.com/group/...991de6ced3406b
). However, I am still unable to convert the characters to something
meaningful. In the case of the lookup table, this failed because all of
the improperly encoded characters came back as ? rather than
their original encoding.

I'm using urllib and htmllib to open, read, and parse the html
fragments, Python 2.3 on OS X 10.3

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/




Martin v. Löwis 12-12-2004 04:51 PM

Re: character encoding conversion
 
Dylan wrote:
> Things I have tried include encode()/decode()


This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then

htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

will give you a file that contains only ASCII characters, and
character references for everything else.
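
As a small illustration of that round trip, here is a sketch in Python 3 (an assumption; the thread itself uses Python 2, where a plain str plays the role of bytes). The sample bytes are made up for the example:

```python
# cp1252 curly quotes (0x93 and 0x94) around the word Hello
raw = b"\x93Hello\x94 said the author"

text = raw.decode("cp1252")                                # bytes -> Unicode text
ascii_only = text.encode("us-ascii", "xmlcharrefreplace")  # non-ASCII -> &#NNNN;

print(ascii_only)   # b'&#8220;Hello&#8221; said the author'
```

The result contains only ASCII bytes, so it can be dropped into a page regardless of the charset that page declares.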

Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
range(128,160)
6. use cp1252
7. use Latin-1

Try these in order from 1 to 6, checking whether you manage to decode
the input. Notice that step 5 will always decode successfully;
consider it a failure if you get any control characters (from
range(128, 160)), and only then fall back to Latin-1 again in
step 7.

When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.

Regards,
Martin

Christian Ergh 12-12-2004 07:29 PM

Re: character encoding conversion
 
I have a similar problem, with characters like A and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes not even giving any encoding information in the header. But
your solution sounds quite good, i just do not know if
- it works with the characters i mentioned
- what encoding do you have in the end
- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?
Thanx in advance for the help
Chris

Martin v. Löwis 12-12-2004 10:13 PM

Re: character encoding conversion
 
Christian Ergh wrote:
> - it works with the characters i mentioned


It does.

> - what encoding do you have in the end


US-ASCII

> - and how exactly are you doing all this? All with somestring.decode()
> or... Can you please give an example for these 7 steps?


I could, but I don't have the time - just try to come up with some
code, and I'll try to comment on it.

Regards,
Martin

Christian Ergh 12-13-2004 08:41 AM

Re: character encoding conversion
 
Something like this?
Chris

import urllib2

url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = ???
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 128:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")

Steven Bethard 12-13-2004 08:58 AM

Re: character encoding conversion
 
Christian Ergh wrote:
> flag = true
> for char in data:
>     if 127 < ord(char) < 128:
>         flag = false
> if flag:
>     try:
>         data = data.encode('latin-1')
>     except:
>         pass


A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a for-loop is for:

for char in data:
    if 127 < ord(char) < 128:
        break
else:
    try:
        data = data.encode('latin-1')
    except:
        pass

Only saves you one line of code, but you don't have to keep track of a
'flag' variable. Generally, I find that when I want to set a 'flag'
variable, I can usually do it with a for/else instead.
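
For anyone who has not met for/else before, here is a self-contained sketch of how it behaves. It is written for Python 3 (an assumption; there, iterating over bytes yields integers, so no ord() is needed), and the function name first_c1_byte is made up for the example:

```python
def first_c1_byte(data):
    """Return the first byte value in range(128, 160), or None if there is none."""
    for byte in data:
        if 128 <= byte < 160:
            found = byte
            break              # leaving via break skips the else clause
    else:
        found = None           # runs only if the loop ended without a break
    return found


print(first_c1_byte(b"plain ascii"))    # None
print(first_c1_byte(b"\x93quoted\x94"))  # 147
```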

Steve

[1] Messed up indentation happens in a lot of clients if you have tabs
in your code. If you can replace tabs with spaces before posting, this
usually solves the problem.

Peter Otten 12-13-2004 09:09 AM

Re: character encoding conversion
 
Even more off-topic:

>>> for char in data:
...     if 127 < ord(char) < 128:
...         break
...
>>> print char
127.5

:-)

Peter


Christian Ergh 12-13-2004 09:32 AM

Re: character encoding conversion
 
Well yes, that happens when doing a quick hack and not reviewing it, 128
has to be 160 of course...

Christian Ergh 12-13-2004 09:37 AM

Re: character encoding conversion
 
Once more, indentation should be correct now, and the 128 is gone too. So,
something like this?
Chris

import urllib2

url = 'http://www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except (UnicodeError, LookupError):
    try:
        data = data.decode(xmlencoding)
    except (UnicodeError, LookupError):
        try:
            data = data.decode(htmlmetaencoding)
        except (UnicodeError, LookupError):
            try:
                data = data.decode('UTF-8')
            except UnicodeError:
                # latin-1 always decodes, so only accept it if the raw
                # data has no bytes in range(128, 160)
                flag = True
                for char in data:
                    if 127 < ord(char) < 160:
                        flag = False
                if flag:
                    data = data.decode('latin-1')
                else:
                    try:
                        data = data.decode('cp1252')
                    except UnicodeError:
                        data = data.decode('latin-1')
# data is unicode by now; non-ASCII becomes character references
data = data.encode("ascii", "xmlcharrefreplace")
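
On the question in the code comment about where the page encoding comes from: it is normally the charset parameter of the HTTP Content-Type header, with the http-equiv meta element as the in-page fallback. Below is a rough sketch in Python 3 (an assumption; the thread uses Python 2). The helper names charset_from_content_type and charset_from_meta and the regular expression are made up for illustration, and the regex is a crude sniff, not a real HTML parser:

```python
import re
from email.message import Message


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()   # lowercased name, or None when absent


# Matches charset=... inside a <meta ...> tag; a heuristic only.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def charset_from_meta(data):
    """Look for a charset in a <meta> element within the first 2048 bytes."""
    match = META_CHARSET.search(data[:2048])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head>'
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_meta(page))                                     # windows-1252
```

With a live response from urllib.request.urlopen(), the headers attribute is the same kind of message object, so response.headers.get_content_charset() should give the header value directly.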


Max M 12-13-2004 09:48 AM

Re: character encoding conversion
 
A simple way to try out different encodings in a given order:

# -*- coding: latin-1 -*-

def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass


st = 'Test characters æøå ÆØÅ'
encodings = ['utf-8', 'latin-1', 'ascii', ]
print get_encoded(st, encodings)

(u'Test characters \xe6\xf8\xe5 \xc6\xd8\xc5', 'latin-1')
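
The same try-in-order idea can be stretched to cover the seven-step order suggested earlier in the thread. This is only a sketch, written for Python 3 (an assumption; the thread uses Python 2), and the function name guess_and_decode and its declared parameter are invented for the example:

```python
def guess_and_decode(data, declared=()):
    """Decode bytes by trying candidate encodings in the thread's order.

    `declared` holds encoding names taken from the HTTP header, the XML
    declaration and the meta element, in that order; entries may be None.
    Returns (text, encoding_used).
    """
    # Steps 1-4: declared encodings first, then UTF-8.
    for name in list(declared) + ["utf-8"]:
        if not name:
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            pass                      # wrong or unknown encoding: try the next

    # Step 5: Latin-1 always decodes, so reject it if C1 control bytes appear.
    if not any(128 <= b < 160 for b in data):
        return data.decode("latin-1"), "latin-1"

    # Step 6: cp1252 assigns printable characters to most of that range.
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Step 7: a few bytes (0x81, for example) are undefined in cp1252.
        return data.decode("latin-1"), "latin-1"


text, used = guess_and_decode(b"\x93quoted\x94", declared=[None, "no-such-codec"])
print(used)                                        # cp1252
print(text.encode("ascii", "xmlcharrefreplace"))   # b'&#8220;quoted&#8221;'
```

One caveat with this order: if a page declares Latin-1 but actually contains cp1252 bytes, step 1 succeeds anyway, so it may be worth running the range(128, 160) check on declared Latin-1 as well.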

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

