Python3: Sane way to deal with broken encodings

 
 
Bruno Desthuilliers
      12-06-2009
Johannes Bauer wrote:
> Dear all,
>
> I've some applications which fetch HTML documents off the web, parse
> their content and do stuff with it. Every once in a while it happens
> that the web site administrators put up files which are encoded in a
> wrong manner.
>
> Thus my Python script dies a horrible death:
>
>   File "./update_db", line 67, in <module>
>     for line in open(tempfile, "r"):
>   File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte
>
> This is usually the right behavior, but I'd like to be able to tell Python:
> "Don't worry, some idiot encoded that file, just skip over such
> parts/replace them by some character sequence".
>
> Is that possible? If so, how?


This might get you started:

"""
>>> help(str.decode)

decode(...)
S.decode([encoding[,errors]]) -> object

Decodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
as well as any other name registered with codecs.register_error that is
able to handle UnicodeDecodeErrors.
"""
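
A note for Python 3 readers: there the decode method lives on bytes rather than str, and it takes the same errors argument. A minimal sketch, assuming the stray byte is the 0x92 from the traceback; the sample data and the handler name cp1252_fallback are made up for illustration:

```python
import codecs

# Hypothetical sample: valid UTF-8 text with one stray cp1252 byte (0x92).
data = b"it\x92s broken"

# 'strict' (the default) raises UnicodeDecodeError on the bad byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD, 'ignore' drops the offending byte.
replaced = data.decode("utf-8", errors="replace")
ignored = data.decode("utf-8", errors="ignore")


def cp1252_fallback(exc):
    """Illustrative handler: reinterpret undecodable bytes as cp1252."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors='replace' here covers the few bytes cp1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), exc.end


# Register the handler under a name usable as an errors= value.
codecs.register_error("cp1252_fallback", cp1252_fallback)
recovered = data.decode("utf-8", errors="cp1252_fallback")
```

Whether a cp1252 fallback is appropriate depends on where the files come from; it is only a guess about what the broken bytes were meant to be.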

HTH
 
Johannes Bauer
      12-06-2009
Dear all,

I've some applications which fetch HTML documents off the web, parse
their content and do stuff with it. Every once in a while it happens
that the web site administrators put up files which are encoded in a
wrong manner.

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

This is usually the right behavior, but I'd like to be able to tell Python:
"Don't worry, some idiot encoded that file, just skip over such
parts/replace them by some character sequence".

Is that possible? If so, how?

Kind regards,
Johannes

--
"Strong potentials can result in strong earthquakes; small ones can
arise as well - and "you" will not think it possible (!), but behold:
it is also possible that no earthquakes result at all."
-- "Rüdiger Thomas" alias Thomas Schulz in dsa on his "predictions"
 
Johannes Bauer
      12-07-2009
Bruno Desthuilliers wrote:

>> Is that possible? If so, how?
>
> This might get you started:
>
> """
> >>> help(str.decode)
>
> decode(...)
> S.decode([encoding[,errors]]) -> object


Hmm, this would work nicely if I called "decode" explicitly - but what
I'm doing is:

#!/usr/bin/python3
for line in open("broken", "r"):
    pass

Which still raises the UnicodeDecodeError when I do not even do any
decoding explicitly. How can I achieve this?

Kind regards,
Johannes

Benjamin Kaplan
      12-07-2009
On Mon, Dec 7, 2009 at 2:16 PM, Johannes Bauer wrote:
> Bruno Desthuilliers wrote:
>
>>> Is that possible? If so, how?
>>
>> This might get you started:
>>
>> """
>> >>> help(str.decode)
>>
>> decode(...)
>> S.decode([encoding[,errors]]) -> object
>
> Hmm, this would work nicely if I called "decode" explicitly - but what
> I'm doing is:
>
> #!/usr/bin/python3
> for line in open("broken", "r"):
>     pass
>
> Which still raises the UnicodeDecodeError when I do not even do any
> decoding explicitly. How can I achieve this?
>
> Kind regards,
> Johannes


Looking at the Python 3 docs, it seems that open() takes encoding
and errors as optional arguments. So you can call
open('broken', 'r', errors='replace')
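
A minimal sketch of that call, assuming the file is meant to be UTF-8; the temporary file and its contents are made up so the example is self-contained:

```python
import os
import tempfile

# Create a throwaway file containing one byte (0x92) that is not valid UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nit\x92s broken\n")

# errors='replace' makes the text layer substitute U+FFFD instead of raising.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
print(lines)
```

Passing encoding explicitly is optional, but it keeps the result independent of the platform's default encoding.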

 
Martin v. Loewis
      12-08-2009
> Thus my Python script dies a horrible death:
>
>   File "./update_db", line 67, in <module>
>     for line in open(tempfile, "r"):
>   File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte
>
> This is usually the right behavior, but I'd like to be able to tell Python:
> "Don't worry, some idiot encoded that file, just skip over such
> parts/replace them by some character sequence".
>
> Is that possible? If so, how?


As Benjamin says: if you pass errors='replace' to open, then it will
replace the faulty characters; if you pass errors='ignore', it will
skip over them.

Alternatively, you can open the files in binary ('rb'), so that no
decoding will be attempted at all, or you can specify latin-1 as
the encoding, which means that you can decode all files successfully
(though possibly not correctly).
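
The remaining options can be compared side by side. A rough sketch, with a made-up sample file; the comments describe what each variant yields for a stray 0x92 byte:

```python
import os
import tempfile

# Throwaway file with one byte (0x92) that is not valid UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"it\x92s broken\n")

# errors='ignore': the undecodable byte is silently dropped.
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

# Binary mode: no decoding happens, you get the raw bytes back.
with open(path, "rb") as f:
    raw = f.read()

# latin-1 maps every byte to a code point, so decoding never fails,
# but 0x92 becomes the control character U+0092, not an apostrophe.
with open(path, "r", encoding="latin-1") as f:
    latin1 = f.read()

os.remove(path)
```

The latin-1 result shows what "possibly not correctly" means in practice: nothing is lost, but the text may contain characters the author never intended.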

Regards,
Martin
 