catch UnicodeDecodeError

 
 
jaroslav.dobrek@gmail.com
Guest
Posts: n/a
 
      07-25-2012
Hello,

very often I have the following problem: I write a program that processes many files which it assumes to be encoded in utf-8. Then, some day, there is a non-utf-8 character in one of several hundred or thousand (new) files. The program exits with an error message like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 60: invalid continuation byte

I usually solve the problem by moving files around and by recoding them.

What I really want to do is use something like

try:
    # open file, read line, or do something else, I don't care
except UnicodeDecodeError:
    sys.exit("Found a bad char in file " + file + " line " + str(line_number))

Yet, no matter where I put this try-except, it doesn't work.

How should I use try-except with UnicodeDecodeError?

Jaroslav
 
 
Andrew Berg
Guest
Posts: n/a
 
      07-25-2012
On 7/25/2012 6:05 AM, (E-Mail Removed) wrote:
> What I really want to do is use something like
>
> try:
>     # open file, read line, or do something else, I don't care
> except UnicodeDecodeError:
>     sys.exit("Found a bad char in file " + file + " line " + str(line_number))
>
> Yet, no matter where I put this try-except, it doesn't work.
>
> How should I use try-except with UnicodeDecodeError?

The same way you handle any other exception. The traceback will tell you
the exact line that raised the exception. It helps us help you if you
include the full traceback and give more detail than "it doesn't work".

--
CPython 3.3.0b1 | Windows NT 6.1.7601.17803
 
 
Philipp Hagemeister
Guest
Posts: n/a
 
      07-25-2012
Hi Jaroslav,

you can catch a UnicodeDecodeError just like any other exception. Can
you provide a full example program that shows your problem?

This works fine on my system:


import sys
open('tmp', 'wb').write(b'\xff\xff')
try:
    buf = open('tmp', 'rb').read()
    buf.decode('utf-8')
except UnicodeDecodeError as ude:
    sys.exit("Found a bad char in file " + "tmp")


Note that you cannot possibly determine the line number if you don't
know what encoding the file is in (and what EOL it uses).

What you can do is count the number of bytes with the value 10 before
ude.start, like this:

lineGuess = buf[:ude.start].count(b'\n') + 1
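
Put together, the guess above could look like the following minimal sketch. The helper name guess_line_number and the sample bytes are made up for illustration, and it assumes lines end in a plain b'\n' byte, which holds for UTF-8 and other ASCII-compatible encodings but not for UTF-16.

```python
def guess_line_number(buf, encoding="utf-8"):
    """Return None if buf decodes cleanly, else a 1-based line-number guess."""
    try:
        buf.decode(encoding)
    except UnicodeDecodeError as ude:
        # ude.start is the byte offset of the first undecodable byte;
        # count the newline bytes (value 10) that come before it.
        return buf[:ude.start].count(b"\n") + 1
    return None


# Hypothetical sample data: the stray 0xe4 byte sits on the third line.
data = b"first line\nsecond line\nbad byte \xe4 here\n"
print(guess_line_number(data))
```

The result is only a guess: it is as good as the assumption about the line ending.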

- Philipp

On 07/25/2012 01:05 PM, (E-Mail Removed) wrote:
> it doesn't work



 
jaroslav.dobrek@gmail.com
Guest
Posts: n/a
 
      07-25-2012
On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> Hi Jaroslav,
>
> you can catch a UnicodeDecodeError just like any other exception. Can
> you provide a full example program that shows your problem?
>
> This works fine on my system:
>
>
> import sys
> open('tmp', 'wb').write(b'\xff\xff')
> try:
>     buf = open('tmp', 'rb').read()
>     buf.decode('utf-8')
> except UnicodeDecodeError as ude:
>     sys.exit("Found a bad char in file " + "tmp")
>


Thank you. I got it. What I need to do is explicitly decode text.

But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.

What I am missing (especially for Python3) is something like:

try:
    for line in sys.stdin:
except UnicodeDecodeError:
    sys.exit("Encoding problem in line " + str(line_number))

I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
 
 
Dave Angel
Guest
Posts: n/a
 
      07-25-2012
On 07/25/2012 08:09 AM, (E-Mail Removed) wrote:
> On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
>> Hi Jaroslav,
>>
>> you can catch a UnicodeDecodeError just like any other exception. Can
>> you provide a full example program that shows your problem?
>>
>> This works fine on my system:
>>
>>
>> import sys
>> open('tmp', 'wb').write(b'\xff\xff')
>> try:
>>     buf = open('tmp', 'rb').read()
>>     buf.decode('utf-8')
>> except UnicodeDecodeError as ude:
>>     sys.exit("Found a bad char in file " + "tmp")
>>

> Thank you. I got it. What I need to do is explicitly decode text.
>
> But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
>
> What I am missing (especially for Python3) is something like:
>
> try:
>     for line in sys.stdin:
> except UnicodeDecodeError:
>     sys.exit("Encoding problem in line " + str(line_number))
>
> I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.


I can't understand your question. If the problem is that the system
doesn't magically produce a variable called line_number, then generate
it yourself, by counting in the loop.

Don't forget that you can tell the unicode decoder to ignore bad
characters, or to convert them to a specified placeholder.
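
As a small illustration of those two options, here is a sketch using the standard "ignore" and "replace" error handlers of bytes.decode(). The sample bytes are made up; the built-in open() accepts the same errors argument in Python 3.

```python
# Hypothetical sample: 0xe4 is "ä" in Latin-1 but is not valid UTF-8 here.
raw = b"caf\xe4 au lait"

# "ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD, the official replacement character.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Both handlers lose information, so they fit best where a damaged character is tolerable and aborting the whole run is not.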



--

DaveA

 
 
Jaroslav Dobrek
Guest
Posts: n/a
 
      07-26-2012
On Jul 25, 8:50 pm, Dave Angel <(E-Mail Removed)> wrote:
> On 07/25/2012 08:09 AM, (E-Mail Removed) wrote:
>
> > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> >> Hi Jaroslav,
> >>
> >> you can catch a UnicodeDecodeError just like any other exception. Can
> >> you provide a full example program that shows your problem?
> >>
> >> This works fine on my system:
> >>
> >> import sys
> >> open('tmp', 'wb').write(b'\xff\xff')
> >> try:
> >>     buf = open('tmp', 'rb').read()
> >>     buf.decode('utf-8')
> >> except UnicodeDecodeError as ude:
> >>     sys.exit("Found a bad char in file " + "tmp")
> >
> > Thank you. I got it. What I need to do is explicitly decode text.
> >
> > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
> >
> > What I am missing (especially for Python3) is something like:
> >
> > try:
> >     for line in sys.stdin:
> > except UnicodeDecodeError:
> >     sys.exit("Encoding problem in line " + str(line_number))
> >
> > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
>
> i can't understand your question. if the problem is that the system
> doesn't magically produce a variable called line_number, then generate
> it yourself, by counting
> in the loop.



That was just a very incomplete and general example.

My problem is solved. What I need to do is explicitly decode text when
reading it. Then I can catch exceptions. I might do this in future
programs.

What I dislike about this solution is that it complicates most programs
unnecessarily. In programs that open, read and process many files I
don't want to explicitly decode and encode characters all the time. I
just want to write:

for line in f:

or something like that. Yet, writing this means decoding text
*implicitly*. And, because the decoding is implicit, you cannot say

try:
    for line in f: # here text is decoded implicitly
        do_something()
except UnicodeDecodeError():
    do_something_different()

This isn't possible for syntactic reasons.

The problem is that the vast majority of the thousands of files that I
process are correctly encoded. But then, suddenly, there is a bad
character in a new file. (This is so because most files today are
generated by people who don't know that there is such a thing as
encodings.) And then I need to rewrite my very complex program just
because of one single character in one single file.
 
 
Stefan Behnel
Guest
Posts: n/a
 
      07-26-2012
Jaroslav Dobrek, 26.07.2012 09:46:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.


Yes, that's the standard procedure. Decode on the way in, encode on the way
out, use Unicode everywhere in between.


> I dislike about this solution that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
>
> for line in f:


And the cool thing is: you can!

In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

from io import open

filename = "somefile.txt"
try:
    with open(filename, encoding="utf-8") as f:
        for line in f:
            process_line(line) # actually, I'd use "process_file(f)"
except IOError as e:
    print("Reading file %s failed: %s" % (filename, e))
except UnicodeDecodeError as e:
    print("Some error occurred decoding file %s: %s" % (filename, e))


Ok, maybe with a better way to handle the errors than "print" ...

For older Python versions, you'd use "codecs.open()" instead. That's a bit
messy, but only because it was finally cleaned up for Python 3.


> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
>
> try:
>     for line in f: # here text is decoded implicitly
>         do_something()
> except UnicodeDecodeError():
>     do_something_different()
>
> This isn't possible for syntactic reasons.


Well, you'd normally want to leave out the parentheses after the exception
type, but otherwise, that's perfectly valid Python code. That's how these
things work.
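
To make that concrete, here is a self-contained sketch for Python 3 showing that an exception raised by the implicit decoding inside the loop is caught by a try/except around it. The function name file_decodes_cleanly and the temporary file names are made up for the demonstration.

```python
import os
import tempfile


def file_decodes_cleanly(path, encoding="utf-8"):
    """Return True if every line of the file can be decoded, else False."""
    try:
        with open(path, encoding=encoding) as f:
            for line in f:  # text is decoded implicitly here
                pass        # process the decoded line here
    except UnicodeDecodeError:
        return False
    return True


# Demo with two throwaway files: one valid UTF-8, one with a stray 0xe4 byte.
with tempfile.TemporaryDirectory() as tmpdir:
    good = os.path.join(tmpdir, "good.txt")
    bad = os.path.join(tmpdir, "bad.txt")
    with open(good, "wb") as f:
        f.write("gr\u00fc\u00dfe\n".encode("utf-8"))
    with open(bad, "wb") as f:
        f.write(b"ok\nbad \xe4 byte\n")
    results = (file_decodes_cleanly(good), file_decodes_cleanly(bad))

print(results)
```

One caveat, as far as I know: the text layer decodes the file in chunks, so the exception can be raised before earlier lines from the same chunk have been handed to the loop. A counter kept inside the loop is therefore not a reliable line number for the bad byte.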


> The problem is that vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.


Why would that be the case? The places to change should be very local in
your code.

Stefan


 
 
Chris Angelico
Guest
Posts: n/a
 
      07-26-2012
On Thu, Jul 26, 2012 at 5:46 PM, Jaroslav Dobrek
<(E-Mail Removed)> wrote:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.


Apologies if it's already been said (I'm only skimming this thread),
but ISTM that you want to open the file in binary mode. You'll then
get back a bytes() instead of a str(), and you can attempt to decode
it separately. You may then need to do your own division into lines
that way, though.
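
A rough sketch of that idea follows: iterate over the file in binary mode and decode each line separately, so the line number of every bad line is known exactly. The name find_undecodable_lines is hypothetical, io.BytesIO stands in for a file opened with mode 'rb', and the sketch assumes b'\n' line endings, which is fine for UTF-8.

```python
import io


def find_undecodable_lines(binary_file, encoding="utf-8"):
    """Yield (line_number, raw_line) for each line that fails to decode."""
    # Iterating a binary file splits on b"\n" without decoding anything.
    for line_number, raw_line in enumerate(binary_file, start=1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError:
            yield line_number, raw_line


# Hypothetical sample: only the second line contains an invalid byte.
sample = io.BytesIO(b"ok\nbad \xe4 byte\nfine\n")
bad_lines = list(find_undecodable_lines(sample))
print(bad_lines)
```

For a real file, pass the object returned by open(path, 'rb') instead of the BytesIO stand-in.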

ChrisA
 
 
wxjmfauth@gmail.com
Guest
Posts: n/a
 
      07-26-2012
On Thursday, July 26, 2012 9:46:27 AM UTC+2, Jaroslav Dobrek wrote:
> On Jul 25, 8:50 pm, Dave Angel <(E-Mail Removed)> wrote:
> > On 07/25/2012 08:09 AM, (E-Mail Removed) wrote:
> >
> > > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> > >> Hi Jaroslav,
> > >>
> > >> you can catch a UnicodeDecodeError just like any other exception. Can
> > >> you provide a full example program that shows your problem?
> > >>
> > >> This works fine on my system:
> > >>
> > >> import sys
> > >> open('tmp', 'wb').write(b'\xff\xff')
> > >> try:
> > >>     buf = open('tmp', 'rb').read()
> > >>     buf.decode('utf-8')
> > >> except UnicodeDecodeError as ude:
> > >>     sys.exit("Found a bad char in file " + "tmp")
> > >
> > > Thank you. I got it. What I need to do is explicitly decode text.
> > >
> > > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
> > >
> > > What I am missing (especially for Python3) is something like:
> > >
> > > try:
> > >     for line in sys.stdin:
> > > except UnicodeDecodeError:
> > >     sys.exit("Encoding problem in line " + str(line_number))
> > >
> > > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
> >
> > i can't understand your question. if the problem is that the system
> > doesn't magically produce a variable called line_number, then generate
> > it yourself, by counting
> > in the loop.
>
>
> That was just a very incomplete and general example.
>
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.
>
> I dislike about this solution that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
>
> for line in f:
>
> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
>
> try:
>     for line in f: # here text is decoded implicitly
>         do_something()
> except UnicodeDecodeError():
>     do_something_different()
>
> This isn't possible for syntactic reasons.
>
> The problem is that vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.


In my view, you are approaching the problem the wrong way.

Basically there is no "real UnicodeDecodeError"; you are
just attempting to read a file with the wrong
codec. Catching a UnicodeDecodeError will not correct
the underlying problem; it will "only" show that you are using
the wrong codec.
There is still the possibility that you have to deal with
ill-formed UTF-8 data, but I doubt that is the case.

Do not forget that a "bit of text" only has a meaning if you
know its encoding.

In short, all your files are most probably OK; you are just not
reading them correctly.

>>> b'abc\xeadef'.decode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 3: invalid continuation byte
>>> # but
>>> b'abc\xeadef'.decode('cp1252')
'abcêdef'
>>> b'abc\xeadef'.decode('mac-roman')
'abcÍdef'
>>> b'abc\xeadef'.decode('iso-8859-1')
'abcêdef'
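
If the real encoding of a file is not known in advance, one possible approach is to try a short list of candidates in order. The sketch below is only an illustration: the function name decode_with_fallback and the candidate list are assumptions, not something from this thread.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# The bytes from the session above: invalid as UTF-8, valid as cp1252.
text, used = decode_with_fallback(b"abc\xeadef")
print(used, repr(text))
```

A decode that succeeds does not prove the guess was right: as the session shows, the same byte gives a different character under mac-roman. The order of the candidates matters, and a strict codec such as UTF-8 should come first.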

jmf
 