Velocity Reviews

Jaroslav Dobrek 11-07-2011 02:23 PM

read from file with mixed encodings in Python3
 
Hello,

In Python3, I often have this problem: I want to do something with
every line of a file. Like Python3, I presuppose that every line is
encoded in utf-8. If this isn't the case, I would like Python3 to do
something specific (like skipping the line, writing the line to
standard error, ...).

Like so:

try:
    ...
except UnicodeDecodeError:
    ...

Yet, there is no place for this construction. If I simply do:

for line in f:
    print(line)

this raises a UnicodeDecodeError as soon as some line is not valid
utf-8, and I can't tell Python3 to handle the error and carry on.

This will not work:

for line in f:
    try:
        print(line)
    except UnicodeDecodeError:
        ...

because the UnicodeDecodeError is raised in the "for line in f" part,
outside the try block.

How can I catch such exceptions?

Note that recoding the file before opening it is not an option,
because often files contain many different strings in many different
encodings.

Jaroslav

Dave Angel 11-07-2011 02:33 PM

Re: read from file with mixed encodings in Python3
 
A file with mixed encodings isn't a text file. So open it with 'rb'
mode, and use read() on it. Find your own line-endings, since a given
'\n' byte may or may not be a line-ending.

Once you've got something that looks like a line, explicitly decode it
using utf-8. Some lines in another encoding will raise an exception, and
some will happen to decode as valid utf-8 anyway. But perhaps you've got
some other gimmick to tell the encoding for each line.
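
The steps above could be sketched roughly as follows. This is only an
illustration, not code from the thread: the names split_and_decode and
process_file are made up, and it assumes that the bytes \n, \r and \r\n
always mark a line ending, which holds for utf-8 and Latin-1 but not for
encodings such as UTF-16.

```python
import sys


def split_and_decode(data):
    """Split raw bytes into lines and decode each one as UTF-8.

    Returns (good, bad): good holds (line_number, text) pairs, bad holds
    (line_number, raw_bytes) pairs for lines that are not valid UTF-8.
    """
    good, bad = [], []
    # Assumption: the bytes \n, \r and \r\n always mark a line ending.
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            good.append((number, raw.decode("utf-8")))
        except UnicodeDecodeError:
            bad.append((number, raw))
    return good, bad


def process_file(path):
    """Print the UTF-8 lines of a file and report the others on stderr."""
    with open(path, "rb") as f:
        good, bad = split_and_decode(f.read())
    for number, text in good:
        print(text)
    for number, raw in bad:
        print(f"line {number} is not valid UTF-8: {raw!r}", file=sys.stderr)
```

Because the file is opened in binary mode, nothing is decoded until
decode() is called explicitly, so the try/except sits exactly around the
step that can fail.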

--

DaveA


Peter Otten 11-07-2011 02:42 PM

Re: read from file with mixed encodings in Python3
 
I don't see those files often, but I think they are all seriously broken.
There's no way to recover the information from files with unknown mixed
encodings. However, here's an approach that may sometimes work:

>>> with open("tmp.txt", "rb") as f:
...     for line in f:
...         try:
...             line = "UTF-8 " + line.decode("utf-8")
...         except UnicodeDecodeError:
...             line = "Latin-1 " + line.decode("latin-1")
...         print(line, end="")
...
UTF-8 äöü
Latin-1 äöü
UTF-8 äöü
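
The same try/except fallback could be wrapped in a small helper that
takes a list of candidate encodings. This is a sketch under assumptions:
the name decode_with_fallback and its encodings parameter are
hypothetical, and the order of the candidates is a guess about the data,
not something the file can confirm.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (encoding, text) for the first encoding that decodes raw.

    Raises ValueError if every candidate encoding fails.
    """
    for encoding in encodings:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate does not fit; try the next one.
            continue
    raise ValueError(f"none of {encodings!r} can decode {raw!r}")
```

Latin-1 maps every possible byte to a character, so it never raises and
only makes sense as the last candidate. A line written in some other
8-bit encoding will still be accepted there, just with the wrong
characters, which is why this may only sometimes work.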



