read from file with mixed encodings in Python3
Hello,
in Python3, I often have this problem: I want to do something with every line of a file. Like Python3, I presuppose that every line is encoded in utf-8. If this isn't the case, I would like Python3 to do something specific (like skipping the line, writing the line to standard error, ...)

Like so:

    try:
        ...
    except UnicodeDecodeError:
        ...

Yet, there is no place for this construction. If I simply do:

    for line in f:
        print(line)

this will result in a UnicodeDecodeError if some line is not utf-8, but I can't tell Python3 to stop:

This will not work:

    for line in f:
        try:
            print(line)
        except UnicodeDecodeError:
            ...

because the UnicodeDecodeError is caused in the "for line in f"-part.

How can I catch such exceptions?

Note that recoding the file before opening it is not an option, because often files contain many different strings in many different encodings.

Jaroslav
Re: read from file with mixed encodings in Python3
On 11/07/2011 09:23 AM, Jaroslav Dobrek wrote:
> How can I catch such exceptions?
>
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.

A file with mixed encodings isn't a text file. So open it with 'rb' mode, and use read() on it. Find your own line-endings, since a given '\n' byte may or may not be a line-ending.

Once you've got something that looks like a line, explicitly decode it using utf-8. Some invalid lines will give an exception and some will not. But perhaps you've got some other gimmick to tell the encoding for each line.

--
DaveA
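The steps described above (read bytes, find the line endings yourself, decode each candidate line explicitly) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `decode_utf8_lines` is made up for this example, and it assumes that every 0x0A byte really is a line ending, which holds for UTF-8 and Latin-1 data but not for encodings such as UTF-16. Lines that are not valid UTF-8 are reported on standard error and skipped.

```python
import sys


def decode_utf8_lines(data, report=sys.stderr):
    """Yield (line_number, text) for every line of `data` that is valid UTF-8.

    `data` is a bytes object, e.g. the result of open(path, "rb").read().
    Assumption: each b"\\n" byte marks a line ending.
    Lines that fail to decode are written to `report` and skipped.
    """
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            yield number, raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # The exception is now raised inside the loop body,
            # so it can be caught per line.
            print(f"line {number}: not UTF-8 ({exc.reason}): {raw!r}",
                  file=report)


# Hypothetical sample: a UTF-8 line, a Latin-1 line, then a UTF-8 line.
sample = ("äöü\n".encode("utf-8")
          + "äöü\n".encode("latin-1")
          + "abc".encode("utf-8"))
good = list(decode_utf8_lines(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. As noted above, a line in another encoding can happen to be valid UTF-8 as well, so a successful decode does not prove the line was written as UTF-8.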
Re: read from file with mixed encodings in Python3
Jaroslav Dobrek wrote:
> How can I catch such exceptions?
>
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.

I don't see those files often, but I think they are all seriously broken. There's no way to recover the information from files with unknown mixed encodings. However, here's an approach that may sometimes work:

    >>> with open("tmp.txt", "rb") as f:
    ...     for line in f:
    ...         try:
    ...             line = "UTF-8 " + line.decode("utf-8")
    ...         except UnicodeDecodeError:
    ...             line = "Latin-1 " + line.decode("latin-1")
    ...         print(line, end="")
    ...
    UTF-8 äöü
    Latin-1 äöü
    UTF-8 äöü
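The same try-then-fall-back idea can be wrapped in a small helper that walks through a list of candidate encodings. This is a sketch, not a general solution: the name `decode_with_fallback` and the default encoding list are assumptions made for this example. Because Latin-1 maps every possible byte to a character, it never raises and therefore has to come last; a line written in some third encoding would be decoded "successfully" but wrongly.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (encoding_name, text) for the first encoding that decodes `raw`.

    `raw` is one line as bytes. The encodings are tried in the given order.
    """
    for name in encodings:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Only reachable if no catch-all encoding such as latin-1 is in the list.
    raise ValueError(f"none of {encodings!r} could decode {raw!r}")


# Hypothetical sample lines, mirroring the session above.
lines = ["äöü\n".encode("utf-8"), "äöü\n".encode("latin-1")]
results = [decode_with_fallback(raw) for raw in lines]
```

With a file opened in "rb" mode, the helper would be called once per line inside the `for line in f:` loop.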