Detect character encoding
Hello,
Is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a different charset (iso-8859-2, cp1250, etc.). I want to detect the encoding and convert the text to utf-8 (with the string method encode).

Thank you for any answer.
Regards,
Michal
Re: Detect character encoding
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
>
> Thank you for any answer
> Regards
> Michal

The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_ of languages.

--Scott David Daniels
scott.daniels@acm.org
Re: Detect character encoding
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).

You can only guess, e.g. by looking for words that contain umlauts. Recode might be of help here; it has such heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each file is "legal" in all encodings.

Diez
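The idea of looking for expected accented letters can be sketched as a scoring function. This is only a minimal Python 3 sketch, not Recode's heuristic: the names `score_decoding` and `guess_by_score`, the two-entry candidate list, the Czech sample alphabet, and the penalty weights are all assumptions chosen for illustration.

```python
import unicodedata

# Hypothetical sample alphabet: accented letters common in Czech text.
EXPECTED = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def score_decoding(data, encoding):
    """Score how plausible `data` looks when decoded with `encoding`.

    Returns None if the bytes cannot be decoded at all.
    """
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    score = 0
    for ch in text:
        if ch in EXPECTED:
            score += 1   # a letter we expect to see in this language
        elif unicodedata.category(ch) == "Cc" and ch not in "\t\r\n":
            score -= 5   # stray control character: strong sign of a wrong guess
        elif ord(ch) > 127:
            score -= 1   # unexpected non-ASCII character (arbitrary weight)
    return score


def guess_by_score(data, candidates=("iso-8859-2", "cp1250")):
    """Return the candidate with the highest score, or None if none decodes."""
    scored = [(score_decoding(data, enc), enc) for enc in candidates]
    scored = [(s, enc) for s, enc in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

The guess is only as good as the alphabet: the same bytes can score well under a different candidate if the expected letters are chosen for another language.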
Re: Detect character encoding
"Diez B. Roggisch" <deets@nospam.web.de> writes:
> Michal wrote:
>> is there any way how to detect string encoding in Python?
>> I need to process several files. Each of them could be encoded in
>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>> and encode it to utf-8 (with string function encode).
> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
> file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so if you encounter a character that is not in an encoding, you can eliminate that encoding from the list of possible encodings. This doesn't really help much by itself, though.

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Re: Detect character encoding
While I was thinking of a nice intro, "Michal" wrote:
> Hello,
> is there any way how to detect string encoding in Python?
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
> Thank you for any answer

Hi,
As you have already heard, you can't be sure, but you can guess.

I use a method like this:

    def guess_encoding(text):
        # Return the first charset in guess_list that decodes text, or None.
        for enc in guess_list:
            try:
                unicode(text, enc, "strict")
            except (UnicodeError, LookupError):
                continue
            return enc
        return None

'guess_list' is an ordered list of charset names like this:

    ['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course you can remove charsets you are sure you'll never find.

--
This could really be the spark that makes the drop overflow.

|\ |       |HomePage   : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org
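For readers on Python 3, where `unicode()` no longer exists and decoding starts from `bytes`, the same first-match idea might look like the sketch below. It is an adaptation, not the poster's code: the contents and order of `GUESS_LIST` and the helper `to_utf8` are assumptions made for illustration.

```python
# Hypothetical candidate list. Order matters: strict codecs go first,
# because a codec that accepts every byte (iso-8859-1, iso-8859-2)
# always succeeds and stops the search.
GUESS_LIST = ["ascii", "utf-8", "cp1250", "iso-8859-2"]


def guess_encoding(data, guess_list=GUESS_LIST):
    """Return the first encoding that decodes `data` cleanly, or None."""
    for enc in guess_list:
        try:
            data.decode(enc, "strict")
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
        return enc
    return None


def to_utf8(data, guess_list=GUESS_LIST):
    """Re-encode `data` as UTF-8 using the guessed source encoding."""
    enc = guess_encoding(data, guess_list)
    if enc is None:
        raise ValueError("no candidate encoding could decode the data")
    return data.decode(enc).encode("utf-8")
```

A successful decode only shows that the bytes are legal in that charset, not that the charset is the right one, so the result is still a guess.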
Re: Detect character encoding
You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Coo...n/Recipe/52257
"Auto-detect XML encoding" by Paul Prescod
Re: Detect character encoding
Mike Meyer wrote:
> "Diez B. Roggisch" <deets@nospam.web.de> writes:
>> Michal wrote:
>>> is there any way how to detect string encoding in Python?
>>> I need to process several files. Each of them could be encoded in
>>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>> and encode it to utf-8 (with string function encode).
>> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>> file is "legal" in all encodings.
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
>
> <mike

I read or heard (I can't remember where) that MS IE has a quite good implementation of guessing the language and character encoding of web pages when they are missing or wrongly specified. As far as I can remember, it uses an algorithm that computes some statistics for the page, compares them with statistics about all kinds of languages and encodings, and picks the most likely match.

Please be aware that I don't know if the above has even the slightest amount of truth in it; however, that didn't prevent me from posting anyway ;-)

--
mph
Re: Detect character encoding
Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language and character
Martin> encoding of web pages when they're not or falsely specified.

Gee, that's nice. Too bad the source isn't available... <0.5 wink>

Skip
Re: Detect character encoding
Mike Meyer wrote:
> "Diez B. Roggisch" <deets@nospam.web.de> writes:
>
>>Michal wrote:
>>
>>>is there any way how to detect string encoding in Python?
>>>I need to process several files. Each of them could be encoded in
>>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>>and encode it to utf-8 (with string function encode).
>>
>>But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>>file is "legal" in all encodings.
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.

----- test.py
    for enc in ["cp1250", "latin1", "iso-8859-2"]:
        print enc
        try:
            # Try to decode every possible byte value with this codec.
            str.decode("".join([chr(i) for i in xrange(256)]), enc)
        except UnicodeDecodeError, e:
            print e
-----

    192:~ deets$ python2.4 /tmp/test.py
    cp1250
    'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
    latin1
    iso-8859-2

So cp1250 doesn't have all codepoints defined, but the others do. Sure, this helps you to eliminate one of the three choices the OP wanted to choose between, but how many texts do you have that contain a byte 129?

Regards,

Diez
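The same experiment can be repeated on Python 3, where the `print` statement and `xrange` are gone. The sketch below tests each byte value on its own, so it lists every undefined byte instead of stopping at the first one. The helper name `undefined_bytes` is an assumption, and a per-byte test like this only makes sense for single-byte charsets.

```python
def undefined_bytes(encoding):
    """Return the byte values (0-255) that a single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


for enc in ["cp1250", "latin1", "iso-8859-2"]:
    # An empty list means the codec accepts any byte, so a failed decode
    # can never be used to rule that charset out.
    print(enc, [hex(b) for b in undefined_bytes(enc)])
```

Consistent with the run shown above, 0x81 appears in the list for cp1250, while latin1 and iso-8859-2 produce empty lists.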
Re: Detect character encoding
[Diez B. Roggisch]
>Michal wrote:
>> is there any way how to detect string encoding in Python?

>Recode might be of help here, it has such heuristics built in AFAIK.

If we are speaking about the same Recode ☺, there are some built-in tools that could help a human to discover a charset, but this requires work and time, and is far from fully automated as one might dream. While some charsets could be guessed almost correctly by automatic means, most are difficult to recognise. The whole problem is not easy.

--
François Pinard http://pinard.progiciels-bpi.ca