Velocity Reviews


David Eppstein 11-03-2003 01:36 AM

Distinguishing cp850 and cp1252?
 
I'm working on some Python code for reading files in a certain format,
and the examples of such files I've found on the internet appear to be
in either cp850 or cp1252 encoding (with one exception, for which I
can't find a correct encoding among the standard Python ones).

The file format itself includes nothing about which encoding is used,
but only one of the two produces sensible results in the non-ascii
examples I've seen.

Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science

John Roth 11-03-2003 02:35 AM

Re: Distinguishing cp850 and cp1252?
 

"David Eppstein" <eppstein@ics.uci.edu> wrote in message
news:eppstein-FD3246.17361302112003@news.service.uci.edu...
> I'm working on some Python code for reading files in a certain format,
> and the examples of such files I've found on the internet appear to be
> in either cp850 or cp1252 encoding (except for one exception for which I
> can't find a correct encoding among the standard Python ones).
>
> The file format itself includes nothing about which encoding is used,
> but only one of the two produces sensible results in the non-ascii
> examples I've seen.
>
> Is there an easy way of guessing with reasonable accuracy which of these
> two encodings was used for a particular file?


The only way I know of is to do a statistical analysis on letter
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box
drawing characters. If your data doesn't involve drawing boxes,
and you find those characters when you decode the input as CP850,
I'd say that's a strong clue that you're really dealing with CP1252.

I know this doesn't help all that much, but it's the only thing
that has worked for me.
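
A minimal sketch of the box-drawing check described above, assuming modern Python 3 and its standard cp850 codec. The function name count_box_chars is hypothetical, and treating the Unicode ranges U+2500..U+259F (box drawing and block elements) as "box characters" is an assumption of this sketch, not something stated in the post.

```python
def count_box_chars(data: bytes) -> int:
    """Count characters that look like box-drawing or block elements
    when the raw bytes are decoded as cp850 (hypothetical helper)."""
    text = data.decode("cp850")  # cp850 defines all 256 byte values
    # U+2500..U+257F is Box Drawing, U+2580..U+259F is Block Elements.
    return sum(1 for ch in text if "\u2500" <= ch <= "\u259f")


# Byte 0xC4 is "Ä" in cp1252 but a horizontal box line in cp850,
# so text that was really written in cp1252 shows up as a box character.
suspect = count_box_chars(b"\xc4rger")
```

A nonzero count in a file that should contain only ordinary text suggests the bytes were not written in cp850.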

John Roth




Martin v. Löwis 11-03-2003 02:37 AM

Re: Distinguishing cp850 and cp1252?
 
David Eppstein wrote:

> Is there an easy way of guessing with reasonable accuracy which of these
> two encodings was used for a particular file?


You could try the assumption that most characters should be letters,
assuming your documents are likely text documents of some sort. The idea
is that what is a letter in one code is some non-letter graphical symbol
in the other.

So you would create a predicate "isletter" for each character set, and
then count the number of bytes in a document which are not letters. You
should probably exclude the ASCII characters in counting, since they
would have the same interpretation in either code. The encoding that
gives you fewer (or no) non-letter characters is likely the correct
interpretation.

To find out which bytes are letters, you could use unicodedata.category;
letter categories start with "L" (for example "Ll" for lowercase and
"Lu" for uppercase). You should compute a bitmap for each character set
up-front, and you should find out what the overlap in set bits is.
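
A rough sketch of this idea, assuming modern Python 3. The names letter_table, LETTERS, AMBIGUOUS, non_letter_count and guess_encoding are hypothetical, and a frozenset of byte values stands in for the bitmap. Bytes that the codec leaves undefined (cp1252 has a few) are treated as non-letters, which is an assumption of this sketch.

```python
import unicodedata


def letter_table(encoding: str) -> frozenset:
    """Byte values in 0x80..0xFF that decode to a letter in `encoding`."""
    table = set()
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            continue  # undefined byte: count it as a non-letter
        if unicodedata.category(ch).startswith("L"):
            table.add(b)
    return frozenset(table)


# Computed once, up-front, for each candidate character set.
LETTERS = {enc: letter_table(enc) for enc in ("cp850", "cp1252")}

# Bytes that are letters in both encodings carry no information.
AMBIGUOUS = LETTERS["cp850"] & LETTERS["cp1252"]


def non_letter_count(data: bytes, encoding: str) -> int:
    """Count non-ASCII bytes that are not letters under `encoding`."""
    letters = LETTERS[encoding]
    return sum(1 for b in data if b >= 0x80 and b not in letters)


def guess_encoding(data: bytes) -> str:
    """Pick the encoding with the fewest non-letter bytes.
    On a tie the first candidate (cp850) wins, which is arbitrary."""
    counts = {enc: non_letter_count(data, enc) for enc in LETTERS}
    return min(counts, key=counts.get)
```

For example, byte 0x82 is "é" in cp850 but a low quotation mark in cp1252, so guess_encoding(b"caf\x82") favors cp850. A file made up only of bytes in AMBIGUOUS cannot be decided this way.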

To get a higher accuracy, you need advance knowledge on the natural
language your documents are in, and then you need to use a dictionary
of that language.

HTH,
Martin


David Eppstein 11-03-2003 05:47 AM

Re: Distinguishing cp850 and cp1252?
 
In article <vqbfqr373nfa0c@news.supernews.com>,
"John Roth" <newsgroups@jhrothjr.com> wrote:

> > Is there an easy way of guessing with reasonable accuracy which of these
> > two encodings was used for a particular file?

>
> The only way I know of is to do a statistical analysis on letter
> frequencies. To do that, you have to know your data fairly well.
> For example, CP850 has a number of characters devoted to box
> drawing characters. If your data doesn't involve drawing boxes,
> and you find those characters in the input, I'd say that's a strong
> clue that you're dealing with CP1252.


Thanks. After trying some other, more hackish things which sort of
worked (e.g. does the encoding lead to Unicode characters with
ord > 255?), I settled on a very simple statistical scheme: for each
encoding, count how many of the decoded characters answer true to
isalpha(), and pick the encoding with the most votes. Seems to be
working...
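
A sketch of what such a voting scheme might look like in modern Python 3. This is not the code from the post: the names alpha_votes and guess_by_isalpha are hypothetical, and decoding with errors="replace" (so that bytes undefined in cp1252 become U+FFFD and earn no vote) is an assumption.

```python
def alpha_votes(data: bytes, encoding: str) -> int:
    """Number of decoded characters for which isalpha() is true."""
    # Undefined bytes become U+FFFD, which is not alphabetic.
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch.isalpha())


def guess_by_isalpha(data: bytes, candidates=("cp850", "cp1252")) -> str:
    """Return the candidate encoding with the most isalpha() votes.
    On a tie the first candidate wins, which is arbitrary."""
    return max(candidates, key=lambda enc: alpha_votes(data, enc))
```

ASCII characters decode identically under both encodings, so they add the same number of votes to each side and do not affect which one wins.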

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science

