latin1 and cp1252 inconsistent?

 
 
Ian Kelly
 
      11-17-2012
On Sat, Nov 17, 2012 at 11:08 AM, Ian Kelly <(E-Mail Removed)> wrote:
> On Sat, Nov 17, 2012 at 9:56 AM, <(E-Mail Removed)> wrote:
>> "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:
>>
>> http://dvcs.w3.org/hg/encoding/raw-f...ndows-1252.txt
>>
>> and here:
>>
>> ftp://ftp.unicode.org/Public/MAPPING...estfit1252.txt
>
> The README for the "BestFit" document states:
>
> """
> These tables include "best fit" behavior which is not present in the
> other files. Examples of best fit
> are converting fullwidth letters to their counterparts when converting
> to single byte code pages, and
> mapping the Infinity character to the number 8.
> """
>
> This does not sound like appropriate behavior for a generalized
> conversion scheme. It is also noted that the "BestFit" document is
> not authoritative at:
>
> http://www.iana.org/assignments/char...g/windows-1252


I meant to also comment on the first link, but forgot. As that
document is published by the W3C, I understand it to be specific to
the Web, which Python is not. Hence I think the more general Unicode
specification is more appropriate for Python.
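
For anyone who wants to see the inconsistency under discussion, here is a minimal sketch, assuming Python 3 and its bundled `latin1` and `cp1252` codecs. It only lists the byte values that decode under one codec and fail under the other; the variable name is mine, not anything from the codecs themselves.

```python
# Collect the byte values that Python's latin1 codec accepts
# but its cp1252 codec rejects.
undefined_in_cp1252 = []
for value in range(256):
    raw = bytes([value])
    raw.decode("latin1")  # never raises: latin1 maps all 256 byte values
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

print([hex(v) for v in undefined_in_cp1252])
# → ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

These five positions are the "UNDEFINED" entries the proposal wants to turn into control characters.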
 
 
 
 
 
Nobody
 
      11-17-2012
On Sat, 17 Nov 2012 08:56:46 -0800, buck wrote:

>> Given that the only differences between the two are for code points
>> which are in the C1 range (0x80-0x9F), which should never occur in HTML,
>> parsing ISO-8859-1 as Windows-1252 should be harmless.
>
> "should" is a wish. The reality is that documents (and especially URLs)
> exist that can be decoded with latin1, but will backtrace with cp1252.


In which case, they're probably neither ISO-8859-1 nor Windows-1252, but
some other (unknown) encoding which has acquired the ISO-8859-1 label
"by default".

In that situation, if you still need to know the encoding, you need to
resort to heuristics such as those employed by the chardet library.
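
As a rough illustration of the simplest possible heuristic (this is not what chardet does, which relies on statistical models), one can try a list of candidate encodings in order and keep the first that decodes cleanly. The function name `decode_with_fallback` and the candidate order are assumptions made for this sketch, and it assumes Python 3.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate order is an assumption, not a
    recommendation from any library.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reachable if the candidates omit a total codec such as latin1.
    raise ValueError("no candidate encoding could decode the input")


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # not UTF-8, but valid cp1252
print(decode_with_fallback(b"\x81"))         # only latin1 accepts this byte
```

Because latin1 accepts every byte, putting it last guarantees a result, but that result is only a guess at the real encoding.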

 
 
 
 
 
Dennis Lee Bieber
 
      11-18-2012
On Fri, 16 Nov 2012 15:27:54 -0800 (PST), (E-Mail Removed) declaimed the
following in gmane.comp.python.general:

> On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
> > On Fri, Nov 16, 2012 at 2:44 PM, <buck> wrote:
> >
> > > Latin1 has a block of 32 undefined characters.
> >
> >
> > These characters are not undefined. 0x80-0x9f are the C1 control
> > codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
> > their Unicode mappings are well defined.
>
> They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
>
> """ The shaded positions in the code table correspond
> to bit combinations that do not represent graphic
> characters. Their use is outside the scope of
> ISO/IEC 8859; it is specified in other International
> Standards, for example ISO/IEC 6429.
> """

This quote only states that those positions do not represent
displayable glyphs, and indicates that ISO/IEC 8859 is concerned only
with codings for display. It does NOT say they are "undefined".
--
Wulfraed Dennis Lee Bieber AF6VN
(E-Mail Removed) HTTP://wlfraed.home.netcom.com/
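
The point about the C1 range can be checked from Python itself. The following is a small sketch, assuming Python 3 and the standard `unicodedata` module; it shows how Python's `latin1` codec treats bytes 0x80-0x9F and says nothing about what ISO/IEC 8859-1 itself specifies.

```python
import unicodedata

# Decode every byte in the C1 range with Python's latin1 codec.
c1_bytes = bytes(range(0x80, 0xA0))
decoded = c1_bytes.decode("latin1")

# Each byte maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]

# Unicode classifies all of them as control characters (category "Cc").
categories = {unicodedata.category(ch) for ch in decoded}

print(code_points == list(range(0x80, 0xA0)))  # → True
print(categories)                              # → {'Cc'}
```

So in Python these positions decode to non-graphic control characters, and they round-trip back to the same bytes when encoded with `latin1`.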

 