Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   latin1 and cp1252 inconsistent? (http://www.velocityreviews.com/forums/t954566-latin1-and-cp1252-inconsistent.html)

buck@yelp.com 11-16-2012 09:44 PM

latin1 and cp1252 inconsistent?
 
Latin1 has a block of 32 undefined characters.
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

The byte 0x81 decoded with latin1 yields the code point U+0081.
Decoding the same byte with windows-1252 raises `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>`.

This seems inconsistent to me, given that this byte is equally undefined in the two standards.

Also, the html5 standard says:

When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].

http://www.whatwg.org/specs/web-apps...er-encodings-0


The current implementation of windows-1252 isn't usable for this purpose (a replacement for latin1), since it will raise an error in cases where latin1 would succeed.
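To make the mismatch concrete, here's a minimal demonstration (Python 3):

```python
raw = b'\x81'

# latin1 maps every byte straight through to the code point of
# equal value, so this always succeeds:
print(raw.decode('latin1'))  # '\x81', i.e. U+0081

# cp1252 leaves 0x81 undefined, so the very same byte raises:
try:
    raw.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)
```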

Ian Kelly 11-16-2012 10:33 PM

Re: latin1 and cp1252 inconsistent?
 
On Fri, Nov 16, 2012 at 2:44 PM, <buck@yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.


These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D


In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps...er-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.


You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello world'

buck@yelp.com 11-16-2012 11:27 PM

Re: latin1 and cp1252 inconsistent?
 
On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
> On Fri, Nov 16, 2012 at 2:44 PM, <buck> wrote:
> > Latin1 has a block of 32 undefined characters.
>
> These characters are not undefined. 0x80-0x9f are the C1 control
> codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
> their Unicode mappings are well defined.

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.


However, it's reasonable for 0x81 to decode to U+0081, because the Unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.


> You can use a non-strict error handling scheme to prevent the error.
> >>> b'hello \x81 world'.decode('cp1252', 'replace')
> 'hello \ufffd world'


This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.

Dave Angel 11-17-2012 12:05 AM

Re: latin1 and cp1252 inconsistent?
 
On 11/16/2012 06:27 PM, buck@yelp.com wrote:
> (double-spaced nonsense deleted. GoogleGroups strikes again.)
> This creates a non-reversible encoding, and loss of data, which isn't
> acceptable for my application.


So tell us more about your application. If you have data which is
invalid, and you encode it to some other form, you have to expect that
it won't be reversible. But maybe your data isn't really characters at
all, and you're just trying to manipulate bytes?

Without a use case, we really can't guess. The fact that you are
waffling between latin1 and 1252 indicates this isn't really character data.

Also, while you're at it, please specify the Python version and OS
you're on. You haven't given us any code to guess it from.

--

DaveA


Ian Kelly 11-17-2012 12:20 AM

Re: latin1 and cp1252 inconsistent?
 
On Fri, Nov 16, 2012 at 4:27 PM, <buck@yelp.com> wrote:
> They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
>
> """ The shaded positions in the code table correspond
> to bit combinations that do not represent graphic
> characters. Their use is outside the scope of
> ISO/IEC 8859; it is specified in other International
> Standards, for example ISO/IEC 6429.


It gets murkier than that. I don't want to spend time hunting down
the relevant documents, so I'll just quote from Wikipedia:

"""
In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
unassigned code values thus provides for 256 characters via every
possible 8-bit value.
"""

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History

>> You can use a non-strict error handling scheme to prevent the error.
>> >>> b'hello \x81 world'.decode('cp1252', 'replace')
>> 'hello \ufffd world'
>
> This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.


Well, what characters would you have these bytes decode to,
considering that they're undefined? If the string is really CP-1252,
then the presence of undefined characters in the document does not
signify "data". They're just junk bytes, possibly indicative of data
corruption. If on the other hand the string is really Latin-1, and
you *know* that it is Latin-1, then you should probably forget the
aliasing recommendation and just decode it as Latin-1.

Apparently this Latin-1 -> CP-1252 encoding aliasing is already
commonly performed by modern user agents. What do IE and Firefox do
when presented with a Latin-1 encoding and undefined CP-1252 codings?

Nobody 11-17-2012 12:33 AM

Re: latin1 and cp1252 inconsistent?
 
On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:

> When a user agent [browser] would otherwise use a character encoding given
> in the first column [ISO-8859-1, aka latin1] of the following table to
> either convert content to Unicode characters or convert Unicode characters
> to bytes, it must instead use the encoding given in the cell in the second
> column of the same row [windows-1252, aka cp1252].


It goes on to say:

The requirement to treat certain encodings as other encodings according
to the table above is a willful violation of the W3C Character Model
specification, motivated by a desire for compatibility with legacy
content. [CHARMOD]

IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.

Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.

If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.
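A rough sketch of that post-processing step (the helper name `decode_lenient_1252` is made up; the translation table is derived from the cp1252 codec itself, so the five undefined bytes survive unchanged as C1 controls):

```python
# Map the C1 code points (U+0080..U+009F, which is what latin1 decodes
# those bytes to) onto whatever cp1252 assigns to the same byte value,
# skipping the five positions cp1252 leaves undefined.
C1_TO_CP1252 = {}
for byte in range(0x80, 0xA0):
    try:
        C1_TO_CP1252[byte] = bytes([byte]).decode('cp1252')
    except UnicodeDecodeError:
        pass  # 0x81, 0x8D, 0x8F, 0x90, 0x9D stay as C1 controls

def decode_lenient_1252(data):
    """Decode as latin1, then upgrade C1 codes to their cp1252 meanings."""
    return data.decode('latin1').translate(C1_TO_CP1252)

print(decode_lenient_1252(b'\x93quoted\x94 \x81'))  # '\u201cquoted\u201d \x81'
```

This never raises and loses no data: defined cp1252 bytes get their cp1252 meaning, and the five undefined ones fall back to latin1's byte-for-byte mapping.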


Ian Kelly 11-17-2012 01:08 AM

Re: latin1 and cp1252 inconsistent?
 
On Fri, Nov 16, 2012 at 5:33 PM, Nobody <nobody@nowhere.com> wrote:
> If you need to support either, you can parse it as ISO-8859-1 then
> explicitly convert C1 codes to their Windows-1252 equivalents as a
> post-processing step, e.g. using the .translate() method.


Or just create a custom codec by taking the one in
Lib/encodings/cp1252.py and modifying it slightly.


>>> import codecs
>>> import cp1252a
>>> codecs.register(lambda n: cp1252a.getregentry() if n == "cp1252a" else None)
>>> b'\x81\x8d\x8f\x90\x9d'.decode('cp1252a')
'♕♖♗♘♙'
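For completeness, roughly the same modification can be sketched inline, without a separate module, by copying the table from Lib/encodings/cp1252.py and filling the five UNDEFINED slots with the C1 code points of equal value (the function name `decode_loose` is invented):

```python
import codecs
from encodings import cp1252

# Copy cp1252's decoding table and replace the five UNDEFINED slots
# (marked '\ufffe' in the stdlib table) with matching C1 control codes.
table = list(cp1252.decoding_table)
for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    table[byte] = chr(byte)
LOOSE_TABLE = ''.join(table)

def decode_loose(data, errors='strict'):
    """charmap decode against the patched table; returns (str, length)."""
    return codecs.charmap_decode(data, errors, LOOSE_TABLE)

print(decode_loose(b'\x80\x81')[0])  # '\u20ac\x81'
```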

buck@yelp.com 11-17-2012 04:56 PM

Re: latin1 and cp1252 inconsistent?
 
On Friday, November 16, 2012 4:33:14 PM UTC-8, Nobody wrote:
> On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:
> IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
> successful and now we have to deal with it. If HTML content is tagged as
> using ISO-8859-1, it's more likely that it's actually Windows-1252 content
> generated by someone who doesn't know the difference.


Yes that's exactly what it says.

> Given that the only differences between the two are for code points which
> are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
> ISO-8859-1 as Windows-1252 should be harmless.


"Should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1 but will raise a traceback with cp1252. I see this as a sign that a small refactoring of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-f...ndows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPING...estfit1252.txt

This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
> equal to its corresponding Unicode code point.


IOW: bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (§6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)
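In the meantime, the proposed behavior can be approximated today without touching the codec, using a custom error handler (the handler name 'c1control' is made up) that falls back to latin1's byte-for-byte mapping only for the bytes cp1252 leaves undefined:

```python
import codecs

def c1_fallback(exc):
    """On an undefined cp1252 byte, substitute the C1 control code
    point of equal value (i.e. what latin1 would have produced)."""
    if isinstance(exc, UnicodeDecodeError):
        return exc.object[exc.start:exc.end].decode('latin1'), exc.end
    raise exc

codecs.register_error('c1control', c1_fallback)

# Defined bytes get their cp1252 meaning; 0x81 survives as U+0081,
# so the decoding is lossless and reversible:
print(b'smart \x93quote\x94 \x81'.decode('cp1252', 'c1control'))
```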

Ian Kelly 11-17-2012 06:08 PM

Re: latin1 and cp1252 inconsistent?
 
On Sat, Nov 17, 2012 at 9:56 AM, <buck@yelp.com> wrote:
> "Should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1 but will raise a traceback with cp1252. I see this as a sign that a small refactoring of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:
>
> http://dvcs.w3.org/hg/encoding/raw-f...ndows-1252.txt
>
> and here:
>
> ftp://ftp.unicode.org/Public/MAPPING...estfit1252.txt


The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme. It is also noted that the "BestFit" document is
not authoritative at:

http://www.iana.org/assignments/char...g/windows-1252


> This is in line with the Unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
>> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
>> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
>> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
>> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
>> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
>> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
>> equal to its corresponding Unicode code point.

>
> IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the unicode-point of equal value.
>
> This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (§6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)


But Latin-1 explicitly defers to the control codes for those
characters. CP-1252 does not; the reason those characters are left
undefined is to allow for future expansion, such as when Microsoft
added the Euro sign at 0x80.

Since we're talking about conversion from bytes to Unicode, I think
the most authoritative source we could possibly reference would be the
official ISO 10646 conversion tables for the character sets in
question. I understand those are to be found here:

http://www.unicode.org/Public/MAPPIN...859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
the cp1252 mapping leaves those five codes undefined. This would seem
to indicate that Python is correctly decoding CP-1252 according to the
Unicode standard.
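A quick check against the Python codecs bears this out (an empirical sanity check of the stdlib, not proof of either standard's intent):

```python
# cp1252 should leave exactly five byte values undefined, matching
# the mapping table cited above.
undefined = []
for b in range(256):
    try:
        bytes([b]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(b)

print([hex(b) for b in undefined])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# latin1 never fails, and always maps byte b to code point U+00<b>:
assert all(bytes([b]).decode('latin1') == chr(b) for b in range(256))
```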

