2to3 chokes on bad character

 
 
Frank Millman
      02-23-2011
Hi all

I don't know if this counts as a bug in 2to3.py, but when I ran it on my
program directory it crashed, with a traceback but without any indication of
which file caused the problem.

Here is the traceback -

Traceback (most recent call last):
File "C:\Python32\Tools\Scripts\2to3.py", line 5, in <module>
sys.exit(main("lib2to3.fixes"))
File "C:\Python32\lib\lib2to3\main.py", line 172, in main
options.processes)
File "C:\Python32\lib\lib2to3\refactor.py", line 700, in refactor
items, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 294, in refactor
self.refactor_dir(dir_or_file, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 314, in refactor_dir
self.refactor_file(fullname, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 741, in refactor_file
*args, **kwargs)
File "C:\Python32\lib\lib2to3\refactor.py", line 336, in refactor_file
input, encoding = self._read_python_source(filename)
File "C:\Python32\lib\lib2to3\refactor.py", line 332, in
_read_python_source
return _from_system_newlines(f.read()), encoding
File "C:\Python32\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
invalid start byte

On investigation, I found some funny characters in docstrings that I
copy/pasted from a pdf file.

Here are the details if they are of any use. Oddly, I found two instances
where characters 'look like' apostrophes when viewed in my text editor, but
one of them was accepted by 2to3 and the other caused the crash.

The one that was accepted consists of three bytes - 226, 128, 153 (as
reported by python 2.6) or 226, 8364, 8482 (as reported by python3.2).

The one that crashed consists of a single byte - 146 (python 2.6) or 8217
(python 3.2).

The issue is not that 2to3 should handle this correctly, but that it should
give a more informative error message to the unsuspecting user.
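
Since the traceback does not name the file, one workaround is to try decoding every .py file in the directory before running 2to3. The following is only a rough sketch for Python 3.2 or later. It assumes that `tokenize.open()`, which honours the coding declaration, is a reasonable stand-in for what 2to3 does when it reads a source file; `find_undecodable` is a hypothetical helper name, not part of lib2to3.

```python
import os
import tokenize


def find_undecodable(root):
    """Yield (path, error) for each .py file under root that cannot be decoded.

    Hypothetical helper: it reads every file with its declared (or default
    UTF-8) source encoding and reports the ones that fail.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                # tokenize.open() detects the encoding from the BOM or
                # the coding cookie, then opens the file in text mode.
                with tokenize.open(path) as f:
                    f.read()
            except (UnicodeDecodeError, SyntaxError) as exc:
                # SyntaxError covers a bad or unknown coding declaration.
                yield path, exc


if __name__ == "__main__":
    for path, exc in find_undecodable("."):
        print(path, "->", exc)
```

The error message printed for each file includes the byte value and position, which is enough to find the offending character in an editor.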

Frank Millman

BTW I have always waited for 'final releases' before upgrading in the past,
but this makes me realise the importance of checking out the beta versions -
I will do so in future.


 
John Machin
      02-24-2011
On Feb 23, 7:47 pm, "Frank Millman" <(E-Mail Removed)> wrote:
> Hi all
>
> I don't know if this counts as a bug in 2to3.py, but when I ran it on my
> program directory it crashed, with a traceback but without any indication of
> which file caused the problem.
>

[traceback snipped]

> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
> invalid start byte
>
> On investigation, I found some funny characters in docstrings that I
> copy/pasted from a pdf file.
>
> Here are the details if they are of any use. Oddly, I found two instances
> where characters 'look like' apostrophes when viewed in my text editor, but
> one of them was accepted by 2to3 and the other caused the crash.
>
> The one that was accepted consists of three bytes - 226, 128, 153 (as
> reported by python 2.6)


How did you incite it to report like that? Just use repr(the_3_bytes).
It'll show up as '\xe2\x80\x99'.

>>> from unicodedata import name as ucname
>>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')
u'\u2019'
>>> ucname(_)
'RIGHT SINGLE QUOTATION MARK'

What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
QUOTATION MARK. That's OK.
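
For readers on Python 3, the same check might look roughly like the sketch below. It assumes the three values are held in a `bytes` object (the name `raw` is only for illustration) and uses `ascii()` where Python 2 would use `repr()`.

```python
import unicodedata

# The three byte values reported above, as a Python 3 bytes object.
raw = bytes([226, 128, 153])

# Decode the whole sequence in one go as UTF-8.
ch = raw.decode("utf-8")

print(ascii(ch))             # '\u2019'
print(unicodedata.name(ch))  # RIGHT SINGLE QUOTATION MARK
```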

> or 226, 8364, 8482 (as reported by python3.2).

Sorry, but you have instructed Python 3.2 to commit a nonsense:

>>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
[226, 8364, 8482]

In other words, you have taken that 3-byte sequence, decoded each byte
separately using cp1252 (aka "the usual suspect") into a meaningless
Unicode character and printed its ordinal.
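
A Python 3 sketch of the same mistake, and of how to undo it, might look like this. The variable names are illustrative only, and the repair step assumes every byte involved has a cp1252 mapping, which holds for these three.

```python
raw = bytes([226, 128, 153])

# The mistake: cp1252 is a single-byte codec, so each byte of the
# UTF-8 sequence becomes a separate, unrelated character.
mojibake = raw.decode("cp1252")
print([ord(c) for c in mojibake])   # [226, 8364, 8482]

# The repair: go back to the original bytes, then decode them as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(ascii(repaired))              # '\u2019'
```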

In Python 3, don't use repr(); it has undergone the MHTP
transformation and become ascii().

>
> The one that crashed consists of a single byte - 146 (python 2.6) or 8217
> (python 3.2).


>>> chr(146).decode('cp1252')
u'\u2019'
>>> hex(8217)
'0x2019'
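
The reason this byte crashes a UTF-8 decoder can be seen in a short Python 3 sketch (names are illustrative). 0x92 has the bit pattern 10010010, which UTF-8 reserves for continuation bytes, so it can never begin a character.

```python
lone = bytes([146])   # 0x92

# As UTF-8 the byte is rejected outright.
try:
    lone.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)                        # invalid start byte

# As cp1252 the same byte is the right single quotation mark.
as_cp1252 = lone.decode("cp1252")
print(hex(ord(as_cp1252)))           # 0x2019
```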


> The issue is not that 2to3 should handle this correctly, but that it should
> give a more informative error message to the unsuspecting user.


Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

> BTW I have always waited for 'final releases' before upgrading in the past,
> but this makes me realise the importance of checking out the beta versions -
> I will do so in future.


I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about



 
Peter Otten
      02-24-2011
John Machin wrote:

> On Feb 23, 7:47 pm, "Frank Millman" <(E-Mail Removed)> wrote:
>> Hi all
>>
>> I don't know if this counts as a bug in 2to3.py, but when I ran it on my
>> program directory it crashed, with a traceback but without any indication
>> of which file caused the problem.
>>

> [traceback snipped]
>
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
>> invalid start byte
>>
>> On investigation, I found some funny characters in docstrings that I
>> copy/pasted from a pdf file.
>>
>> Here are the details if they are of any use. Oddly, I found two instances
>> where characters 'look like' apostrophes when viewed in my text editor,
>> but one of them was accepted by 2to3 and the other caused the crash.
>>
>> The one that was accepted consists of three bytes - 226, 128, 153 (as
>> reported by python 2.6)

>
> How did you incite it to report like that? Just use repr(the_3_bytes).
> It'll show up as '\xe2\x80\x99'.
>
> >>> from unicodedata import name as ucname
> >>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')

> u'\u2019'
> >>> ucname(_)

> 'RIGHT SINGLE QUOTATION MARK'
>
> What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
> QUOTATION MARK. That's OK.
>
> or 226, 8364, 8482 (as reported by python3.2).
>
> Sorry, but you have instructed Python 3.2 to commit a nonsense:
>
> >>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]

> [226, 8364, 8482]
>
> In other words, you have taken that 3-byte sequence, decoded each byte
> separately using cp1252 (aka "the usual suspect") into a meaningless
> Unicode character and printed its ordinal.
>
> In Python 3, don't use repr(); it has undergone the MHTP
> transformation and become ascii().
>
>>
>> The one that crashed consists of a single byte - 146 (python 2.6) or 8217
>> (python 3.2).

>
> >>> chr(146).decode('cp1252')

> u'\u2019'
> >>> hex(8217)

> '0x2019'
>
>
>> The issue is not that 2to3 should handle this correctly, but that it
>> should give a more informative error message to the unsuspecting user.

>
> Your Python 2.x code should be TESTED before you poke 2to3 at it. In
> this case just trying to run or import the offending code file would
> have given an informative syntax error (you have declared the .py file
> to be encoded in UTF-8 but it's not).


The problem is that Python 2.x accepts arbitrary bytes in string constants.
No error message or warning:

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("tmp.py", "w") as f: # prepare the broken script
...     f.write("# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
...
>>>

$ cat tmp.py
# -*- coding: utf-8 -*-
print 'bogus char: �'
$ python2.6 tmp.py
bogus char: �
$ 2to3-3.2 tmp.py
[traceback snipped]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 43:
invalid start byte

In theory 2to3 could be changed to take the same approach as os.listdir(),
but, as in the OP's example, occurrences of the problem are likely to be
editing accidents.
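
As a rough illustration of that approach: on POSIX, Python 3's os.listdir() decodes file names with the "surrogateescape" error handler (PEP 383), so undecodable bytes survive a round trip. Reading the remark above as a reference to that handler is an assumption, and nothing below reflects what lib2to3 actually does.

```python
raw = b"# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n"

# Undecodable bytes become lone surrogates instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text[-4:]))     # '\udc92\'\n'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)      # True
```

The price is that `text` now contains a lone surrogate, which a strict `text.encode("utf-8")` refuses, so the bad byte is carried along rather than fixed.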
 
Frank Millman
      02-24-2011

"John Machin" <(E-Mail Removed)> wrote:
On Feb 23, 7:47 pm, "Frank Millman" <(E-Mail Removed)> wrote:

[snip lots of valuable info]

>> The issue is not that 2to3 should handle this correctly, but that it
>> should give a more informative error message to the unsuspecting user.


>Your Python 2.x code should be TESTED before you poke 2to3 at it. In
>this case just trying to run or import the offending code file would
>have given an informative syntax error (you have declared the .py file
>to be encoded in UTF-8 but it's not).


Thank you, John - this is the main lesson.

The file that caused the error has a .py extension, and looks like a python
file, but it just contains documentation. It has never been executed or
imported.

As you say, if I had tried to run it under Python 2 it would have failed
straight away. In these circumstances, it is unreasonable to expect 2to3 to
know what to do with it, so it is definitely not a bug.

>> BTW I have always waited for 'final releases' before upgrading in the
>> past, but this makes me realise the importance of checking out the beta
>> versions - I will do so in future.


>I'm willing to bet that the same would happen with Python 3.1, if a
>3.1 to 3.2 upgrade is what you are talking about


This is my first look at Python 3, so I am talking about moving from 2.6 to
3.2. In this case, it turns out that it was not a bug, but still, in future
I will run some tests when betas are released, just in case I come up with
something.

Thanks for your response - it was very useful.

Frank


 
Frank Millman
      02-24-2011

"Peter Otten" <(E-Mail Removed)> wrote
> John Machin wrote:
>
>>
>> Your Python 2.x code should be TESTED before you poke 2to3 at it. In
>> this case just trying to run or import the offending code file would
>> have given an informative syntax error (you have declared the .py file
>> to be encoded in UTF-8 but it's not).

>
> The problem is that Python 2.x accepts arbitrary bytes in string
> constants.
> No error message or warning:
>


Thanks, Peter. I saw this after I replied to John, so this somewhat
invalidates my reply.

However, John's principle still holds true, and that is the main lesson I
have taken away from this.

Frank


 
Terry Reedy
      02-24-2011
On 2/24/2011 8:11 AM, Frank Millman wrote:

> future I will run some tests when betas are released, just in case I
> come up with something.


Please do, perhaps more than once. The test suite coverage is being
improved but is not 100%. The day *after* 3.2.0 was released, someone
reported an unpleasant bug, a regression from 3.1.x. If they had tested
with the last beta or first release candidate, it would have been found
and fixed. Now it's there until 3.2.1.

--
Terry Jan Reedy

 
John Machin
      02-24-2011
On Feb 25, 12:00 am, Peter Otten <(E-Mail Removed)> wrote:
> John Machin wrote:


> > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
> > this case just trying to run or import the offending code file would
> > have given an informative syntax error (you have declared the .py file
> > to be encoded in UTF-8 but it's not).

>
> The problem is that Python 2.x accepts arbitrary bytes in string constants.


Ummm ... isn't that a bug? According to section 2.1.4 of the Python
2.7.1 Language Reference Manual: """The encoding is used for all
lexical analysis, in particular to find the end of a string, and to
interpret the contents of Unicode literals. String literals are
converted to Unicode for syntactical analysis, then converted back to
their original encoding before interpretation starts ..."""

How do you reconcile "used for all lexical analysis" and "String
literals are converted to Unicode for syntactical analysis" with the
actual (astonishing to me) behaviour?
 
Peter Otten
      02-25-2011
John Machin wrote:

> On Feb 25, 12:00 am, Peter Otten <(E-Mail Removed)> wrote:
>> John Machin wrote:

>
>> > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
>> > this case just trying to run or import the offending code file would
>> > have given an informative syntax error (you have declared the .py file
>> > to be encoded in UTF-8 but it's not).

>>
>> The problem is that Python 2.x accepts arbitrary bytes in string
>> constants.

>
> Ummm ... isn't that a bug? According to section 2.1.4 of the Python
> 2.7.1 Language Reference Manual: """The encoding is used for all
> lexical analysis, in particular to find the end of a string, and to
> interpret the contents of Unicode literals. String literals are
> converted to Unicode for syntactical analysis, then converted back to
> their original encoding before interpretation starts ..."""
>
> How do you reconcile "used for all lexical analysis" and "String
> literals are converted to Unicode for syntactical analysis" with the
> actual (astonishing to me) behaviour?


You are right, the current behaviour is probably an implementation accident
stemming from the assumption that

s.decode("utf-8").encode("utf-8") == s

always holds. Other encodings (I tried cp1252) produce the expected
SyntaxError.
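
The round-trip assumption can be tested explicitly. The sketch below is written for Python 3, where source bytes and text are distinct types; `roundtrips` is a hypothetical helper, not anything taken from the interpreter's tokenizer.

```python
def roundtrips(raw, encoding="utf-8"):
    """Return True if raw decodes cleanly and re-encodes to the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # Strict decoding (or encoding) failed, so the identity cannot hold.
        return False


print(roundtrips(b"\xe2\x80\x99"))      # True: valid UTF-8 for U+2019
print(roundtrips(b"\x92"))              # False: not valid UTF-8
print(roundtrips(b"\x92", "cp1252"))    # True: 0x92 is defined in cp1252
print(roundtrips(b"\x81", "cp1252"))    # False: 0x81 is undefined in cp1252
```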
 