getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup

 
 
John Nagle
      03-13-2008
Just noticed, again, that getattr/setattr are ASCII-only, and don't support
Unicode.

SGMLlib blows up because of this when faced with a Unicode end tag:

File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
in position 46: ordinal not in range(128)

Should attributes be restricted to ASCII, or is this a bug?

John Nagle
 
Terry Reedy
      03-13-2008

"John Nagle" <(E-Mail Removed)> wrote in message
news:47d97288$0$36363$(E-Mail Removed)...
| Just noticed, again, that getattr/setattr are ASCII-only, and don't support
| Unicode.
|
| SGMLlib blows up because of this when faced with a Unicode end tag:
|
| File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
| method = getattr(self, 'end_' + tag)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
| in position 46: ordinal not in range(128)
|
| Should attributes be restricted to ASCII, or is this a bug?

Except for comments and string literals preceded by an encoding declaration,
Python code is ASCII only:
"Python uses the 7-bit ASCII character set for program text."
(Reference Manual, section 2, Lexical analysis)

This changes in 3.0
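
A minimal sketch of what that change looks like in practice, assuming a Python 3 interpreter, where attribute names are Unicode strings and non-ASCII identifiers are allowed. The class name `Handler` and the attribute names are made up for illustration.

```python
# Python 3 only: attribute names are Unicode strings.
class Handler(object):
    pass

h = Handler()

# A non-ASCII attribute name can be set and read back.
setattr(h, 'end_caf\xe9', 'closed')
value = getattr(h, 'end_caf\xe9')

# A missing non-ASCII name is an ordinary AttributeError case,
# so the three-argument form of getattr can supply a default.
missing = getattr(h, 'end_\xae', None)

# The first name is also a legal Python 3 identifier; the second is not
# (the registered-trademark sign is not an identifier character).
legal = 'end_caf\xe9'.isidentifier()
not_legal = 'end_\xae'.isidentifier()

print(value, missing, legal, not_legal)
```

Under Python 2.5 the same `getattr` call with a `unicode` name containing `u'\xae'` is what produces the `UnicodeEncodeError` shown in the traceback.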



 
John Machin
      03-13-2008
On Mar 14, 5:38 am, John Nagle <(E-Mail Removed)> wrote:
> Just noticed, again, that getattr/setattr are ASCII-only, and don't support
> Unicode.
>
> SGMLlib blows up because of this when faced with a Unicode end tag:
>
> File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
> method = getattr(self, 'end_' + tag)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
> in position 46: ordinal not in range(128)
>
> Should attributes be restricted to ASCII, or is this a bug?
>
> John Nagle


Identifiers are restricted -- see section 2.3 (Identifiers and
keywords) of the Reference Manual. The restriction is in effect that
they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
obj.nonASCIIname in your code, it makes sense for the equivalent usage
in setattr and getattr not to be available.
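
As a rough illustration of that pattern, the sketch below checks a few candidate names against it. The helper name `is_py2_identifier` and the sample names are assumptions chosen for the example, not anything from the Reference Manual.

```python
import re

# Pattern quoted above: ASCII letters, digits and underscore,
# with no leading digit.
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def is_py2_identifier(name):
    """Return True if name matches the ASCII-only identifier pattern."""
    return IDENTIFIER.match(name) is not None

# Sample names (hypothetical): two valid, one starting with a digit,
# one containing a non-ASCII character.
results = dict((n, is_py2_identifier(n))
               for n in ['end_img', '_private', '42', 'end_\xae'])
print(results)
```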

However other than forcing unicode to str, setattr and getattr seem
not to care what you use:

>>> class O(object):
...     pass
...
>>> o = O()
>>> setattr(o, '42', 'universe')
>>> getattr(o, '42')
'universe'
>>> # doesn't even need to be ASCII
>>> setattr(o, '\xff', 'notA-Za-z etc')
>>> getattr(o, '\xff')
'notA-Za-z etc'
>>>


Cheers,
John
 
John Nagle
      03-14-2008
John Machin wrote:
> Identifiers are restricted -- see section 2.3 (Identifiers and
> keywords) of the Reference Manual. The restriction is in effect that
> they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
> obj.nonASCIIname in your code, it makes sense for the equivalent usage
> in setattr and getattr not to be available.
>
> However other than forcing unicode to str, setattr and getattr seem
> not to care what you use:


OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
subclass with a function with a name such as "end_img", to be called
at the end of an "img" tag. The mechanism which implements this blows
up on any tag name that won't convert to "str", even when there are
no "end_" functions that could be relevant.

It's easy to fix in SGMLlib. It's just necessary to change

    except AttributeError:
to
    except (AttributeError, UnicodeEncodeError):

in four places. I suppose I'll have to submit a patch.
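
A minimal sketch of the kind of handler lookup involved, with both exception types caught. `TinyParser` is a hypothetical stand-in written for this example, not the real sgmllib class, and the tag names are made up. The parenthesized tuple matters: in Python 2, `except AttributeError, UnicodeEncodeError:` catches only `AttributeError` and binds the exception to the name `UnicodeEncodeError`.

```python
class TinyParser(object):
    """Hypothetical stand-in for an sgmllib-style parser (not the real class)."""

    def __init__(self):
        self.events = []

    def end_img(self):
        # Handler found by name for </img>.
        self.events.append('end_img')

    def unknown_endtag(self, tag):
        # Fallback used when no end_<tag> handler exists.
        self.events.append('unknown:' + tag)

    def finish_endtag(self, tag):
        try:
            method = getattr(self, 'end_' + tag)
        except (AttributeError, UnicodeEncodeError):
            # No handler, or (on Python 2) the name could not be
            # converted to str: treat the tag as unknown.
            self.unknown_endtag(tag)
            return
        method()

parser = TinyParser()
parser.finish_endtag('img')
parser.finish_endtag(u'caf\xae')
print(parser.events)
```

With this shape, a tag name that cannot be looked up is handled the same way as any other tag without a handler.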

John Nagle
SiteTruth

 
Carl Banks
      03-14-2008
On Mar 14, 1:53 am, John Nagle <(E-Mail Removed)> wrote:
> OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
> subclass with a function with a name such as "end_img", to be called
> at the end of an "img" tag. The mechanism which implements this blows
> up on any tag name that won't convert to "str", even when there are
> no "end_" functions that could be relevant.
>
> It's easy to fix in SGMLlib. It's just necessary to change
>
>     except AttributeError:
> to
>     except (AttributeError, UnicodeEncodeError):
>
> in four places. I suppose I'll have to submit a patch.



FWIW, the stated goal of sgmllib is to parse the subset of SGML that
HTML uses. There are no non-ASCII elements in HTML, so I'm not
certain this would be considered a bug in sgmllib.
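
If non-ASCII tag names are simply treated as invalid for HTML, one option is to screen them out before they reach the handler lookup. A rough sketch of that idea; `is_ascii_name` is a hypothetical helper, not part of sgmllib, and the sample tags are made up.

```python
def is_ascii_name(tag):
    """Return True if every character of tag is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in tag)

# Sample tag names: two ordinary HTML elements and one containing u'\xae'.
tags = ['img', 'p', u'caf\xae']
accepted = [t for t in tags if is_ascii_name(t)]
rejected = [t for t in tags if not is_ascii_name(t)]
print(accepted, rejected)
```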


Carl Banks
 