Py 2.5: Bug in sgmllib

 
 
Michael Butscher
      10-22-2006
Hi,

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



The reason is that SGMLParser.convert_codepoint converts the character
reference &#223; to the *byte* string "\xdf". Adding this byte string to
the rest of the unicode string fails.
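The failure can be reproduced without sgmllib. The following is a minimal sketch, assuming the parser ends up combining the byte string "\xdf" with unicode text: at that point Python 2 implicitly decodes the byte string as ASCII, and the explicit decode below performs the same step. The explicit form also runs on Python 3, where the implicit conversion no longer exists; the `except ... as` syntax needs Python 2.6 or later. The variable names are illustrative only.

```python
# The byte string that the character reference &#223; is converted to.
raw = b"\xdf"

try:
    # Python 2 performs this ASCII decode implicitly when a byte string
    # is combined with a unicode string.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```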


Workaround (not thoroughly tested): Override convert_codepoint in a
derived class with:

    def convert_codepoint(self, codepoint):
        # Return a unicode character instead of a byte string.
        return unichr(codepoint)
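To see why returning a unicode character helps, here is a small stand-in for the substitution step. It is a sketch, not sgmllib's actual code: `CHARREF` and `expand_charrefs` are hypothetical names, and only the `convert_codepoint` hook mirrors the method named above. The `unichr` fallback is an addition so that the same lines run on Python 3, where `chr` already returns a unicode character.

```python
import re

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character

# Hypothetical pattern for decimal character references such as &#223;.
CHARREF = re.compile(r"&#(\d+);")


def expand_charrefs(text, convert_codepoint):
    """Replace each decimal character reference using the given hook."""
    return CHARREF.sub(lambda m: convert_codepoint(int(m.group(1))), text)


# With a hook that returns unicode, the result stays a unicode string.
expanded = expand_charrefs(u"te&#223;t", unichr)
print(repr(expanded))
```

Because every piece being joined is unicode, no implicit ASCII decode is needed anywhere in the substitution.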



Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?



Michael
 
Fredrik Lundh
      10-22-2006
Michael Butscher wrote:


> if I execute the following two lines in Python 2.5 (to feed in a
> *unicode* string):
>
> import sgmllib
> sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')


source documents are encoded byte streams, not decoded Unicode
sequences. I suggest reading up on what Python's Unicode string
type is, and what a Unicode string represents. it's not the same
thing as a byte string.
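The distinction can be illustrated in a few lines. This is a sketch with illustrative variable names; the choice of ISO-8859-1 and UTF-8 is an assumption made for the example, not something prescribed in this thread. A unicode string is a sequence of characters, while a byte string is one particular encoding of it, so the same text yields different bytes under different encodings.

```python
text = u"te\u00dft"  # four characters; the third is U+00DF

latin1_bytes = text.encode("iso-8859-1")  # one byte per character here
utf8_bytes = text.encode("utf-8")         # U+00DF takes two bytes

# Decoding with the matching encoding recovers the original text.
round_trip = latin1_bytes.decode("iso-8859-1")

print(len(latin1_bytes))
print(len(utf8_bytes))
```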

</F>

 
Martin v. Löwis
      10-22-2006
Michael Butscher wrote:
> Is this a bug or is SGMLParser not meant to be used for unicode strings
> (it should be documented then)?


In a sense, SGML itself is not meant to be used for Unicode. In SGML,
the document character set is subject to the SGML application. So what
specific character a character reference refers to is also subject to
the SGML application.
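As a rough illustration of this point, the number 223 names a different character depending on which character set is assumed. This is only a sketch: the two character sets are picked arbitrarily, and Python codecs stand in for an SGML document character set.

```python
code = b"\xdf"  # the single byte with value 223

as_latin1 = code.decode("iso-8859-1")    # LATIN SMALL LETTER SHARP S
as_cyrillic = code.decode("iso-8859-5")  # CYRILLIC SMALL LETTER PE

print(as_latin1 != as_cyrillic)
```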

This entire issue is already documented; see the discussion of
convert_charref and convert_codepoint in

http://docs.python.org/lib/module-sgmllib.html

Regards,
Martin
 