Codec lookup fails for bad codec name, blowing up BeautifulSoup

 
 
John Nagle
      11-09-2007
I just had our web page parser fail on "www.nasa.gov".
It seems that NASA returns an HTTP header with a charset of ".utf8", which
is non-standard. This goes into BeautifulSoup, which blows up trying to
find a suitable codec.

This happens because BeautifulSoup does this:

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except LookupError:
            pass
        return codec

The documentation for codecs.lookup says:

lookup(encoding)
Looks up a codec tuple in the Python codec registry and returns
the function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned.
If no codecs tuple is found, a LookupError is raised.

So BeautifulSoup's lookup ought to be safe, right? Wrong.
What actually happens is a ValueError exception:

  File "./sitetruth/BeautifulSoup.py", line 1770, in _codec
    codecs.lookup(charset)
  File "/usr/local/lib/python2.5/encodings/__init__.py", line 97, in search_function
    globals(), locals(), _import_tail)
ValueError: Empty module name
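
For anyone who wants to check their own interpreter, here is a minimal probe. The helper name probe_lookup is made up for this sketch and is not part of codecs or BeautifulSoup. It only reports which exception class, if any, codecs.lookup raises for a given name. The traceback above is from Python 2.5; other versions may well answer differently for ".utf8", so no particular result is claimed here.

```python
import codecs


def probe_lookup(charset):
    """Return the name of the exception class raised by codecs.lookup(charset),
    or None if the lookup succeeds. (Hypothetical helper, for diagnosis only.)"""
    try:
        codecs.lookup(charset)
    except Exception as exc:  # deliberately broad: we want to see what is raised
        return type(exc).__name__
    return None


if __name__ == "__main__":
    # ".utf8" is the charset value reported in this thread.
    for name in ("utf-8", ".utf8", "no-such-codec-xyz"):
        print(repr(name), "->", probe_lookup(name))
```

If the probe prints anything other than LookupError for a bad name, code that catches only LookupError is exposed to the same failure.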

This is a known bug. It's in the old tracker on SourceForge:
[ python-Bugs-960874 ] codecs.lookup can raise exceptions other
than LookupError
but not in the new tracker.

The "resolution" back in 2004 was "Won't Fix", without a change
to the documentation. Grrr.

Patched BeautifulSoup to work around the problem:

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec
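
The same idea as a standalone function, for code that does not go through BeautifulSoup. This is only a sketch: the name usable_codec and the fallback parameter are assumptions made for illustration. Unlike the patch above, it returns the codec's canonical name (the name attribute of the CodecInfo returned by codecs.lookup) rather than the original string.

```python
import codecs


def usable_codec(charset, fallback=None):
    """Return the canonical codec name for charset, or fallback if Python
    cannot resolve it. (Hypothetical helper, not part of BeautifulSoup.)"""
    if not charset:
        return fallback
    try:
        return codecs.lookup(charset).name
    except (LookupError, ValueError):
        # LookupError is the documented failure; ValueError is what the
        # Python 2.5 traceback in this thread shows for ".utf8".
        return fallback


if __name__ == "__main__":
    print(usable_codec("utf8"))
    print(usable_codec(".utf8", fallback="utf-8"))
```

Passing a fallback lets the caller choose what to do with a bogus header value instead of letting the exception escape from the parser.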


John Nagle
 
Waldemar Osuch
      11-09-2007

>
> This is a known bug. It's in the old tracker on SourceForge:
> [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
> than LookupError
> but not in the new tracker.


The new tracker has it too.
http://bugs.python.org/issue960874

>
> The "resolution" back in 2004 was "Won't Fix", without a change
> to the documentation. Grrr.
>


 
John Nagle
      11-09-2007
Waldemar Osuch wrote:
>> This is a known bug. It's in the old tracker on SourceForge:
>> [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
>> than LookupError
>> but not in the new tracker.

>
> The new tracker has it too.
> http://bugs.python.org/issue960874


How did you find that? I put "codecs.lookup" into the tracker's
search box, and it returned five hits, but not that one.

John Nagle
 
Waldemar Osuch
      11-10-2007
On Nov 9, 4:15 pm, John Nagle wrote:
> Waldemar Osuch wrote:
> >> This is a known bug. It's in the old tracker on SourceForge:
> >> [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
> >> than LookupError
> >> but not in the new tracker.

>
> > The new tracker has it too.
> >http://bugs.python.org/issue960874

>
> How did you find that? I put "codecs.lookup" into the tracker's
> search box, and it returned five hits, but not that one.
>
> John Nagle


I have seen this explained on this list once:
http://bugs.python.org/issue + <SourceForge bug id>
points to the converted ticket.
And yes, the search could be better.

 