Another BeautifulSoup crash on bad HTML

 
 
John Nagle
      05-15-2008
Can't really blame BeautifulSoup for this, but our crawler hit a page
("http://clagnut.com/privacy/") with an out-of-range character escape:

&#82167;

in this text:

If you provide a name, email address and/or website and choose ‘Remember
me&#82167;, these details will be stored as a cookie on your computer.

The author clearly meant "&#8217;", which is a single close quote.

The traceback as BeautifulSoup aborts:

    SGMLParser.feed(self, markup or "")
  File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/sgmllib.py", line 181, in goahead
    self.handle_charref(name)
  File "/var/www/vhosts/sitetruth.com/cgi-bin/sitetruth/BeautifulSoup.py", line 1250, in handle_charref
    data = unichr(int(ref))
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Another item in our ongoing saga of "What happens when you parse real-world
HTML".

A try block around the unichr() call in handle_charref would be appropriate.
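
For illustration, here is a rough sketch of what such a guard might look like. It is not BeautifulSoup's actual code: decode_charref is a hypothetical standalone helper, and falling back to the literal "&#...;" text is just one possible policy (dropping the reference or substituting U+FFFD would be others). The unichr/chr shim is only there so the sketch also runs on Python 3.

```python
# Sketch only: decode_charref is a hypothetical helper, not part of
# BeautifulSoup or sgmllib.
try:
    unichr            # Python 2 builtin
except NameError:
    unichr = chr      # Python 3: chr() covers the full Unicode range


def decode_charref(ref):
    """Return the character for a decimal reference such as "8217".

    If the number is malformed, or outside the range this interpreter's
    unichr() accepts, return the original "&#...;" text instead of raising.
    """
    try:
        return unichr(int(ref))
    except (ValueError, OverflowError):
        # Assumed fallback policy: keep the bad reference as literal text.
        return u"&#%s;" % ref
```

With this, decode_charref("8217") gives the close quote, while any reference that unichr() rejects on the running build comes back unchanged as text, so the parse can continue.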

John Nagle
SiteTruth
 