Python "robots.txt" parser broken since 2003

 
 
John Nagle
 
      04-21-2007
This bug, "[ 813986 ] robotparser interactively prompts for username and
password", has been open since 2003. It killed a big batch job of ours
last night.

Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
If the server asks for basic authentication on that file, "robotparser"
prompts for the password on standard input. Which is rarely what you
want. You can demonstrate this with:

import robotparser
url = 'http://mueblesmoraleda.com' # this site is password-protected.
parser = robotparser.RobotFileParser()
parser.set_url(url)
parser.read() # Prompts for password

That's the standard, although silly, "urllib" behavior.

This was reported in 2003, and a patch was uploaded in 2005, but the patch
never made it into Python 2.4 or 2.5.

A temporary workaround is this:

import robotparser

def prompt_user_passwd(self, host, realm):
    # Never prompt on standard input; report that no credentials are available.
    return None, None

robotparser.URLopener.prompt_user_passwd = prompt_user_passwd # temp patch
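
For readers on Python 3, where this module lives at urllib.robotparser, the same idea can be sketched without patching anything: fetch the file yourself, never prompt, and hand the text to the parser. This is a minimal sketch, not the patch from the tracker. The names load_robots and fetch are hypothetical, and treating 401/403 as "disallow everything" is an assumption that mirrors what the standard-library parser does in its own read().

```python
import urllib.error
import urllib.request
import urllib.robotparser


def load_robots(robots_url, fetch=None, timeout=10):
    """Return a RobotFileParser for robots_url without ever prompting.

    `fetch` is a hypothetical injection point: a callable that takes a URL
    and returns the body as bytes. By default urllib.request is used.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        raw = fetch(robots_url)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            # Authentication required: treat the whole site as off limits.
            parser.disallow_all = True
        elif 400 <= err.code < 500:
            # Other client errors (e.g. 404): assume there are no restrictions.
            parser.allow_all = True
        else:
            # 5xx and anything else is ambiguous; let the caller decide.
            raise
        return parser

    # Assumption: the file is UTF-8; undecodable bytes are replaced, not fatal.
    parser.parse(raw.decode("utf-8", errors="replace").splitlines())
    return parser
```

A batch job would call load_robots("http://example.com/robots.txt") and then parser.can_fetch(agent, url). Passing fetch explicitly makes the function testable without network access.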


John Nagle
 
Terry Reedy
 
      04-22-2007

"John Nagle" wrote:
| This was reported in 2003, and a patch was uploaded in 2005, but the patch
| never made it into Python 2.4 or 2.5.

If the patch is still open, perhaps you could review it.

tjr



 
John Nagle
 
      04-22-2007
Terry Reedy wrote:
> "John Nagle" wrote:
> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
> | never made it into Python 2.4 or 2.5.
>
> If the patch is still open, perhaps you could review it.

I tried it on Python 2.4 and it's in our production system now.
But someone who regularly does check-ins should do this.

John Nagle
 
Steven Bethard
 
      04-22-2007
John Nagle wrote:
> Terry Reedy wrote:
>> "John Nagle" wrote:
>> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
>> | never made it into Python 2.4 or 2.5.
>>
>> If the patch is still open, perhaps you could review it.
>
> I tried it on Python 2.4 and it's in our production system now.
> But someone who regularly does check-ins should do this.


If you post such a review (even just the short sentence above) to the
patch tracker, it often increases the chance of someone committing the
patch.

Steve
 
Nikita the Spider
 
      04-22-2007
John Nagle wrote:

> This bug, "[ 813986 ] robotparser interactively prompts for username and
> password", has been open since 2003. It killed a big batch job of ours
> last night.
>
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input. Which is rarely what you
> want. You can demonstrate this with:
>
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
>
> That's the standard, although silly, "urllib" behavior.


John,
robotparser is (IMO) suboptimal in a few other ways, too.
- It doesn't handle non-ASCII characters. (They're infrequent but when
writing a spider which sees thousands of robots.txt files in a short
time, "infrequent" can become "daily").
- It doesn't account for BOMs in robots.txt (which are rare).
- It ignores any Expires header sent with the robots.txt
- It handles some ambiguous return codes (e.g. 503) that it ought to
pass up to the caller.
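
Three of the points above (BOMs, non-ASCII bytes, the Expires header) can be illustrated roughly as follows. This is only a sketch in Python 3 against urllib.robotparser, not the code from the parser linked below. The helper names decode_robots, expires_at and parse_robots are hypothetical, and assuming UTF-8 for the file is a simplification.

```python
import codecs
import urllib.robotparser
from email.utils import parsedate_to_datetime


def decode_robots(raw):
    """Decode robots.txt bytes, dropping a UTF-8 BOM and tolerating bad bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


def expires_at(headers):
    """Return the Expires header as a datetime, or None if absent or malformed."""
    value = headers.get("Expires")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Servers often send values such as "0" or "-1"; treat them as unknown.
        return None


def parse_robots(raw):
    """Build a RobotFileParser from raw bytes fetched by the caller."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(decode_robots(raw).splitlines())
    return parser
```

A spider could cache the parser returned by parse_robots until the time given by expires_at(response.headers), and fall back to its own default lifetime when that returns None.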

I wrote my own parser to address these problems. It probably suffers
from the same urllib hang that you've found (I have not encountered it
myself) and I appreciate you posting a fix. Here's the code &
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
 
John Nagle
 
      04-22-2007
Steven Bethard wrote:
> John Nagle wrote:
>> Terry Reedy wrote:
>>> "John Nagle" wrote:
>>> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
>>> | never made it into Python 2.4 or 2.5.
>>>
>>> If the patch is still open, perhaps you could review it.
>>
>> I tried it on Python 2.4 and it's in our production system now.
>> But someone who regularly does check-ins should do this.
>
> If you post such a review (even just the short sentence above) to the
> patch tracker, it often increases the chance of someone committing the
> patch.
>
> Steve


OK, updated the tracker comments.

John Nagle
 