Velocity Reviews - Computer Hardware Reviews

uses of robots.txt

 
 
Math (Guest)
10-06-2007
Hi,

There is something I really don't understand, and I would like your
advice...

1. Some websites (for instance news.google.fr) contain a
syndication feed (like http://news.google.fr/nwshp?topic=po&output=atom).

2. These websites have a robots.txt file preventing some robots
(identified by user-agent) from indexing them.
For example, http://news.google.fr/robots.txt contains (extract):
User-agent: *
Disallow: /nwshp

3. I've developed a syndication aggregator, and I would like to
respect these robots.txt files. But as far as I can see and understand, my
user-agent isn't authorized to access /nwshp?topic=po&output=atom
because of this robots.txt...

So, is this normal? Are robots.txt files only for indexing robots?
To sum up, should my syndication aggregator respect these files or
not?
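For what it's worth, that rule can be checked mechanically. Here is a minimal sketch using Python's standard urllib.robotparser (the bot name "MyAggregator/1.0" is just a placeholder, and the robots.txt lines are the extract quoted above):

```python
from urllib.robotparser import RobotFileParser

# Extract from http://news.google.fr/robots.txt, as quoted above
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /nwshp",
])

# Every user-agent is disallowed from paths starting with /nwshp,
# which includes the Atom feed URL
print(rp.can_fetch("MyAggregator/1.0",  # hypothetical bot name
                   "http://news.google.fr/nwshp?topic=po&output=atom"))
# prints: False
```

So by a strict reading of that file, the feed URL is indeed off-limits to any robot.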

Thanks.

 
 
 
 
 
Nikita the Spider (Guest)
10-07-2007
In article <(E-Mail Removed). com>,
Math <(E-Mail Removed)> wrote:

> So, is this normal? Are robots.txt files only for indexing robots?
> To sum up, should my syndication aggregator respect these files or
> not?


Hi Math,
It's hard to say, but if they prefer to keep this content from being
copied to other sites, robots.txt is the way to do it. In other words,
you can't assume they just want to keep indexing bots out, they might
want to keep all bots out.

If your aggregator is only being used by you and a few friends, then
probably Google et al wouldn't care if your bot visits them once per
hour or so. But if you want this aggregator to be used by lots of
people, then I'd say you need to respect robots.txt.

BTW the closest thing there is to a standard for robots.txt is here:
http://www.robotstxt.org/wc/norobots-rfc.html

When describing robots, it focuses on indexing bots. But it was written
at a time when Web robots were less varied than they are now, so the
author may not have considered your case.

Good luck

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
 
 
 
 
 
Newsgroups (Guest)
10-07-2007

Thanks for your answers, Nikita the Spider.


> If your aggregator is only being used by you and a few friends,

Currently, yes ;-( but I also developed it for anybody who wants to use
it.

> But if you want this aggregator to be used by lots of
> people, then I'd say you need to respect robots.txt.

The problem is: where is the limit between "a few friends" and "lots of
people"?


> When describing robots, it focuses on indexing bots. But it was written
> at a time when Web robots were less varied then they are now, so the
> author may not have considered your case.

Yes, I agree. It's another debate, and I'm not used to reading RFCs, so
what does "Expires June 4, 1997" mean on this RFC? Does it mean that
comments are not considered after this date? If not, I could comment on
this RFC.

 
 
Ken Sims (Guest)
10-07-2007
On Sat, 06 Oct 2007 23:19:49 -0400, Nikita the Spider
<(E-Mail Removed)> wrote:

>If your aggregator is only being used by you and a few friends, then
>probably Google et al wouldn't care if your bot visits them once per
>hour or so. But if you want this aggregator to be used by lots of
>people, then I'd say you need to respect robots.txt.


I missed the original message because it was posted from Google
Gropes, but my opinion is that *all* automated software should
retrieve and respect robots.txt. I enforce it on my server by
blocking the IP addresses of bad software at the router.
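One possible shape for such a "retrieve and respect" check, sketched with Python's standard library (the user-agent string and the `allowed` helper are hypothetical; a real bot would also cache robots.txt per host and rate-limit its visits):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyAggregator/1.0"  # hypothetical bot name

def allowed(url: str) -> bool:
    """Fetch the site's robots.txt and ask whether USER_AGENT may fetch url."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # one network round-trip; cache this per host in a real bot
    return rp.can_fetch(USER_AGENT, url)
```

An aggregator would call `allowed(feed_url)` before every fetch and simply skip feeds whose sites disallow it.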

--
Ken
http://www.kensims.net/
 
 
Nikita the Spider (Guest)
10-08-2007
In article <1191750311.5505.12.camel@localhost>,
Newsgroups <(E-Mail Removed)> wrote:

> The problem is : where is the limit between "few friends" and "lots of
> people"...


That's where it gets tricky. =) But consider this -- if you obey
robots.txt 100% from the start, you'll always be doing the right thing
no matter how many people use your aggregator.

> Yes, I agree. It's another debate, and I'm not used to reading RFCs, so
> what does "Expires June 4, 1997" mean on this RFC? Does it mean that
> comments are not considered after this date? If not, I could comment on
> this RFC.


That RFC was only a draft and it expired before it was approved.
However, no other RFC governing the use of robots.txt has ever been
approved or even written, as far as I know, so that draft is the closest
thing we have to an official standard.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
 
 
Newsgroups (Guest)
10-08-2007
> That's where it gets tricky. =) But consider this -- if you obey
> robots.txt 100% from the start, you'll always be doing the right thing
> no matter how many people use your aggregator.


I agree; but if I obey robots.txt, my aggregator won't aggregate lots
of RSS feeds. Who wants to use an aggregator which does not aggregate?

For information: there are currently about 70 users of my
aggregator... It's difficult for me to recruit, but I really want
to be 100% conformant with rules and standards...

Thanks for your help and opinion.

 