robot.txt

 
 
PeterMcC
      06-28-2003
David Graham wrote:
> "Denise Enck" <(E-Mail Removed)> wrote in message
> news:tQiLa.69023$(E-Mail Removed) thlink.net...
>> "David Graham" <(E-Mail Removed)> wrote in message
>> news:n6eLa.339$(E-Mail Removed)...
>>> Hi
>>> I have a folder on my site that I use to practice on, I don't want
>>> robots indexing this folder. I believe the meta tag is not as good
>>> as a robot.txt file. I would like to use a robot.txt file but...
>>>
>>> 1. What is the syntax of the line that I write to prevent access to
>>> a folder (the folder is called 'sefriendly' and it lives off the
>>> root folder, which is called 'www')
>>>
>>> 2. In which folder is the robot.txt file stored?
>>>
>>> thanks
>>>
>>> David
>>>

>>
>>
>> the file should be called robots.txt rather than robot.txt else it
>> won't keep any spiders out ~
>>
>> Denise
>>

> Thanks loads - didn't know it had to have the 's' on the name


Ooops - picked up the "robot.txt" from the OP and it didn't register.
Thanks, Denise.

--
PeterMcC
If you feel that any of the above is incorrect,
inappropriate or offensive in any way,
please ignore it and accept my apologies.

 
Jukka K. Korpela
      06-29-2003
Headless <(E-Mail Removed)> wrote:

> "Jukka K. Korpela" <(E-Mail Removed)> wrote:
>
>>> the file should be called robots.txt rather than robot.txt else
>>> it won't keep any spiders out ~

>>
>>Besides, it needs to reside in the _server root_. Normal authors
>>have no access to it, unless they run their own server.

>
> That would be silly and it would make the concept practically
> unusable.


_What_ would be silly? The robots.txt concept _is_ defined the way I
described, both in the HTML specification I referred to and in the
"Robots Exclusion Standard".

> I'm on a bog standard shared Apache user web space provided with my
> dial account (so virtual root). Using a robots.txt works fine (I
> can see that it works because I use Atomz site search on one of my
> sites, it echos back the robots.txt exclusions as it indexes the
> site).


What you see is what the Atomz software does. Everyone and his dog or
search system may use a name like robots.txt, or robot.txt, or
foo.bar for some private purposes. But that's _not_ what the Robots
Exclusion Standard for the World Wide Web means.

Don't get lured by statements of compliance. On the average, any
statement about complying with some standard is bogus.

If Atomz actually uses robots.txt other than at the server root, then
http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
mildly. It says: "Yes, Atomz Search is compliant with the Robots
Exclusion Protocol and it will examine the robots.txt file if it is
present on your site." and refers to common resources on that
protocol/standard. And those resources make it clear that robots.txt is
_server-wide_, residing at address /robots.txt. In particular,
http://www.robotstxt.org/wc/faq.html#noindex
says:
"What if I can't make a /robots.txt file?
Sometimes you cannot make a /robots.txt file, because you don't
administer the entire server. All is not lost: there is a new standard
for using HTML META tags to keep robots out of your documents. - -"

(Of course, "sometimes" and "new" are somewhat funny words in this
context.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
Headless
      06-29-2003
"Jukka K. Korpela" <(E-Mail Removed)> wrote:

>>>Besides, it needs to reside in the _server root_. Normal authors
>>>have no access to it, unless they run their own server.

>>
>> That would be silly and it would make the concept practically
>> unusable.

>
>_What_ would be silly? The robots.txt concept _is_ defined the way I
>described, both in the HTML specification I referred to and in the
>"Robots Exclusion Standard".


Afaics you read too much into references to "/" and "only a server
administrator can maintain such a list". "/" refers to the root of my
web space, and I am the "server administrator" (virtually).

Afaik there is no way for a robot to access the physical server root (as
opposed to the virtual server root).


Headless

 
David Graham
      06-29-2003

"Jukka K. Korpela" <(E-Mail Removed)> wrote in message
news:Xns93A962B0A5029jkorpelacstutfi@193.229.0.31. ..
> Headless <(E-Mail Removed)> wrote:
>
> > "Jukka K. Korpela" <(E-Mail Removed)> wrote:
> >
> >>> the file should be called robots.txt rather than robot.txt else
> >>> it won't keep any spiders out ~
> >>
> >>Besides, it needs to reside in the _server root_. Normal authors
> >>have no access to it, unless they run their own server.

> >
> > That would be silly and it would make the concept practically
> > unusable.

>
> _What_ would be silly? The robots.txt concept _is_ defined the way I
> described, both in the HTML specification I referred to and in the
> "Robots Exclusion Standard".
>
> > I'm on a bog standard shared Apache user web space provided with my
> > dial account (so virtual root). Using a robots.txt works fine (I
> > can see that it works because I use Atomz site search on one of my
> > sites, it echos back the robots.txt exclusions as it indexes the
> > site).

>
> What you see is what the Atomz software does. Everyone and his dog or
> search system may use a name like robots.txt, or robot.txt, or
> foo.bar for some private purposes. But that's _not_ what the Robots
> Exclusion Standard for the World Wide Web means.
>
> Don't get lured by statements of compliance. On the average, any
> statement about complying with some standard is bogus.
>
> If Atomz actually uses robots.txt other than at the server root, then
> http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
> mildly. It says: "Yes, Atomz Search is compliant with the Robots
> Exclusion Protocol and it will examine the robots.txt file if it is
> present on your site." and refers to common resources on that
> protocol/standard. And those resources make it clear that robots.txt is
> _server-wide_, residing at address /robots.txt. In particular,
> http://www.robotstxt.org/wc/faq.html#noindex
> says:
> "What if I can't make a /robots.txt file?
> Sometimes you cannot make a /robots.txt file, because you don't
> administer the entire server. All is not lost: there is a new standard
> for using HTML META tags to keep robots out of your documents. - -"
>
> (Of course, "sometimes" and "new" are somewhat funny words in this
> context.)
>
> --
> Yucca, http://www.cs.tut.fi/~jkorpela/
> Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


Yucca has my respect, his answers are good, but Headless is no dummy either.
Has Headless conceded defeat on this one? Anyway, I will be adding the meta
tag exclusion thing to every page. Thanks to everyone who helped.
David


 
Jukka K. Korpela
      06-29-2003
Headless <(E-Mail Removed)> wrote:

> Afaics you read too much into references to "/" and "only a server
> administrator can maintain such a list". "/" refers to the root of
> my web space, and I am the "server administrator" (virtually).


No, I don't. The meaning of a URL that begins with "/" is well-defined
in URL specifications, and this part of the specs is honored by all
relevant parties. The meaning of "/robots.txt" only depends on the
server part of the base address, and the meaning is
http://www.sample.example/robots.txt
where www.sample.example is the server part of the base address.
There's no vagueness here. Ref.: RFC 2396.

And the Robots Exclusion Standard defines that URL only as the
residence of the file for exclusion specifications.
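
As a rough illustration of the resolution rule described above (a sketch, not part of the original post): Python's standard urllib.parse.urljoin applies the same relative-reference rules, so resolving "/robots.txt" against any page address on a host yields that host's root file. The helper name robots_url and the page paths are made up for the example; www.sample.example is the placeholder host used above.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Resolve the relative reference "/robots.txt" against a page URL.

    Only the scheme and host of page_url survive; the path is replaced.
    """
    return urljoin(page_url, "/robots.txt")


# Hypothetical page addresses on one host; all map to the same file.
for page in (
    "http://www.sample.example/",
    "http://www.sample.example/somestuff/page.html",
    "http://www.sample.example/~user/index.html",
):
    print(page, "->", robots_url(page))
```

Whatever directory the page sits in, the result depends only on the server part of the address, which is the point being argued here.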

> Afaik there is no way for a robot to access the physical server
> root (as opposed to the virtual server root).


The only thing that a robot, or a browser for that matter, knows and
cares about is that it sends a request for
http://www.sample.example/robots.txt
How the server www.sample.example processes it is its business. For all
that robots (or browsers) can know, the server might pick up file
vdsdghuigae.fig from folder yhftgy\dahjks\fhgj, transmogrify its
content, and send back the result. Or it might run a server-side script
to generate something. Or it might connect to typing machines operated
by chimpanzees and record and send back what they are currently
producing.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
David Graham
      06-29-2003

"lostinspace" <(E-Mail Removed)> wrote in message
news:UNALa.3268$(E-Mail Removed) .com...
> ----- Original Message -----
> From: David Graham <>
> Newsgroups: alt.html
> Sent: Saturday, June 28, 2003 6:23 AM
> Subject: robot.txt
>
>
> > Hi
> > I have a folder on my site that I use to practice on, I don't want robots
> > indexing this folder. I believe the meta tag is not as good as a robot.txt
> > file. I would like to use a robot.txt file but...
> >
> > 1. What is the syntax of the line that I write to prevent access to a
> > folder (the folder is called 'sefriendly' and it lives off the root
> > folder, which is called 'www')
> >
> > 2. In which folder is the robot.txt file stored?
> >
> > thanks
> >
> > David
> >
>
> David,
> Perhaps it's just an off day for most folks?
> I've seen some very knowledgeable folks here provide incomplete information.
>
> Robots.txt will NOT ban any robot.
> Instead, it is a "suggestion" that honorable bots comply with.
> Most dishonorable bots won't read your robots.txt anyway. Any path in there
> will only point them towards the possibly hidden and unprotected direction.
> Jdmorgan has some extensive suggestions on robots:
> http://www.webmasterworld.com/forum23/2200.htm
>
> On the other hand, if you're interested in banning bots and denying them
> admission, then in most instances that requires the use of htaccess.
> See the "Close to Perfect Ban"
> http://www.webmasterworld.com/forum1...ht=perfect+ban a very
> long thread.
>

Thanks, I will read the links. I thought this robots.txt post would just be
a simple little matter - perhaps not!
thanks
David


 
Jukka K. Korpela
      06-29-2003
Jacqui or (maybe) Pete <(E-Mail Removed)> wrote:

> The spec at http://www.robotstxt.org isn't exactly clear on
> anything
>
> http://www.robotstxt.org/wc/norobots.html#method says:
>
> 'The method used to exclude robots from a server is to create a
> file on
> the server which specifies an access policy for robots. This file
> must be accessible via HTTP on the local URL "/robots.txt".'


The only thing that isn't quite clear IMHO is why they call it "local
URL" when they apparently mean _relative_ URL, which _must_ be globally
accessible of course. But URL terminology is generally confused, and
the intentions are clear.

> Now what does that mean? Take porjes.com/robots.txt [1]. Its
> intention is *not* to ask robots to exclude files from the server
> (ananke.affordablehost.com). However it _is_ accessible at the URL
> http://porjes.com/robots.txt.


By the robots exclusion standard, it _is_ such a resource that is to be
used for restricting robot access to any URLs that begin with
http://porjes.com/ (and only them). Physical servers are irrelevant in
URL considerations.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
Jukka K. Korpela
      06-30-2003
Headless <(E-Mail Removed)> wrote:

> please clarify the following phrases:
>
> _server root_


The address http://www.foo.example/ or the physical directory
corresponding to it, depending on whether you consider the situation
from the robot and client perspective or the author perspective.

> Normal authors


The majority of Web authors who just create (and possibly maintain)
pages and try to avoid knowing about any server issues.

> own server


A server controlled by the person in question.

> Folk in this group typically host their websites on a shared
> server. This presents no problems with regard to using a robots.txt
> as long as they have their own domain or if the site has this type
> of url: http://www.user.host.com


Folk in this group maybe (I have no statistics on this), but surely
most people who create pages just put them somewhere without owning a
domain.

In the situation you describe, though perhaps not with the particular
URL you mention (domain host.com exists, but subdomain user.host.com
doesn't [there's an implicit hint here, suggesting that sample URLs
should be flagged as such using .example]), the author has control over
the server root. So I was inexact in that "unless they run their own
server", in the sense that it need not be a separate HTTP server but
can be a server "only" from the viewpoint of everyone else.

> The only situation that does present a problem is if the site has
> this type of url: http://www.host.com/~user


In that particular case, it simply depends on
http://www.host.com/.htaccess, which does not currently exist.

But there is a _very_ common situation where an author has control over
a single page, or set of pages, like
http://www.foo.example/somestuff/...
where ... denotes an arbitrary string. If he creates
http://www.foo.example/somestuff/robots.txt
it won't affect normal indexing robots in the least (though it might
affect Atomz). He would need to talk to (E-Mail Removed) to make
her modify http://www.foo.example/robots.txt. Or, more realistically,
he would just use <meta name="robots" ...> tags.
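
To show what a robot-side check of such a tag might look like, here is a minimal sketch using Python's standard html.parser (not part of the original post). The content value "noindex, nofollow" in the usage line is an assumed example, since the post leaves the attribute elided, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Assumed example page; a robot honoring the tag would skip indexing it.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print("noindex" in robots_directives(page))
```

Unlike /robots.txt, this works per page, so it needs no access to the server root, but the robot has to fetch the page before it can see the request.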

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


 
PeterMcC
      06-30-2003
David Graham wrote:
> "Jukka K. Korpela" <(E-Mail Removed)> wrote in message
> news:Xns93A9A09F49E7Ajkorpelacstutfi@193.229.0.31. ..
>> Jacqui or (maybe) Pete <(E-Mail Removed)> wrote:
>>
>>> The spec at http://www.robotstxt.org isn't exactly clear on
>>> anything

>
> I can't follow most of this thread, could you very simply, in
> non-technical jargon, just confirm if robots.txt is any good or not!
> If it helps, I own the domain
> http://www.catalysys.co.uk
> which is hosted by phpwebhosting.


As far as your implementation of the robots.txt file is concerned, it looks
to be the correct way to *ask* the spiders not to index the sefriendly
folder.

User-agent: *
Disallow: /sefriendly/

Most search engines seem to adhere to the rules but, as has been pointed
out, robots.txt doesn't present any barrier other than putting up a keep-out
sign.
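
For anyone who wants to check how a well-behaved robot would read those two lines, here is a minimal sketch using Python's standard urllib.robotparser (an illustration, not part of the original post). The host www.sample.example, the agent name "ExampleBot" and the helper name allowed are placeholders chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, exactly as they would appear in /robots.txt.
RULES = """\
User-agent: *
Disallow: /sefriendly/
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URLs on a placeholder host.
print(allowed("http://www.sample.example/index.html"))
print(allowed("http://www.sample.example/sefriendly/test.html"))
```

This only models what a compliant robot decides for itself; nothing here stops a robot that never asks.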

If you don't have a link to the page from an already spidered site, your
sefriendly directory won't be found anyway - robots.txt or not.

And, if you really want to be safe, you could always password protect the
directory with .htaccess - dead easy and the spiders don't get past the
password protect.

--
PeterMcC
If you feel that any of the above is incorrect,
inappropriate or offensive in any way,
please ignore it and accept my apologies.

 
Headless
      06-30-2003
"Jukka K. Korpela" <(E-Mail Removed)> wrote:

>> please clarify the following phrases:
>>
>> _server root_

>
>The address http://www.foo.example/ or the physical directory
>corresponding to it, depending on whether you consider the situation
>from the robot and client perspective or the author perspective.


"Server root" means something entirely different from a sysadmin angle.
I suggest using a different terminology to remove the ambiguity,
"(sub)domain root" seems more appropriate.

>> Normal authors

>
>The majority of Web authors who just create (and possibly maintain)
>pages and try to avoid knowing about any server issues.


Assuming that "The majority of Web authors" use
http://www.host.com/~user url's is a very bold claim.

>> own server

>
>A server controlled by the person in question.


I don't control any "server", yet usage of robots.txt on my site is
fully valid, correct and functioning.

>> Folk in this group typically host their websites on a shared
>> server. This presents no problems with regard to using a robots.txt
>> as long as they have their own domain or if the site has this type
>> of url: http://www.user.host.com

>
>Folk in this group maybe (I have no statistics on this), but surely
>most people who create pages just put them somewhere without owning a
>domain.


Again there is a risk of ambiguity here: http://www.user.host.com should
be labeled as a "sub-domain". It's not registered anywhere and it's not
portable, so you certainly cannot call it "owning a domain".

>> The only situation that does present a problem is if the site has
>> this type of url: http://www.host.com/~user

>
>In that particular case, it simply depends on
>http://www.host.com/.htaccess, which does not currently exist.


I don't see how the robots.txt convention relates to Apache .htaccess
files. Regardless of any .htaccess file anywhere,
http://www.host.com/~user would resolve to
http://www.host.com/robots.txt for compliant clients looking for a
robots.txt.

>http://www.foo.example/somestuff/robots.txt
>it won't affect normal indexing robots the least (though it might
>affect Atomz).


You have not provided any evidence that Atomz does not follow the
correct procedure for retrieving a robots.txt. It works correctly on my
site because it should (all my sites use http://www.user.host.com urls).


Headless

 