Velocity Reviews - Computer Hardware Reviews

Stopping robots searching particular page

 
 
dorayme
      09-11-2007
A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if it was not 100% possible, but it would be best
if they did not come up in search engines. (A sort of provision
by a company for making some files available to those who have
the address. Company does not want password protection; but I am
considering persuading them).

What is the simplest and most effective way of stopping robots
searching particular HTML pages on a server? I am looking for an
actual example and clear instructions. I am getting confused by
looking at http://www.searchtools.com/index.html, though doubtless
I will get less confused after much study.

--
dorayme
 
Jukka K. Korpela
      09-11-2007
Scripsit dorayme:

> What is the simplest and most effective way of stopping robots
> searching particular HTML pages on a server?


Put the following into the head part of each of those pages:

<meta name="robots" content="noindex">

Replace "noindex" by "noindex, nofollow" if you also want to stop robots
from following any links on the page (i.e. from finding new indexable pages
through it).

This follows the de facto standard (Robots Exclusion Standard) that has long
been obeyed by well-behaved indexing robots. There is not much you
can do about ill-behaved robots.
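
As a rough illustration of the advice above, the sketch below uses Python's standard html.parser module to read the robots directives out of a page, which is roughly what a cooperating robot has to do before deciding whether to index it. The sample page, its title, and the names RobotsMetaParser and robots_directives are made up for this example; only the meta element itself comes from the advice above. Real robots differ in the details, so treat this as a way to check your own pages, not as a description of any particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical page; only the meta element is taken from the thread.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Files for partners</title>
<meta name="robots" content="noindex, nofollow">
</head>
<body><p>Not meant for search results.</p></body>
</html>"""

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("noindex" in found, "nofollow" in found)
```

Running the same function over a page without the meta element returns an empty set, which is a quick way to spot pages where the element was forgotten.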

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

 
Tina Peters
      09-11-2007

"dorayme" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
>A website is on a server. Just one or two of the pages are not
> for public consumption. They are not top secret and no big harm
> would be done if it was not 100% possible, but it would be best
> if they did not come up in search engines.


If it's not linked from any other webpage, in any way, it shouldn't be
spidered.

--Tina
--
AxisHOST.com - cPanel Hosting
BuyAVPS.com - VPS Accounts
Serving the web since 1997

 
Jukka K. Korpela
      09-11-2007
Scripsit Tina Peters:

> If it's not linked from any other webpage, in any way, it shouldn't be
> spidered.


Yet it may be spidered. Actually, it would be an interesting exercise in a
course on web issues to ask the students to list 10 possible situations
where the page might be spidered.

And to make the task a little more difficult, let's exclude the perhaps most
obvious scenario: someone who knows the page address submits it to a search
engine via its "Add URL" form.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

 
dorayme
      09-12-2007
In article <2sEFi.222058$(E-Mail Removed) >,
"Jukka K. Korpela" <(E-Mail Removed)> wrote:

> Scripsit dorayme:
>
> > What is the simplest and most effective way of stopping robots
> > searching particular HTML pages on a server?

>
> Put the following into the head part of each of those pages:
>
> <meta name="robots" content="noindex">
>
> Replace "noindex" by "noindex, nofollow" if you also want to stop robots
> from following any links on the page (i.e. from finding new indexable pages
> through it).
>
> This follows the de facto standard (Robots Exclusion Standard) that has long
> been obeyed by well-behaved indexing robots. There is not much you
> can do about ill-behaved robots.


Thank you. This is the level of exclusion that I want. Job done.

--
dorayme
 
Sherm Pendley
      09-12-2007
dorayme <(E-Mail Removed)> writes:

> What is the simplest and most effective way of stopping robots
> searching particular HTML pages on a server?


There are two popular "standards" (neither of which is a standard in
the formal sense). One uses <meta ...> elements in your HTML, and the
other uses separate robots.txt files. Both are described here:

<http://www.robotstxt.org/>

Both approaches depend on cooperative robots. For uncooperative robots,
all you can do is shout "klaatu barada nikto" and hope for the best.
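
For the robots.txt route, here is a minimal sketch using Python's standard urllib.robotparser module to check how a cooperating robot would read such a file. The host www.example.com, the two /partners/ paths, and the robot name ExampleBot are all hypothetical stand-ins; substitute the real addresses of the pages to be kept out of the indexes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, served from the root of the site.
# The two paths are made-up stand-ins for the private pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /partners/prices.html
Disallow: /partners/downloads.html
"""


def allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if a cooperating robot may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The listed page is off limits, the rest of the site is not.
    print(allowed(ROBOTS_TXT, "http://www.example.com/partners/prices.html"))
    print(allowed(ROBOTS_TXT, "http://www.example.com/index.html"))
```

One point to keep in mind: robots.txt is itself publicly readable, so listing a page there reveals its address to anyone who looks at the file.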

sherm--

--
Web Hosting by West Virginians, for West Virginians: http://wv-www.net
Cocoa programming in Perl: http://camelbones.sourceforge.net
 
dorayme
      09-12-2007
In article <(E-Mail Removed)>,
Sherm Pendley <(E-Mail Removed)> wrote:

> dorayme <(E-Mail Removed)> writes:
>
> > What is the simplest and most effective way of stopping robots
> > searching particular HTML pages on a server?

>
> There are two popular "standards" (neither of which is a standard in
> the formal sense). One uses <meta ...> elements in your HTML, and the
> other uses separate robots.txt files. Both are described here:
>
> <http://www.robotstxt.org/>
>
> Both approaches depend on cooperative robots. For uncooperative robots,
> all you can do is shout "klaatu barada nikto" and hope for the best.
>


Thanks. If I get any reports of the pages concerned being found
now that I have gone the meta route, I will look further into the
robots.txt approach.

(Actually, sherm, I started reading about this last before
posting my question, got restless and slightly confused and
thought, I know what to do, I will pop my head above the trench
line a mo and see if something comes back from alt.html to make
this thing stop buzzing around my brain. I know, it was a bit
reckless. But who dares... you know... <g>

I also have a search engine on the particular site concerned and
they have various masking procedures I have since looked into.)

--
dorayme
 
Travis Newbury
      09-12-2007
On Sep 11, 6:47 pm, "Jukka K. Korpela" <(E-Mail Removed)> wrote:
> > If it's not linked from any other webpage, in any way, it shouldn't be
> > spidered.

> Yet it may be spidered. Actually, it would be an interesting exercise in a
> course on web issues to ask the students to list 10 possible situations
> where the page might be spidered.


Well that's just a stupid assignment. The students might actually be
forced to learn something from it. What the heck is your problem
suggesting something where a student could learn...

 
Ben C
      09-12-2007
On 2007-09-11, Jukka K. Korpela <(E-Mail Removed)> wrote:
> Scripsit Tina Peters:
>
>> If it's not linked from any other webpage, in any way, it shouldn't be
>> spidered.

>
> Yet it may be spidered. Actually, it would be an interesting exercise in a
> course on web issues to ask the students to list 10 possible situations
> where the page might be spidered.
>
> And to make the task a little more difficult, let's exclude the perhaps most
> obvious scenario: someone who knows the page address submits it to a search
> engine via its "Add URL" form.


1. Someone posts the URL to a newsgroup.
2. You forget to turn off the webserver's AutoIndex or similar, so the
spider can just navigate its way to the url going through auto
generated directory indexes.

What are the other 8?
 
Jukka K. Korpela
      09-12-2007
Scripsit Ben C:

> 1. Someone posts the URL to a newsgroup.
> 2. You forget to turn off the webserver's AutoIndex or similar, so the
> spider can just navigate its way to the url going through auto
> generated directory indexes.
>
> What are the other 8?


To mention some other scenarios of having a page indexed without having been
linked to from any other web page*), here's one relatively obvious one and
one imaginary though realistic (we know such things are being done with
email addresses for spamming purposes):

3. The page _was_ linked to from another page.

4. An indexing robot generates URLs automatically, more or less at random,
and tries them. It might for example try servers known to exist and append
to the server name some strings that are known to be common for web pages,
like /help.htm, /news.html....

*) Of course an author cannot prevent linking by others. You tell the URL to
your friend, who tells it to his pal, who sets up a link. But this common
way of getting indexed against your will falls outside the current exercise.
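
Scenario 4 can be sketched in a few lines of Python. The host name is a hypothetical example, the two file names are the ones mentioned in the scenario, and the function name candidate_urls is made up here. The function only builds candidate addresses and does not fetch anything; it is meant to show why an unlinked page with a common name is not really hidden.

```python
from urllib.parse import urljoin

# File names taken from the scenario above; a real robot would
# presumably use a much longer list (assumption).
COMMON_NAMES = ["help.htm", "news.html"]


def candidate_urls(hosts, names=COMMON_NAMES):
    """Build guessable page addresses from known hosts and common file names."""
    return [urljoin(f"http://{host}/", name) for host in hosts for name in names]


if __name__ == "__main__":
    for url in candidate_urls(["www.example.com"]):
        print(url)
```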

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

 