The Web Robots Exclusion Standard
There is a semi-official standard for preventing robots from
visiting all or part of a website: the Standard for Robot Exclusion,
documented at http://www.robotstxt.org/wc/norobots.html.
The standard proposes that a website wishing to control the behavior of
visiting robots should do so through a plain text file named robots.txt
placed in the root of the web server (e.g. http://www.foo.com/robots.txt).
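For example, a robots.txt file along the following lines asks all robots to
stay out of a /private/ directory (the directory name here is purely
illustrative):

User-agent: *
Disallow: /private/

The User-agent line names the robot the rules apply to (* matches all
robots), and each Disallow line gives a path prefix the robot should not
request.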
Unfortunately, the Standard for Robot Exclusion has never been ratified by
any official Internet standards body, and robots are under no obligation to
follow the guidelines in a robots.txt file. Consequently, a robots.txt file
will deter only the most well-behaved robots and is of very limited use
against the rest.
The robots meta tag
Although the web robots exclusion standard is useful for
stopping certain robots from visiting an entire website or sections of it,
it is not well suited to stopping robots from visiting individual pages.
The other drawback is that a robots.txt file must be placed in the root
folder of the website, which is not always possible, depending on the
configuration of the web hosting plan or the internal IT regulations of a
large corporation.
For this reason it is sometimes better to use the robots
meta tag in individual pages of the website. The HTML required to stop a
robot from indexing a page is:
<meta name="robots" content="noindex">
This HTML should be placed within the <head> element of the
document.
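For instance, the tag might appear in a page as follows (the page title is
illustrative):

<html>
<head>
<title>Members only</title>
<meta name="robots" content="noindex">
</head>
<body>
...
</body>
</html>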
It is also possible to stop a robot from following the links
from a particular document using the following syntax.
<meta name="robots" content="nofollow">
The two instructions can also be combined in a single meta
tag.
<meta name="robots" content="noindex, nofollow">
However, like the robots.txt file, these meta tags will deter only the most
well-behaved robots.
Make registration mandatory
If you have valuable content on your website and it is
appropriate to do so, it may be worthwhile to make all or part of the
website's content accessible only after a user has logged in.
The main drawback of doing this is that blocking robots
will also stop search engines' own web robots from visiting the website's
content, which will make your website less visible in search engine
indexes. If a significant portion of your website's revenue-earning traffic
comes from search engine referrals, this technique will obviously be
counter-productive.
Do not forget that many web robots can be trained to
"log in" to websites provided they have a set of valid login
credentials, so it is essential to include some mechanism for distinguishing
between human and robotic visitors. A common means of achieving this is a
graphical sequence of characters that a user has to type into the form
before submission (i.e. a CAPTCHA; see http://www.captcha.net/).
Robots are also rarely able to execute JavaScript, so the registration or
login process can instead be made to depend on the execution of a particular
JavaScript function.
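As a rough sketch of the JavaScript approach, a login form could include a
hidden field that is populated only when a script runs in the browser; the
server then rejects any submission in which the field is empty. The field
name and token value below are purely illustrative, and a real deployment
would use an unpredictable, server-verified token:

<form action="/login" method="post" onsubmit="return addToken(this);">
Username: <input type="text" name="username"><br>
Password: <input type="password" name="password"><br>
<!-- Populated by JavaScript just before submission; a robot that does
     not execute JavaScript submits this field empty and is rejected. -->
<input type="hidden" name="jstoken" value="">
<input type="submit" value="Log in">
</form>
<script type="text/javascript">
function addToken(form) {
  form.jstoken.value = "human";  // illustrative value only
  return true;                   // allow the submission to proceed
}
</script>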
Slowing robots down
An alternative to stopping robots altogether is to slow them
down. Many of the common legitimate robots that visit websites and obey the
robots exclusion protocol can be slowed down. For example, to slow down
Yahoo!'s robot so that it requests URLs with reduced frequency, the following
lines can be added to the robots.txt file.
User-agent: Slurp
Crawl-delay: 10
Note that Crawl-delay is measured in seconds. The web page http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
contains further details about slowing down this particular robot.
Unfortunately, there is no agreed standard for slowing down
robots, so it has to be implemented on a robot-by-robot basis.
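For example, a robots.txt file that slows down more than one robot must
address each by name. The delay values below are illustrative, and each
robot's own documentation should be checked to confirm that it honours the
directive:

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 20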
For robots that do not understand any instruction to slow
down, it is possible to force them to do so. This could be achieved by
writing a custom add-on to the website that introduces a delay in returning
content whenever a specific visitor makes more than a certain number of
requests within a given time period.
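A minimal sketch of such an add-on is shown below, written here as a
stand-alone Node.js server; the threshold, window, and delay values are
assumptions and would need tuning for a real site:

const http = require('http');

const MAX_REQUESTS = 30;      // requests allowed per window (assumed value)
const WINDOW_MS = 60 * 1000;  // one-minute measurement window (assumed value)
const DELAY_MS = 5 * 1000;    // delay imposed on over-active clients (assumed value)

const hits = new Map();       // client IP address -> timestamps of recent requests

http.createServer(function (req, res) {
  const ip = req.socket.remoteAddress;
  const now = Date.now();

  // Keep only the timestamps that fall inside the current window.
  // (A production version would also prune entries for idle clients.)
  const recent = (hits.get(ip) || []).filter(function (t) {
    return now - t < WINDOW_MS;
  });
  recent.push(now);
  hits.set(ip, recent);

  function respond() {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Hello\n');
  }

  if (recent.length > MAX_REQUESTS) {
    // The client has exceeded the threshold: hold the connection open
    // and delay the response rather than refusing it outright.
    setTimeout(respond, DELAY_MS);
  } else {
    respond();
  }
}).listen(8080);

Note that each delayed response holds its connection open for the duration
of the delay, which is one source of the resource cost discussed at the end
of this section.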
As an alternative to writing a custom add-on, commercial
offerings are available that accomplish the same thing. The Slow Down
Manager ASP.NET component within VAM: Visual Input Security can slow down
anyone who makes repeated requests for pages, and can be configured to deny
access altogether once a certain number of requests is exceeded. Further
details about the Slow Down Manager are available from http://www.peterblum.com/VAM/VISETools.aspx#SDM.
While slowing down robots is in theory a good solution, it
is fraught with difficulties. For example, most robots can be configured to
visit websites at preset intervals, so if the robot's operator notices that
it is being slowed down, they can simply increase the interval between
visits. Slowing down visitors based on IP address may also degrade response
times for legitimate users behind the same web cache or proxy server as the
robot. Finally, introducing a delay into the response ties up server
resources for the duration of each delay.