Stopping Automated Web Robots Visiting ASP/ASP.NET Websites
by Brett Burridge

Techniques for Stopping Robots

The Web Robots Exclusion Standard

There is a semi-official standard for preventing robots from visiting all or part of a website: the Standard for Robot Exclusion, described at http://www.robotstxt.org/wc/norobots.html.  This standard proposes that a web server wishing to control the behavior of robots visiting the site should do so through a robots.txt text file placed in the root of the web server (e.g. http://www.foo.com/robots.txt).
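For example, a robots.txt file along the following lines asks all robots to keep out of a particular folder and asks one named robot to keep out of the site entirely (the folder name and the robot's user agent are invented for the purpose of illustration).

User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /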


Unfortunately the Standard for Robot Exclusion is not an official standard and has never been ratified by an official Internet organization.  Furthermore, robots are under no obligation to follow the guidelines in a robots.txt file.  Consequently, a robots.txt file will deter only the most well behaved robots.

The robots meta tag

Although the web robots exclusion standard is useful for stopping certain robots from visiting an entire website or parts of it, it is not really suited to stopping robots from visiting individual pages.  The other drawback is that the robots.txt file must be placed in the root folder of the website - something that is not always possible, depending on the configuration of the web hosting plan or the internal IT regulations of a large corporation.

For this reason it is sometimes better to use the robots meta tag in individual pages of the website.  The HTML required for stopping a robot indexing a page is

<meta name="robots" content="noindex"> .

This HTML should be placed within the <head> element of the document.

It is also possible to stop a robot from following the links from a particular document using the following syntax.

<meta name="robots" content="nofollow">

The two instructions can also be combined in a single meta tag.

<meta name="robots" content="noindex, nofollow">

However, as with the robots.txt file, the robots meta tag will deter only the most well behaved robots.

Make registration mandatory

If you have valuable content on your website and it is appropriate to do so, it may be worthwhile to make all or part of the website content only accessible once a user has logged in.

The main drawback of doing this is that it will also stop the search engines' own web robots from visiting the website's content, which will make the website less visible in search engine catalogs.  If your website relies on search engine referrals for a significant portion of its revenue-earning traffic, this technique will obviously be counter-productive.

Do not forget that many web robots can be trained to "log in" to websites provided they have a set of valid login credentials, so it is essential to include some mechanism for distinguishing between human and robotic visitors.  A common means of achieving this is a graphical sequence of characters that a user has to type into the form before submission (i.e. a CAPTCHA, see http://www.captcha.net/).  Robots are rarely able to execute JavaScript either, so configuring the registration or login process to rely on the execution of a particular JavaScript function is another option, as sketched below.
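As an illustration of the JavaScript approach, the login form could contain a hidden field that is only filled in by a script running in the browser, with the server-side code rejecting any submission in which the field has not been set.  The sketch below uses an ASP.NET HiddenField and an invented marker value; it is a simplified illustration of the idea rather than a robust defense, and it will also lock out legitimate visitors who have JavaScript switched off.

<%-- In the login page markup: a hidden field that only client-side script populates. --%>
<asp:HiddenField ID="JsCheck" runat="server" />
<script type="text/javascript">
    document.getElementById('<%= JsCheck.ClientID %>').value = 'js-enabled';
</script>

// In the code-behind: treat submissions where the field was not populated as robots.
protected void LoginButton_Click(object sender, EventArgs e)
{
    if (JsCheck.Value != "js-enabled")
    {
        return; // the script did not run, so this is probably a robot
    }
    // ...continue with the normal username/password check here...
}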

Slowing robots down

An alternative to stopping robots altogether is to slow them down.  Many of the common legitimate robots that visit websites and obey the robots exclusion protocol can be slowed down. For example, to slow down Yahoo!'s robot so that it requests URLs with reduced frequency, the following lines can be added to the robots.txt file.

User-agent: Slurp
Crawl-delay: 10

Note that Crawl-delay is measured in seconds.  The web page http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html contains further details about slowing down this particular robot.

Unfortunately, there is no agreed standard for slowing down robots, so it has to be implemented on a robot by robot basis.

For robots that do not understand any instruction to slow down, it is possible to force a delay on them.  This could be achieved by writing a custom add-on for the website that delays returning content whenever a particular visitor makes more than a certain number of requests within a given time period.
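As a sketch of how such an add-on might be written for ASP.NET, the following HttpModule counts the requests made from each IP address and pauses before returning content once a visitor exceeds a threshold.  The class name, the limits and the use of the ASP.NET cache as the counter store are illustrative assumptions rather than recommendations.

using System;
using System.Threading;
using System.Web;
using System.Web.Caching;

// Illustrative HttpModule that delays visitors who make an unusually large
// number of requests within a short period (all values are examples only).
public class SlowDownModule : IHttpModule
{
    private const int RequestLimit = 30;        // requests allowed...
    private const int WindowSeconds = 60;       // ...within this many seconds
    private const int DelayMilliseconds = 5000; // delay once the limit is exceeded

    public void Init(HttpApplication application)
    {
        application.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication application = (HttpApplication)sender;
        string cacheKey = "hits_" + application.Request.UserHostAddress;

        // Count the requests from this address within the current time window.
        object current = application.Context.Cache[cacheKey];
        int hits = (current == null) ? 1 : (int)current + 1;
        application.Context.Cache.Insert(cacheKey, hits, null,
            DateTime.Now.AddSeconds(WindowSeconds), Cache.NoSlidingExpiration);

        // Over the limit: make the caller wait before the page is served.
        if (hits > RequestLimit)
        {
            Thread.Sleep(DelayMilliseconds);
        }
    }

    public void Dispose()
    {
    }
}

The module would then be registered in the <httpModules> section of web.config.  Note that holding the request open with Thread.Sleep ties up server resources for the duration of the delay, which is one of the drawbacks discussed below.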

As an alternative to writing a custom add-on, it is possible to find commercial offerings that will accomplish the same.  The Slow Down Manager ASP.NET component within VAM: Visual Input Security is able to slow down anyone who makes repeated requests for pages and can be configured to deny them access to the pages if they make more than a certain number of requests.  Further details about the Slow Down Manager are available from http://www.peterblum.com/VAM/VISETools.aspx#SDM.

While slowing down robots is in theory a good solution, it is fraught with difficulties.  For example, most robots can be configured to visit websites at preset intervals.  If the robot user noticed it was being slowed down, they could simply increase the time interval between robot visits.  Slowing down website visitors based on IP address may also reduce response times for legitimate users who share a web cache or proxy server with the robot user.  Introducing a delay in the response also ties up server resources for as long as the delay lasts.

