Stopping Automated Web Robots Visiting ASP/ASP.NET Websites
by Brett Burridge

Obfuscating content

While stopping robots from visiting is one solution, the other is to make your website a lot less useful to them. This can be achieved either by making the website structure difficult to navigate, or by obfuscating the content so that it is more difficult to parse.

Obfuscating the content of the website

Using ASP.NET

A straightforward way of making life more difficult for robots is to use the .NET Framework. The HTML produced by ASP.NET can be more difficult to parse than that created using classic ASP, particularly if the content the robots are interested in is only displayed after a form is posted back. The .NET Framework gives form fields auto-generated names such as _ctl10__ctl1_DropDownListPrice, and these names can change between requests if the page contains a different number of controls each time it is viewed, or if it contains controls with many child controls, such as DataGrids.

Using JavaScript

As mentioned previously, few (if any) robots are able to execute JavaScript. Building the website's navigation scheme in JavaScript can therefore be used to hide the website's navigation structure from robots. This does, of course, have the consequence of making the website's content less visible to search engine robots. The JavaScript navigation will also only work in web browsers where JavaScript is enabled, and there are accessibility issues to consider.
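
As a rough sketch of this approach (the page names are purely illustrative), a classic ASP fragment could write the navigation links out via client-side JavaScript rather than as plain anchor tags, so a robot that does not execute JavaScript sees no links to follow in the HTML source:

<%
' Hypothetical page names used purely for illustration.  A robot
' parsing the raw HTML sees a script block rather than anchor tags.
Dim arrPages, i
arrPages = Array("products.asp", "prices.asp", "contact.asp")
%>
<script type="text/javascript">
<% For i = 0 To UBound(arrPages) %>
document.write('<a href="<%= arrPages(i) %>"><%= arrPages(i) %><\/a><br>');
<% Next %>
</script>
<noscript>Please enable JavaScript to view the site navigation.</noscript>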

Blocking robot user-agents

Most requests made to a web server contain a description of the web browser or automated web robot being used - the "user agent string." This description can be accessed via the HTTP_USER_AGENT server variable, Request.ServerVariables("HTTP_USER_AGENT"), in either VBScript in classic ASP or VB.NET in ASP.NET. Most legitimate robots will identify themselves. For example, Google's content retrieval robot identifies itself as

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).

A web browser will generally identify itself as something like

Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0).
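
As a minimal sketch (the list of robot signatures here is illustrative, not exhaustive), a classic ASP page could check the user agent for common robot keywords and refuse the request:

<%
Dim strUA
strUA = LCase(Request.ServerVariables("HTTP_USER_AGENT"))

' Illustrative (and far from exhaustive) list of robot signatures
If InStr(strUA, "googlebot") > 0 Or InStr(strUA, "slurp") > 0 Or _
   InStr(strUA, "msnbot") > 0 Or InStr(strUA, "crawler") > 0 Or _
   InStr(strUA, "spider") > 0 Then
    Response.Status = "403 Forbidden"
    Response.Write "Automated robots may not view this page."
    Response.End
End If
%>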

However, there are now so many variants of the user agent string that it can be difficult to keep track of them all. Classic ASP ships with a Browser Capabilities component that can be used to identify web browsers, but it relies on manually updating the server's browscap.ini file as new web browsers are released.

Commercial alternatives to the Browser Capabilities component are often much better at identifying user agents. Of the various commercial offerings, BrowserHawk is probably the best known. Its ASP component contains a Crawler property that can be used to determine whether the client is a robot. A review of BrowserHawk is available on the ASPAlliance website.
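
As a sketch only - the ProgID "cyScape.browserObj" is taken from BrowserHawk documentation of the period and should be verified, and the redirect target is hypothetical - the Crawler property might be used from classic ASP like this:

<%
' Assumes BrowserHawk is installed on the server; check the current
' documentation for the correct ProgID before relying on this.
Dim bhObj
Set bhObj = Server.CreateObject("cyScape.browserObj")
If bhObj.Crawler Then
    ' Treat the visitor as a robot (target page is hypothetical)
    Response.Redirect "robots-not-allowed.asp"
End If
%>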

While the user agent string can in theory be used to identify and block robots, robot operators can easily "fake" it. The usual method is to present the user agent string of a commonly used web browser, such as Internet Explorer 6 on Windows. The web server is then unable to distinguish the robot from normal website users unless more sophisticated detection techniques are employed.

A further problem is that an increasing number of proxy servers are configured to strip information, such as the user agent string, from the request, so it is not uncommon to see the user agent masked or absent altogether.

Robot honey pot

Since the user agent string is open to abuse, a more sophisticated method of stopping robots is required.

One way of achieving this is to look for website visitors that request a high ratio of pages to other content such as images. Robots are primarily interested in text content, so this is a good way of identifying them. The downside is that this is not straightforward to accomplish in ASP or ASP.NET, but it can be done by analyzing the web server's log files; Microsoft's Log Parser is well suited to this kind of analysis. Alternatively, the analysis could potentially be done in near real time by using an ISAPI filter to log requests as they are made to the web server. Logging website requests to SQL Server is another option, although for large websites this would require substantial SQL Server resources to cope with the volume of data generated.

A variant of this is to look for website visitors that just request the dynamic parts of the site.  For example, an online store may have product catalog pages that robots will tend to visit in order to extract the product details and republish on another site, such as a shopping comparison site.  The exact pattern of robot usage will tend to vary depending on the type of content offered by the website.
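
Either pattern can be looked for once requests are being recorded. As an illustrative sketch of the SQL Server logging approach mentioned above (the connection string, the WebStats database and the RequestLog table are all hypothetical), an include file placed in each page could record the details of every request for later analysis:

<%
' Hypothetical logging include - assumes a WebStats database with a
' RequestLog table (IPAddress, UserAgent, Url, RequestDate) already exists.
Dim objConn, strSQL
Set objConn = Server.CreateObject("ADODB.Connection")
objConn.Open "Provider=SQLOLEDB;Data Source=(local);" & _
             "Initial Catalog=WebStats;Integrated Security=SSPI;"

strSQL = "INSERT INTO RequestLog (IPAddress, UserAgent, Url, RequestDate) VALUES ('" & _
         Replace(Request.ServerVariables("REMOTE_ADDR"), "'", "''") & "', '" & _
         Replace(Request.ServerVariables("HTTP_USER_AGENT"), "'", "''") & "', '" & _
         Replace(Request.ServerVariables("URL"), "'", "''") & "', GETDATE())"
objConn.Execute strSQL

objConn.Close
Set objConn = Nothing
%>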

Instead of looking through log files, an alternative way of identifying robots is to put a hidden link on a page that only robots will follow. This link takes the robot to an ASP page that logs its IP address to a database. This technique is not effective against robots that only visit specific pages within the website, but it is reasonably good at identifying robots that crawl entire websites.
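
A rough sketch of this technique follows; the robottrap.asp page, the connection string and the RobotVisits table are all hypothetical, and the trap page should also be disallowed in robots.txt so that well-behaved robots do not fall into it. Each ordinary page carries a link that human visitors will not see:

<a href="robottrap.asp" style="display:none">&nbsp;</a>

robottrap.asp then records the visitor's details:

<%
' Hypothetical trap page - records the visitor and shows nothing useful.
Dim objConn
Set objConn = Server.CreateObject("ADODB.Connection")
objConn.Open "Provider=SQLOLEDB;Data Source=(local);" & _
             "Initial Catalog=WebStats;Integrated Security=SSPI;"
objConn.Execute "INSERT INTO RobotVisits (IPAddress, UserAgent, VisitDate) VALUES ('" & _
    Replace(Request.ServerVariables("REMOTE_ADDR"), "'", "''") & "', '" & _
    Replace(Request.ServerVariables("HTTP_USER_AGENT"), "'", "''") & "', GETDATE())"
objConn.Close
Set objConn = Nothing
%>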

Once a robot has been identified, it can be blocked from the site. The usual method is to refuse requests from the robot's IP address.
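
As a final sketch, an include file placed at the top of every page could compare the visitor's IP address against the list of blocked addresses gathered above (the connection string and BlockedIPs table are hypothetical):

<%
' Hypothetical blocking include - assumes a BlockedIPs table populated
' from the honey pot or log file analysis described above.
Dim objConn, objRS, blnBlocked
Set objConn = Server.CreateObject("ADODB.Connection")
objConn.Open "Provider=SQLOLEDB;Data Source=(local);" & _
             "Initial Catalog=WebStats;Integrated Security=SSPI;"
Set objRS = objConn.Execute("SELECT IPAddress FROM BlockedIPs WHERE IPAddress = '" & _
    Replace(Request.ServerVariables("REMOTE_ADDR"), "'", "''") & "'")
blnBlocked = Not objRS.EOF
objRS.Close
objConn.Close
Set objConn = Nothing

If blnBlocked Then
    Response.Status = "403 Forbidden"
    Response.Write "Access from this address has been blocked."
    Response.End
End If
%>

For a short, static list of addresses, the IP address restrictions built into IIS can be used instead of a database lookup.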

