While stopping robots from visiting is one solution, the
other is to make your website a lot less useful to them. This can be achieved
either by making the website's structure difficult to navigate, or by obfuscating
the content so that it is harder to parse and extract.
Obfuscating the content of the website
Using ASP.NET
A straightforward way of making life more difficult for
robots is to use the .NET Framework. The HTML produced by ASP.NET can be more
difficult to parse than that created using classic ASP. This is particularly so
if the content the robots are interested in can only be displayed after a form
is posted back. ASP.NET gives form fields auto-generated names such as
_ctl10__ctl1_DropDownListPrice, and these names can change between requests if the page
contains a different number of controls each time it is viewed, or if it contains
controls made up of many child controls, such as DataGrids.
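As a rough sketch of the postback point (the markup and the GetPrice function below are made up for illustration), the price detail in this ASP.NET page is only rendered once the form has been posted back, so a robot that simply downloads the URL never sees it:

    <%@ Page Language="VB" %>
    <script runat="server">
        Sub Page_Load(sender As Object, e As EventArgs)
            If Page.IsPostBack Then
                ' Robots that do not post the form back (complete with its
                ' __VIEWSTATE field) never reach this branch.
                LabelPrice.Text = GetPrice(DropDownListPrice.SelectedItem.Value)
            End If
        End Sub

        Function GetPrice(productOption As String) As String
            ' Stand-in for a real price lookup.
            Return "9.99"
        End Function
    </script>
    <form runat="server">
        <asp:DropDownList id="DropDownListPrice" runat="server">
            <asp:ListItem>Standard</asp:ListItem>
            <asp:ListItem>Deluxe</asp:ListItem>
        </asp:DropDownList>
        <asp:Button id="ButtonShowPrice" runat="server" Text="Show price" />
        <asp:Label id="LabelPrice" runat="server" />
    </form>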
Using JavaScript
As mentioned previously, few (if any) robots are able to
execute JavaScript. Building the website's navigation using JavaScript can
therefore hide the site's structure from robots. This does, of course, have the
consequence of making the website's content less visible to search engine robots
as well. A JavaScript navigation system will also only work in web browsers where
JavaScript is enabled, and there are accessibility issues to consider.
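A minimal sketch of the idea (the page names here are invented): instead of plain anchor tags that a crawler can harvest from the HTML, the navigation links are written into the page by client-side script, so they only appear for visitors that execute JavaScript:

    <script type="text/javascript">
        // No <a href="..."> tags appear in the static HTML; the links are
        // created only when the script runs in the browser.
        var pages = [["Products", "products.asp"], ["Prices", "prices.asp"]];
        for (var i = 0; i < pages.length; i++) {
            document.write('<a href="' + pages[i][1] + '">' + pages[i][0] + '<\/a> ');
        }
    </script>
    <noscript>Please enable JavaScript to use the site navigation.</noscript>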
Blocking robot user-agents
Most requests made to a web server will contain a
description of the web browser or automated web robot being used - the
"user agent string." This description can be accessed via the HTTP_USER_AGENT
server variable, Request.ServerVariables("HTTP_USER_AGENT"), in
either VBScript in classic ASP or VB.NET in ASP.NET. Most legitimate robots
will identify themselves. For example, Google's content retrieval robot
identifies itself as
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
A web browser will generally identify itself as something
like
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0).
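A minimal classic ASP sketch of acting on the user agent string (the list of robot names to look for is illustrative only):

    <%
    Dim userAgent, robotNames, i, isRobot
    userAgent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
    robotNames = Array("googlebot", "slurp", "msnbot")   ' illustrative list
    isRobot = False
    For i = 0 To UBound(robotNames)
        If InStr(userAgent, robotNames(i)) > 0 Then isRobot = True
    Next
    If isRobot Then
        Response.Status = "403 Forbidden"
        Response.End
    End If
    %>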
However, there are now so many variants of the user agent
string that it can be difficult to keep track of them all. Classic ASP has long
had a Browser Capabilities component that can identify web browsers, but it
relies on manually updating the server's browscap.ini file as new web browsers
are released.
Commercial alternatives to the Browser Capabilities
component are often much better at identifying user agents. Of the
various commercial offerings, BrowserHawk
is probably the best known. Its ASP component contains a Crawler property that
can be used to determine if the client is a robot. A review of BrowserHawk is
available on the ASPAlliance website.
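In classic ASP the check itself is short; the sketch below assumes BrowserHawk is installed and registered under the cyScape.browserObj ProgID (this may differ between versions, so check the documentation for the copy you have):

    <%
    Dim objBH
    Set objBH = Server.CreateObject("cyScape.browserObj")   ' ProgID may vary by version
    If objBH.Crawler Then
        ' Serve a cut-down page, or refuse the request outright.
        Response.Status = "403 Forbidden"
        Response.End
    End If
    %>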
While the user agent string can in theory be used to identify
and block robots, it is easy for robot operators to fake it. The usual approach
is to send the user agent string of a commonly used web browser, such as
Internet Explorer 6 on Windows. The web server is then unable to distinguish the
robot from normal website users unless more sophisticated robot detection
techniques are employed.
A further problem is that an increasing number of proxy
servers are configured to strip information such as the user agent string from
the request, so it is not uncommon to see the user agent masked or absent
altogether.
Robot honey pot
Since the user agent string is open to abuse, a more
sophisticated method of stopping robots is required.
One way of achieving this is to look for website visitors
that request a high ratio of pages to other content such as images. Robots are
primarily interested in text content, so this is a good way of identifying them.
The downside is that this is not straightforward to do from within ASP or
ASP.NET, but it can be done by analyzing the web server's log files; Microsoft's
Log Parser is well suited to this kind of analysis. Alternatively, the analysis
could potentially be done in near real time by using an ISAPI filter to log
requests as they are made to the web server. Requests could also be logged to a
SQL Server database, but for large websites this would require substantial SQL
Server resources to cope with the volume of data generated.
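For example, assuming standard IIS logs in W3C extended format, a Log Parser query along the following lines (wrapped here for readability; it would be passed on one line or in a query file) lists the client addresses requesting the most .asp pages, and a second run filtered to image extensions gives the other side of the ratio:

    logparser -i:IISW3C "SELECT c-ip, COUNT(*) AS PageHits
        FROM ex*.log
        WHERE EXTRACT_EXTENSION(TO_LOWERCASE(cs-uri-stem)) = 'asp'
        GROUP BY c-ip
        ORDER BY PageHits DESC"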
A variant of this is to look for website visitors that only
request the dynamic parts of the site. For example, an online store may have
product catalog pages that robots will tend to visit in order to extract the
product details and republish them on another site, such as a shopping comparison
site. The exact pattern of robot usage will vary depending on the type of content
the website offers.
Instead of looking through log files, an alternative way of
identifying robots is to put a hidden link on a page that only robots will
follow. This link then takes the robot to an ASP page that logs its IP
address to a database. This technique is of course not effective against
robots that only visit specific pages within the website, but it is reasonably
good at identifying robots that crawl entire websites.
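A minimal sketch of the trap, assuming a hypothetical BlockedRobots table and a connection string stored in an Application variable. The link is invisible to human visitors but will be followed by a crawler, and the page it points at records the caller's address:

    <!-- On every normal page: -->
    <a href="robottrap.asp" style="display:none">&nbsp;</a>

    <%
    ' robottrap.asp
    Dim conn
    Set conn = Server.CreateObject("ADODB.Connection")
    conn.Open Application("ConnectionString")               ' hypothetical app setting
    conn.Execute "INSERT INTO BlockedRobots (IPAddress, FirstSeen) " & _
                 "VALUES ('" & Request.ServerVariables("REMOTE_ADDR") & "', GETDATE())"
    conn.Close
    Set conn = Nothing
    %>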
Once a robot has been identified, it can be blocked from
the site. The usual method is to refuse requests from the robot's IP
address.
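As a sketch, a check like the following could sit in an include file at the top of every ASP page, refusing requests from addresses recorded by the honey pot; the BlockedRobots table and the Application("ConnectionString") setting are the same hypothetical ones used above:

    <%
    Function IsBlockedIP(ipAddress)
        Dim conn, rs
        Set conn = Server.CreateObject("ADODB.Connection")
        conn.Open Application("ConnectionString")           ' hypothetical app setting
        Set rs = conn.Execute("SELECT COUNT(*) FROM BlockedRobots " & _
                              "WHERE IPAddress = '" & ipAddress & "'")
        IsBlockedIP = (rs(0) > 0)
        rs.Close
        conn.Close
    End Function

    If IsBlockedIP(Request.ServerVariables("REMOTE_ADDR")) Then
        Response.Status = "403 Forbidden"
        Response.End
    End If
    %>

Blocking at this level still costs a database lookup on every request; for persistent offenders the IP address can instead be blocked at the firewall or through IIS's own IP address restrictions.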