ASPAlliance: Articles, reviews, and samples for .NET Developers
URL: http://aspalliance.com/articleViewer.aspx?aId=236&pId=-1
.NET Screen Scraping in depth
by Damian Manifold

Introduction

ASPAlliance has already published articles about data scraping; here we look at the different techniques in more depth. The .NET Framework provides the WebRequest class for accessing data over the web. We will look at two ways of using it: WebClient, a convenience class built on top of WebRequest, and HttpWebRequest, which derives from WebRequest and is paired with HttpWebResponse.

Both can do everything covered here; it is more a case of which to use for which job.

We will cover simple scraping, forms passed as a query, posted forms, headers, and cookies, and see which class comes out best.

Simple scraping

Here we look at scraping a simple page, where you only want to get the page back and do not need to send any data.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

At this level there are only small differences in how the two classes are used, and for the moment WebClient has a slight edge because its code is a little simpler.
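
The linked source files are not reproduced here, so the following is only a minimal sketch of what a simple scrape with each class might look like, assuming the .NET Framework. The module and function names are hypothetical; the URL is the source page listed under Examples.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module SimpleScrape
    ' WebClient: open a read stream on the URL and read it to the end.
    Function ScrapeWithWebClient(ByVal url As String) As String
        Dim client As New WebClient()
        Dim reader As New StreamReader(client.OpenRead(url))
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
        End Try
    End Function

    ' HttpWebRequest: create the request, ask for the response,
    ' then read the response stream to the end.
    Function ScrapeWithHttpWebRequest(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        Dim url As String = "http://authors.aspalliance.com/damianm"
        Console.WriteLine(ScrapeWithWebClient(url).Length)
        Console.WriteLine(ScrapeWithHttpWebRequest(url).Length)
    End Sub
End Module
```

The HttpWebRequest version needs one extra object (the response), which is roughly the "slightly simpler code" difference described above.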

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
http://authors.aspalliance.com/damianm the source page

Forms

You have seen how simple it is to scrape any page using either WebClient or HttpWebRequest. Now we will look at how to pass form data to the page you wish to scrape.

Here we are looking at passing the form data as a query on the URL.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

The difference between the two classes is now becoming apparent. So which is better, WebClient or HttpWebRequest? Since HttpWebRequest passes the data on the URL it is much simpler, but as the example shows, things can become unclear if you have many fields. WebClient, however, is much more structured in how it passes the data, and if you wanted you could always pass the data on the URL here too. So at the moment WebClient looks like the better solution.
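
As a rough illustration of that contrast, the sketch below passes two form fields as a query with each class. It assumes the structured approach refers to the WebClient.QueryString collection; the field names, values, URL and UTF-8 response encoding are all hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module QueryScrape
    ' WebClient: each field goes into the QueryString collection,
    ' and the class appends them to the URL for you.
    Function QueryWithWebClient(ByVal url As String) As String
        Dim client As New WebClient()
        client.QueryString.Add("field1", "value1")
        client.QueryString.Add("field2", "value2")
        ' Assumes the page is UTF-8 encoded.
        Return Encoding.UTF8.GetString(client.DownloadData(url))
    End Function

    ' HttpWebRequest: the query is built by hand and added to the URL.
    Function QueryWithHttpWebRequest(ByVal url As String) As String
        Dim fullUrl As String = url & _
            "?field1=" & Uri.EscapeDataString("value1") & _
            "&field2=" & Uri.EscapeDataString("value2")
        Dim request As HttpWebRequest = CType(WebRequest.Create(fullUrl), HttpWebRequest)
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Hypothetical address standing in for form.aspx.
        Dim url As String = "http://example.com/form.aspx"
        Console.WriteLine(QueryWithWebClient(url))
        Console.WriteLine(QueryWithHttpWebRequest(url))
    End Sub
End Module
```

With two fields the hand-built URL is still readable; with a dozen, the collection is easier to follow.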

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
form.aspx a conventional form

Posted Forms

By now you should have scraped a page and scraped the result of a form passed as a query.

Here we are looking at posting the form data.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

You will see that WebClient now passes the form as a byte-encoded string and, oddly, returns the page in the same form. To post data using HttpWebRequest you still use the same byte array, but you have to open a stream and write the data yourself. This does, however, give you a good picture of how things actually work.

So which is best now? WebClient is still the more concise, but HttpWebRequest is at least consistent and clearer about what is happening. So for posting a form it seems to be a tie.
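
A minimal sketch of the two posting styles follows, assuming the WebClient version is built on UploadData (which takes and returns a byte array). The form fields, URL and encodings are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module PostScrape
    ' Hypothetical form body, encoded as bytes for both approaches.
    Private Function FormBody() As Byte()
        Return Encoding.ASCII.GetBytes("field1=value1&field2=value2")
    End Function

    ' WebClient: bytes go in, and the page comes back as bytes too.
    Function PostWithWebClient(ByVal url As String) As String
        Dim client As New WebClient()
        client.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
        Dim result As Byte() = client.UploadData(url, "POST", FormBody())
        ' Assumes the page is UTF-8 encoded.
        Return Encoding.UTF8.GetString(result)
    End Function

    ' HttpWebRequest: open the request stream and write the bytes yourself.
    Function PostWithHttpWebRequest(ByVal url As String) As String
        Dim body As Byte() = FormBody()
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.Method = "POST"
        request.ContentType = "application/x-www-form-urlencoded"
        request.ContentLength = body.Length

        Dim requestStream As Stream = request.GetRequestStream()
        requestStream.Write(body, 0, body.Length)
        requestStream.Close()

        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Hypothetical address standing in for the posted form.aspx.
        Dim url As String = "http://example.com/form.aspx"
        Console.WriteLine(PostWithWebClient(url))
        Console.WriteLine(PostWithHttpWebRequest(url))
    End Sub
End Module
```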

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
form.aspx a conventional posted form

Passing Headers

Here we are looking at how to pass values in the header. Header values go unnoticed by users but can carry important information such as the browser type.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

Once you have looked at the above code you will see that WebClient and HttpWebRequest deal with headers in the same way. With HttpWebRequest, however, not every header value can be set with this method; some of the more standard headers, such as User-Agent, have their own property. This helps to make things clearer, and it will be a big help when you start to deal with cookies.

When you run the standard header example you will notice that not all header information can be set in code. Values such as Host are preset and cannot be modified.
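
The sketch below shows one way the header handling might look with each class, assuming the .NET Framework. The header name X-Example, the user-agent string and the URL are hypothetical.

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module HeaderScrape
    ' WebClient: every header, User-Agent included, goes through Headers.
    Function ScrapeWithWebClient(ByVal url As String) As String
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", "ExampleScraper/1.0")
        client.Headers.Add("X-Example", "some value")
        ' Assumes the page is UTF-8 encoded.
        Return Encoding.UTF8.GetString(client.DownloadData(url))
    End Function

    ' HttpWebRequest: custom headers go through Headers, but standard ones
    ' such as User-Agent are set through their own property. In the .NET
    ' Framework, adding User-Agent via Headers.Add raises an ArgumentException.
    Function ScrapeWithHttpWebRequest(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.UserAgent = "ExampleScraper/1.0"
        request.Headers.Add("X-Example", "some value")
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Hypothetical address standing in for headreader.aspx.
        Dim url As String = "http://example.com/headreader.aspx"
        Console.WriteLine(ScrapeWithWebClient(url))
        Console.WriteLine(ScrapeWithHttpWebRequest(url))
    End Sub
End Module
```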

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
headreader.aspx standard browser headers

Scraping & passing cookies

Finally we are looking at how to pass values in cookies. One of the most important uses of cookies, and one that can cause trouble when scraping, is session variables.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

Once again you will see that the WebClient method is much simpler than the HttpWebRequest method. WebClient sets the cookie value directly in the header, whereas HttpWebRequest uses a cookie container, which may make things a little clearer but also has more powerful implications.

The best implication of using the cookie container is that if you are going to be scraping multiple sites you can keep all your cookies in the same container; the request then only passes up the cookies for the corresponding domain.
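
To make the container behaviour concrete, here is a small sketch that sends nothing over the network. It sets a cookie on a WebClient header, fills one CookieContainer with cookies for two hypothetical sites, and uses GetCookieHeader to print what would be sent to each. The cookie names, values and host names are made up for illustration.

```vbnet
Imports System
Imports System.Net

Module CookieScrape
    Sub Main()
        ' WebClient: the cookie is written straight into the Cookie header.
        Dim client As New WebClient()
        client.Headers.Add("Cookie", "session=abc123")

        ' HttpWebRequest: cookies live in a container that can be shared
        ' between requests. Each cookie is filed under the site it belongs to.
        Dim jar As New CookieContainer()
        jar.Add(New Uri("http://site-a.example/"), New Cookie("session", "abc123"))
        jar.Add(New Uri("http://site-b.example/"), New Cookie("theme", "dark"))

        Dim request As HttpWebRequest = _
            CType(WebRequest.Create("http://site-a.example/page.aspx"), HttpWebRequest)
        request.CookieContainer = jar

        ' GetCookieHeader returns the Cookie header value the container
        ' would supply for a given address, so the two lines should differ.
        Console.WriteLine(jar.GetCookieHeader(New Uri("http://site-a.example/page.aspx")))
        Console.WriteLine(jar.GetCookieHeader(New Uri("http://site-b.example/page.aspx")))
    End Sub
End Module
```

Assigning the same container to later requests also lets cookies set by a response, such as a session identifier, be passed back on the next request.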

So again it is a close call between the two. Since WebClient is itself built on top of WebRequest, performance is not much of an issue. In conclusion, if you are doing simple scraping WebClient would appear to be the most convenient, but it can become unclear if you are passing lots of values in forms or cookies. HttpWebRequest is a little more long-winded, but through its use of classes it is a little clearer. So the choice is yours.

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
cookie.aspx passing cookies from JScript

Scraping in one line using temporary objects

After finishing the series on screen scraping the thought arose: how small can you make a working scrape? With this in mind the most extreme example was chosen: is a one-line scrape doable?

Surprisingly the answer is yes, and more surprisingly it is still quite clear.

Here is the example code for a conventional scrape and the one-line scrape.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
OneLineScrape.aspx The single line scrape

You will notice in the conventional scrape that variables are declared for the WebClient, stream and stream reader objects, and that each is used only once before it is discarded.

If you look at the one-line example you will see that when an object is created it is no longer assigned to a variable; it is just used. Placing () around the object creation allows the object to be accessed as a temporary object. Since nothing else references the object, it becomes eligible for garbage collection once the statement has executed.
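
The linked OneLineScrape.aspx is not reproduced here, so this is only a sketch of how such a line might be written, using the source page listed under Examples. The module wrapper is hypothetical and only there to make the line runnable.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module OneLineScrape
    Sub Main()
        ' No variables: the WebClient and StreamReader are created inside
        ' parentheses, used once as temporary objects, and then left for
        ' the garbage collector.
        Console.WriteLine((New StreamReader((New WebClient()).OpenRead("http://authors.aspalliance.com/damianm/"))).ReadToEnd())
    End Sub
End Module
```

The trade-off is that nothing closes the stream explicitly, so the connection is only released when the temporary objects are collected. That is acceptable for a quick scrape but worth avoiding in code that runs often.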

Examples
WebClient.aspx the WebClient scrape in action
OneLineScrape.aspx the OneLineScrape scrape in action
http://authors.aspalliance.com/damianm/ the source page
