.NET Screen Scraping in depth
 
Published: 30 Oct 2003
Unedited - Community Contributed
Abstract
Everything you need to know about screen scraping, from simply pulling down a page to more complex issues such as submitting forms and handling cookies. Here you will learn how to use the WebClient and HttpWebRequest classes and which is better suited to each task.
by Damian Manifold

Introduction

There have been articles on ASPAlliance about data scraping before; today we will be looking at the different techniques. The .NET Framework provides the WebRequest class for accessing data over the web, and we will be looking at the two classes most often used for scraping: WebClient, a high-level helper class, and HttpWebRequest, which derives from WebRequest and returns its data through HttpWebResponse.

Both classes are able to do anything you wish to do; it is more a case of which to use for which job.

Here we will cover everything you would want to do with the two classes and see which comes out best.

Simple scraping


Here we are looking at scraping a simple page, where you want to do nothing but get the page back and do not have to pass up any data.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

You will notice that at this level there are only small differences in the use of the classes, and at the moment WebClient seems to have the slight edge, its code being slightly simpler.
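The two approaches can be sketched as follows (a minimal VB.NET fragment, assumed to run inside a Page_Load handler; the URL is the source page from the examples list, ASCII encoding is an assumption, and error handling is omitted):

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

' WebClient: DownloadData fetches the page as a byte array in one call.
Dim objClient As New WebClient()
Dim strHtml As String = Encoding.ASCII.GetString( _
    objClient.DownloadData("http://authors.aspalliance.com/damianm/"))

' HttpWebRequest: create the request, get the response, read its stream.
Dim objRequest As HttpWebRequest = CType( _
    WebRequest.Create("http://authors.aspalliance.com/damianm/"), HttpWebRequest)
Dim objResponse As HttpWebResponse = CType(objRequest.GetResponse(), HttpWebResponse)
Dim objReader As New StreamReader(objResponse.GetResponseStream())
strHtml = objReader.ReadToEnd()
objReader.Close()
objResponse.Close()
```

Either way, `strHtml` ends up holding the raw HTML of the scraped page, ready to be written into the holding page.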

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
http://authors.aspalliance.com/damianm the source page

Forms


You have seen how simple it is to scrape a page using either WebClient or HttpWebRequest; now we will look at how to pass form data to the page you wish to scrape.

Here we are looking at passing the form data as a query string.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

You will notice that the difference between the two classes is now becoming apparent. So which is better, Client or Request? Since Request passes the data on the URL it is much simpler, but as you can see from the example, if you have many fields things may become a little unclear. Client, however, is much more structured in how it passes the data, and if you wanted you could always pass the data on the URL here too. So at the moment Client is looking like the better solution.
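The contrast can be sketched like this (the form URL and field names are hypothetical stand-ins for the linked form.aspx):

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

' WebClient: fields go into a structured QueryString collection.
Dim objClient As New WebClient()
objClient.QueryString.Add("name", "Damian")
objClient.QueryString.Add("email", "damian@example.com")
Dim strHtml As String = Encoding.ASCII.GetString( _
    objClient.DownloadData("http://www.example.com/form.aspx"))

' HttpWebRequest: the fields are simply appended to the URL by hand.
Dim objRequest As HttpWebRequest = CType(WebRequest.Create( _
    "http://www.example.com/form.aspx?name=Damian&email=damian@example.com"), _
    HttpWebRequest)
Dim strHtml2 As String = New StreamReader( _
    objRequest.GetResponse().GetResponseStream()).ReadToEnd()
```

With two fields the hand-built URL is manageable; with a dozen it quickly becomes the "little unclear" string described above.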

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
form.aspx a conventional form

Posted Forms


By now you should have scraped a page and scraped the result of a passed form.

Here we are looking at posting the form data.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

You will now see that Client passes the form as a byte-encoded string and, oddly, returns the page in the same form. To post data using Request you use the same byte-encoded array, but you have to manually open a stream and write the data yourself. This does, however, give you a good picture of how things actually work.

So which is best now? Client is still the most concise, but Request is at least consistent and clearer about what is happening. So for posting a form it seems to be a tie.
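A sketch of both post styles (URL and field names again hypothetical; both send the standard URL-encoded form body):

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

' Both classes post the form as a byte-encoded, URL-encoded string.
Dim bytPost As Byte() = Encoding.ASCII.GetBytes("name=Damian&email=damian@example.com")

' WebClient: UploadData sends the bytes and returns the page as bytes too.
Dim objClient As New WebClient()
objClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
Dim strHtml As String = Encoding.ASCII.GetString( _
    objClient.UploadData("http://www.example.com/form.aspx", "POST", bytPost))

' HttpWebRequest: open the request stream and write the bytes yourself.
Dim objRequest As HttpWebRequest = CType( _
    WebRequest.Create("http://www.example.com/form.aspx"), HttpWebRequest)
objRequest.Method = "POST"
objRequest.ContentType = "application/x-www-form-urlencoded"
objRequest.ContentLength = bytPost.Length
Dim objStream As Stream = objRequest.GetRequestStream()
objStream.Write(bytPost, 0, bytPost.Length)
objStream.Close()
Dim strHtml2 As String = New StreamReader( _
    objRequest.GetResponse().GetResponseStream()).ReadToEnd()
```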

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
form.aspx a conventional posted form

Passing Headers

Here we are looking at how to pass values in the header. Header values go unnoticed by users but can carry important information such as the browser type.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

Once you have looked at the above code you will see that Client and Request both deal with headers in the same way. However, with Request not all header values can be set with this method; some of the more standard header values, such as User-Agent, have their own property. This helps make things clearer, and when you start to deal with cookies it will be a big help.

When you run the standard header example you will notice that not all header information can be set in code. Values such as Host are preset and cannot be modified.
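A short sketch of the two styles (the custom header name and values are hypothetical):

```vbnet
Imports System.Net

' WebClient and HttpWebRequest both take arbitrary headers the same way.
Dim objClient As New WebClient()
objClient.Headers.Add("Referer", "http://www.example.com/")

Dim objRequest As HttpWebRequest = CType( _
    WebRequest.Create("http://www.example.com/headreader.aspx"), HttpWebRequest)
objRequest.Headers.Add("X-Custom-Header", "some value")

' Restricted headers such as User-Agent have their own property on
' HttpWebRequest; adding them through the Headers collection throws.
objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0)"
' Host is one of the preset values: the framework fills it in from the URL.
```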

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
headreader.aspx standard browser headers

Scraping & passing cookies


Finally, we are looking at how to pass values in cookies. One of the most important uses of cookies, and one that can cause trouble when scraping, is session variables.

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
httpWebRequest.aspx A page to hold the scraped data
httpWebRequest.aspx.vb Code behind to scrape the data using httpWebRequest

Once again you will see that Client is much simpler than Request. It sets the cookie value directly in the header, whereas Request uses a cookie container, which may make things a little clearer but also has more powerful implications.

The best implication of using the cookie container is that if you are going to scrape multiple sites you can keep all your cookies in the same container; Request will then only pass up the cookies with the corresponding domain.
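The two cookie styles can be sketched as follows (the domain, path, and session value are hypothetical):

```vbnet
Imports System.Net

' WebClient: the cookie is written straight into the request header.
Dim objClient As New WebClient()
objClient.Headers.Add("Cookie", "ASP.NET_SessionId=abc123")

' HttpWebRequest: cookies live in a CookieContainer, keyed by domain.
Dim objCookies As New CookieContainer()
objCookies.Add(New Cookie("ASP.NET_SessionId", "abc123", "/", "www.example.com"))

Dim objRequest As HttpWebRequest = CType( _
    WebRequest.Create("http://www.example.com/cookie.aspx"), HttpWebRequest)
objRequest.CookieContainer = objCookies
' Only cookies whose domain matches this request are sent, so one
' container can safely hold the cookies for several scraped sites.
```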

So again it is a close call between the two. Since both ultimately sit on the same underlying plumbing, performance is not much of an issue. In conclusion: if you are doing simple scraping, WebClient would appear to be the most convenient, but it can become unclear if you are passing lots of values in forms or cookies; HttpWebRequest is a little more long-winded, but its use of classes is a little clearer. So the choice is yours.

Examples
WebClient.aspx the WebClient scrape in action
httpWebrequest.aspx the httpWebrequest scrape in action
cookie.aspx passing cookies from JavaScript

Scraping in one line using temporary objects


After finishing the series on screen scraping, the thought arose: how small can you make a working scrape? With this in mind the most extreme example was chosen: is a one-line scrape doable?

Surprisingly the answer is yes, and more surprisingly it is still quite clear.

Here is the example code for a conventional scrape and the one line scrape

Source Code
WebClient.aspx A page to hold the scraped data
WebClient.aspx.vb Code behind to scrape the data using WebClient
OneLineScrape.aspx The single line scrape

You will notice in the conventional scrape that variables are declared for the WebClient, stream, and stream reader objects, and that each is used only once before being disposed of.

If you look at the one-line example you will see that when an object is created it is no longer assigned to a variable; it is just used. Placing parentheses around the object creation allows the object to be accessed as a temporary object; since there is no other reference to the object, it will be disposed of after the statement executes.
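The contrast can be sketched like this (the URL is the source page from the examples list; error handling omitted):

```vbnet
Imports System.IO
Imports System.Net

' Conventional scrape: each object is named, used once, then discarded.
Dim objClient As New WebClient()
Dim objStream As Stream = objClient.OpenRead("http://authors.aspalliance.com/damianm/")
Dim objReader As New StreamReader(objStream)
Dim strHtml As String = objReader.ReadToEnd()
objReader.Close()

' One-line scrape: parentheses around New give temporary, unnamed objects.
Dim strHtml2 As String = New StreamReader( _
    (New WebClient()).OpenRead("http://authors.aspalliance.com/damianm/")).ReadToEnd()
```

The one-liner trades explicit clean-up for brevity: the temporary objects are released by the garbage collector rather than closed explicitly.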

Examples
WebClient.aspx the WebClient scrape in action
OneLineScrape.aspx the OneLineScrape scrape in action
http://authors.aspalliance.com/damianm/ the source page

