ASPAlliance: Articles, reviews, and samples for .NET Developers
URL:
http://aspalliance.com/articleViewer.aspx?aId=233&pId=-1
CodeSnip: Screen Scraping Lists
by Damian Manifold

ASPAlliance has already published articles on data scraping that cover returning an entire page or a particular element. Building on those articles, we will use grouping constructs to easily retrieve a list of headlines from Guardian Unlimited.

Source Code
News.ascx: News Control
News.ascx.vb: Code behind to scrape the data
Example.aspx: An Example

You will find that the function that scrapes the HTML is functionally the same as in the other articles, but with the addition of a try-catch statement, as you can never be too cautious when using resources outside of your control.
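The getHTML function itself is not listed in this article, so the following is only a minimal sketch of what such a guarded download might look like. The name GetHtmlSafe and the choice of System.Net.WebClient are assumptions made for illustration, not the code from News.ascx.vb; note that WebClient is flagged as obsolete on recent .NET versions, although it still works.

```vb
Imports System
Imports System.Net

Module HtmlFetcher
    ' Hypothetical stand-in for the article's getHTML helper.
    ' Returns the page body, or an empty string if anything goes wrong.
    Public Function GetHtmlSafe(ByVal url As String) As String
        Try
            Using client As New WebClient()
                Return client.DownloadString(url)
            End Using
        Catch ex As Exception
            ' The remote site is outside our control, so any failure
            ' (bad address, timeout, server error) is treated as "no data".
            Return String.Empty
        End Try
    End Function

    Sub Main()
        ' "nowhere.invalid" is a placeholder host that is not expected to resolve.
        Dim html As String = GetHtmlSafe("http://nowhere.invalid/")
        Console.WriteLine("Characters downloaded: " & html.Length)
    End Sub
End Module
```

Returning an empty string means the regular expression simply finds no matches when the download fails, so the calling code does not need its own error handling.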

The getNews function is where things differ. This is where all the work takes place and is what we will be concentrating on here. It is surprisingly small considering how large it would be if you did not use grouping constructs.

  27  private function getNews() as System.Data.DataTable
  28      Dim rowNewsItem as System.Data.DataRow
  29  
  30      'create the table to be returned
  31      getNews = new System.Data.DataTable()
  32      getNews.Columns.Add("strURL")
  33      getNews.Columns.Add("strHeadline")
  34      getNews.Columns.Add("strSummary")
  35  
  36      'set up the regular expression for the news page
  37      Dim strRegex as string
  38      strRegex = _
  "<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"
  39      Dim Regex as System.Text.RegularExpressions.Regex
  40      Regex = new System.Text.RegularExpressions.Regex(strRegex, _
       System.Text.RegularExpressions.RegexOptions.Compiled)
  41  
  42      'scrape the data
  43      Dim Matches as System.Text.RegularExpressions.MatchCollection = _
       Regex.Matches(getHTML("http://www.guardian.co.uk/syndication/service/0,11065,331-0-5,00.html"))
  44      Dim Match as System.Text.RegularExpressions.Match
  45  
  46      'loop through all matches filling out the table as you go
  47      for Each Match in Matches
  48          rowNewsItem = getNews.NewRow()
  49          rowNewsItem("strURL") = Match.Groups("strURL").Value
  50          rowNewsItem("strHeadline") = Match.Groups("strHeadline").Value
  51          rowNewsItem("strSummary") = Match.Groups("strSummary").Value
  52          getNews.Rows.Add(rowNewsItem)
  53      Next
  54  End function
(Line continuation characters above are for display only.)

Lines 31-34 deal with the creation of a DataTable. DataTables are a useful feature of .NET: they allow you to easily pass data between functions without losing any clarity.

Take a look at line 38; you will see the strRegex String.  Now that is a monster of an expression but one hell of a powerful one.

Let us take a closer look at the construction of strRegex. The String can be broken down into four basic parts.

Literals


 <A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<
 

These will be exactly matched against the string, helping to locate the text that you are interested in.

Character Sets


 <A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<
 

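The contents of the square brackets ([]) are the set of characters that you wish to match. A leading ^ negates the set, so [^'] matches any character except a single quote. In the above example, \s represents any white space character, \w represents any word character, and \W represents any non-word character; taken together, [\s\w\W] matches any character at all, including line breaks.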
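To see these sets in isolation, here is a small sketch. The sample strings and the helper names FirstQuoted and SpansLineBreak are made up for illustration; only the character sets themselves come from the expression above.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module CharacterSetDemo
    ' Returns the first run of text enclosed in single quotes, quotes included.
    ' [^']+ means "one or more characters that are not a single quote".
    Public Function FirstQuoted(ByVal text As String) As String
        Return Regex.Match(text, "'[^']+'").Value
    End Function

    ' Tests a pattern against the three characters "a", line feed, "b".
    Public Function SpansLineBreak(ByVal pattern As String) As Boolean
        Return Regex.IsMatch("a" & vbLf & "b", pattern)
    End Function

    Sub Main()
        Console.WriteLine(FirstQuoted("<A HREF='/news/1.html'>"))   ' prints '/news/1.html'

        ' By default the dot does not match a line feed, but [\s\w\W] does.
        Console.WriteLine(SpansLineBreak("a.b"))          ' prints False
        Console.WriteLine(SpansLineBreak("a[\s\w\W]b"))   ' prints True
    End Sub
End Module
```

This is why the expression uses [\s\w\W] rather than a plain dot between the link and the summary: the HTML between them may well contain line breaks.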

Quantifiers


 <A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<
 

The *? indicates that the preceding element is repeated zero or more times; it differs from the * on its own in that the * will try to match the longest possible string, whereas the *? matches the shortest. (The + used inside the groups means one or more times.) What is the difference, you may ask? Well, let us imagine you want to get the first cell in a table. "<TD>[\s\w\W]*</TD>" looks like it should work, but it would match from the <TD> of the first cell to the </TD> of the last. With *? in place of *, the match stops at the first </TD>.
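The table-cell case can be tried out with a short sketch like the one below. The two-cell sample row and the function names GreedyCell and LazyCell are assumptions for illustration only.

```vb
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Greedy: * grabs as much as it can, so the match runs to the LAST </TD>.
    Public Function GreedyCell(ByVal row As String) As String
        Return Regex.Match(row, "<TD>[\s\w\W]*</TD>").Value
    End Function

    ' Lazy: *? grabs as little as it can, so the match ends at the FIRST </TD>.
    Public Function LazyCell(ByVal row As String) As String
        Return Regex.Match(row, "<TD>[\s\w\W]*?</TD>").Value
    End Function

    Sub Main()
        Dim row As String = "<TD>first</TD><TD>second</TD>"
        Console.WriteLine(GreedyCell(row))   ' prints <TD>first</TD><TD>second</TD>
        Console.WriteLine(LazyCell(row))     ' prints <TD>first</TD>
    End Sub
End Module
```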

Grouping Constructs


 <A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<
 

Now this is where the magic happens. This enables us to extract the information we are after without all that messing about with IndexOf or Substring and all the validation that goes along with it. "(?<strURL>[^']+)" takes the value from the current position up to, but not including, the ' and assigns it to the strURL group, meaning you can now refer to the data by name.
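As a quick illustration of reading groups by name, the sketch below runs the article's expression against a hand-written fragment. The sample HTML and the helper name FirstHeadline are invented for this example; the real feed markup may differ.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NamedGroupDemo
    ' The same expression as line 38 of getNews.
    Private Const NewsPattern As String = _
        "<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"

    ' Returns the first news item found in the supplied HTML.
    Public Function FirstHeadline(ByVal html As String) As Match
        Return Regex.Match(html, NewsPattern)
    End Function

    Sub Main()
        ' Made-up fragment shaped like the markup the expression expects.
        Dim sample As String = _
            "<A HREF='/story/1.html'>Headline one</A>" & vbLf & "<BR>Summary one<P>"

        Dim item As Match = FirstHeadline(sample)
        If item.Success Then
            ' Each captured value is retrieved by the name given in (?<name>...).
            Console.WriteLine(item.Groups("strURL").Value)        ' prints /story/1.html
            Console.WriteLine(item.Groups("strHeadline").Value)   ' prints Headline one
            Console.WriteLine(item.Groups("strSummary").Value)    ' prints Summary one
        End If
    End Sub
End Module
```

Checking Success first avoids reading empty group values when the page layout has changed and the expression no longer matches.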

Once you have matched the data, all that is left to do is loop through the collection and fill out the table.

So there you have it, a function to retrieve the news as a DataTable.
As an example, the DataList has been bound to a custom control, making a nice little control that can add content to any page.

Examples
Example.aspx: The News Control in Action

Articles
.NET Screen Scraping in depth by Damian Manifold
Easy .NET Screen Scraping by Steven Smith
ASP.NET Data Scraping by G. Andrew Duthie
Regular Expressions Quickstart by Chris Garrett

Links
Guardian Unlimited: More News Feeds That Can Be Scraped
Regular Expression Library: More on Regular Expressions
ASPFriends: Discuss Regular Expressions



©Copyright 1998-2024 ASPAlliance.com