CodeSnip: Screen Scraping Lists: ASP Alliance

Published: 28 Oct 2003

Unedited - Community Contributed

Abstract
Ever wanted to add a news feed to your site? Here is an example of how to do just that, by using simple screen scraping methods and a regular expression to extract news simply and easily.

by Damian Manifold

Earlier ASPAlliance articles on data scraping covered returning an entire page or a particular element. Building on those articles, we will use regular expression grouping constructs to retrieve a list of headlines from Guardian Unlimited.

Source Code

News.ascx	News Control
News.ascx.vb	Code behind to scrape the data
Example.aspx	An Example

The function that scrapes the HTML is functionally the same as in the other articles, with the addition of a try-catch statement; you can never be too cautious when using resources outside of your control.
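The article's own getHTML lives in News.ascx.vb and is not reproduced here. As a rough sketch of the idea only, a guarded fetch might look like the following. The use of System.Net.WebClient, the name GetHtml, and the choice to return an empty string on failure are all assumptions for illustration, not the author's code.

```vbnet
Imports System
Imports System.Net

Module GetHtmlSketch
    ' Hypothetical stand-in for the article's getHTML function.
    ' Returns the page body, or an empty string if anything goes wrong.
    Public Function GetHtml(ByVal strUrl As String) As String
        Try
            Using client As New WebClient()
                Return client.DownloadString(strUrl)
            End Using
        Catch ex As Exception
            ' The remote site is outside our control, so swallow the failure
            ' and let the caller treat "no HTML" as "no news".
            Return String.Empty
        End Try
    End Function

    Sub Main()
        ' A null address makes DownloadString throw, which the Try block absorbs.
        Console.WriteLine(GetHtml(Nothing).Length)   ' 0
    End Sub
End Module
```

Returning an empty string means the regular expression simply finds no matches when the remote site is down, so the control renders an empty list rather than an error page.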

The getNews function is where things differ. This is where all the work takes place and is what we will be concentrating on here. It is surprisingly small considering how large it would be if you did not use grouping constructs.

27	Private Function getNews() As System.Data.DataTable
28	    Dim rowNewsItem As System.Data.DataRow
29
30	    'create the table to be returned
31	    getNews = New System.Data.DataTable()
32	    getNews.Columns.Add("strURL")
33	    getNews.Columns.Add("strHeadline")
34	    getNews.Columns.Add("strSummary")
35
36	    'set up the regular expression for the news page
37	    Dim strRegex As String
38	    strRegex = "<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"
39	    Dim Regex As System.Text.RegularExpressions.Regex
40	    Regex = New System.Text.RegularExpressions.Regex(strRegex, System.Text.RegularExpressions.RegexOptions.Compiled)
41
42	    'scrape the data
43	    Dim Matches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(getHTML("http://www.guardian.co.uk/syndication/service/0,11065,331-0-5,00.html"))
44	    Dim Match As System.Text.RegularExpressions.Match
45
46	    'loop through all matches filling out the table as you go
47	    For Each Match In Matches
48	        rowNewsItem = getNews.NewRow()
49	        rowNewsItem("strURL") = Match.Groups("strURL").Value
50	        rowNewsItem("strHeadline") = Match.Groups("strHeadline").Value
51	        rowNewsItem("strSummary") = Match.Groups("strSummary").Value
52	        getNews.Rows.Add(rowNewsItem)
53	    Next
54	End Function

Lines 31-34 deal with the creation of a DataTable. DataTables are a useful feature of .NET; they allow you to easily pass data between functions without losing any clarity.

Take a look at line 38; you will see the strRegex String. Now that is a monster of an expression but one hell of a powerful one.

Let us take a closer look at the construction of strRegex. The String can be broken down into four basic parts.

Literals

<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

The literals here are <A HREF=' and the closing ', then >, </A>, <BR> and the final <. These will be exactly matched against the string, helping to locate the text that you are interested in.

Character Sets

<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

The contents of the square brackets ([]) are the set of characters that you wish to match. In the above example, \s represents any white space character, \w represents any word character, and \W represents any non-word character. A leading ^ negates the set, so [^'] matches any character except a single quote and [^<] matches any character except <.
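To see what each character set accepts, a small sketch like the one below counts how many single characters of a sample string a set matches. The helper name CountMatches and the sample strings are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterSetSketch
    ' Hypothetical helper: how many times does the pattern match in the input?
    Public Function CountMatches(ByVal strInput As String, ByVal strPattern As String) As Integer
        Return Regex.Matches(strInput, strPattern).Count
    End Function

    Sub Main()
        Console.WriteLine(CountMatches("a b", "[\s]"))       ' 1: the space
        Console.WriteLine(CountMatches("a-b", "[\w]"))       ' 2: a and b
        Console.WriteLine(CountMatches("a-b", "[\W]"))       ' 1: the hyphen
        Console.WriteLine(CountMatches("a-b", "[\s\w\W]"))   ' 3: every character
        Console.WriteLine(CountMatches("it's", "[^']"))      ' 3: everything but the quote
    End Sub
End Module
```

Because \w and \W are opposites, the set [\s\w\W] accepts any character at all, including line breaks, which is why the expression uses it to skip over whatever sits between </A> and <BR>.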

Quantifiers

<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

The *? indicates that the preceding element is repeated zero or more times; it differs from * on its own in that * will try to match the longest possible string, whereas *? matches the shortest. The + used inside the groups means one or more. What is the difference between * and *?, you may ask? Well, let us imagine you want to get the first cell in a table. "<TD>[\s\w\W]*</TD>" looks like it should work, but what it would match runs from the <TD> of the first cell to the </TD> of the last.
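The table cell case can be tried directly. The sketch below runs both forms of the pattern over a made-up two-cell row; the helper name FirstMatch and the sample row are illustrative only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierSketch
    ' Hypothetical helper: return the text of the first match, or "" if none.
    Public Function FirstMatch(ByVal strInput As String, ByVal strPattern As String) As String
        Return Regex.Match(strInput, strPattern).Value
    End Function

    Sub Main()
        Dim strRow As String = "<TD>one</TD><TD>two</TD>"

        ' Greedy: runs on to the last </TD> in the string.
        Console.WriteLine(FirstMatch(strRow, "<TD>[\s\w\W]*</TD>"))    ' <TD>one</TD><TD>two</TD>

        ' Lazy: stops at the first </TD> it can.
        Console.WriteLine(FirstMatch(strRow, "<TD>[\s\w\W]*?</TD>"))   ' <TD>one</TD>
    End Sub
End Module
```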

Grouping Constructs

<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<

Now this is where the magic happens. Grouping constructs enable us to extract the information we are after without all that messing about with IndexOf or Substring and all the validation that goes along with it. "(?<strURL>[^']+)" takes the value from the current position up to but not including the ' and assigns it to the strURL group, meaning you can now refer to the data by name.
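As a minimal illustration of a named group on its own, the sketch below pulls just the URL out of a single anchor tag. The function name ExtractUrl and the example.com markup are invented for the example; only the strURL group comes from the article's expression.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupingSketch
    ' Hypothetical helper: return the HREF value of the first anchor, or "" if none.
    Public Function ExtractUrl(ByVal strHtml As String) As String
        Dim m As Match = Regex.Match(strHtml, "<A HREF='(?<strURL>[^']+)'")
        If m.Success Then Return m.Groups("strURL").Value
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(ExtractUrl("<A HREF='http://example.com/story'>A headline</A>"))
        ' http://example.com/story
    End Sub
End Module
```

The only validation left is the Success check; there are no start and end offsets to compute or bounds to guard.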

Once you have matched the data, all that is left to do is loop through the collection and fill out the table.
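One way to try the matching and the loop without touching the network is to split the parsing into its own function that takes the HTML as a string. The sketch below does that against made-up markup. The name ParseNews, the sample HTML, and the split itself are assumptions for illustration; the article's getNews fetches and parses in one step.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module ParseNewsSketch
    ' Same expression as the article's strRegex.
    Private Const NewsPattern As String =
        "<A HREF='(?<strURL>[^']+)'[\s]*?>(?<strHeadline>[^<]+)</A>[\s\w\W]*?<BR>(?<strSummary>[^<]+)<"

    ' Hypothetical variant of getNews that parses HTML passed in by the caller.
    Public Function ParseNews(ByVal strHtml As String) As DataTable
        Dim tblNews As New DataTable()
        tblNews.Columns.Add("strURL")
        tblNews.Columns.Add("strHeadline")
        tblNews.Columns.Add("strSummary")

        ' One row per match, with the columns filled from the named groups.
        For Each m As Match In Regex.Matches(strHtml, NewsPattern)
            tblNews.Rows.Add(m.Groups("strURL").Value,
                             m.Groups("strHeadline").Value,
                             m.Groups("strSummary").Value)
        Next
        Return tblNews
    End Function

    Sub Main()
        ' Invented sample markup in the shape the expression expects.
        Dim strSample As String =
            "<A HREF='http://example.com/1'>First headline</A>" & vbCrLf &
            "<BR>First summary<P>" & vbCrLf &
            "<A HREF='http://example.com/2'>Second headline</A><BR>Second summary<P>"

        For Each row As DataRow In ParseNews(strSample).Rows
            Console.WriteLine("{0} | {1} | {2}", row("strURL"), row("strHeadline"), row("strSummary"))
        Next
    End Sub
End Module
```

Keeping the parser separate from the fetch makes it easy to check the expression against saved copies of the page when the remote markup changes.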

So there you have it, a function to retrieve the news as a DataTable.
As an example, the table has been bound to a DataList in a custom control, making a nice little control that can add content to any page.

Examples

Example.aspx

The News Control in Action

Articles

.NET Screen Scraping in depth	by Damian Manifold
Easy .NET Screen Scraping	by Steven Smith
ASP.NET Data Scraping	by G. Andrew Duthie
Regular Expressions Quickstart	by Chris Garrett

Links

Guardian Unlimited	More News Feeds That Can Be Scraped
Regular Expression Library	More on Regular Expressions
ASPFriends	Discuss Regular Expressions
