Previous articles on ASPAlliance have covered data scraping, returning either an entire page or a particular element. Building on those articles, we will use regular expression grouping constructs to retrieve a list of headlines from Guardian Unlimited.
The function that scrapes the HTML is functionally the same as in the other articles, with the addition of a try-catch statement, since you can never be too cautious when using resources outside of your control.
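As a rough illustration of wrapping the scrape in a try-catch, here is a minimal sketch. The name ScrapeHtml and the use of System.Net.WebClient are assumptions made for this example; the scraping function from the earlier articles may well look different.

```vbnet
Imports System
Imports System.Net

Module ScrapeSketch
    ' Hypothetical helper (not the original article's function):
    ' returns the page HTML, or an empty string if anything goes wrong.
    Function ScrapeHtml(ByVal url As String) As String
        Try
            Using client As New WebClient()
                Return client.DownloadString(url)
            End Using
        Catch ex As Exception
            ' The remote site is outside our control, so any failure
            ' (bad address, timeout, server error) is treated as "no data".
            Return String.Empty
        End Try
    End Function
End Module
```

Returning an empty string keeps the caller simple: a failed request produces no matches, and therefore an empty table, rather than an unhandled exception.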
The getNews function is where things differ. This is where all the work takes place, and it is what we will be concentrating on here. It is surprisingly small compared with the size it would be if you did not use grouping constructs.
    private function getNews() as System.Data.DataTable
        Dim rowNewsItem as System.Data.DataRow
        'create the table to be returned
        getNews = new System.Data.DataTable()
        'set up the regular expression for the news page
        Dim strRegex as string
        strRegex = _
        Dim Regex as System.Text.RegularExpressions.Regex
        Regex = new System.Text.RegularExpressions.Regex(strRegex, _
        'scrape the data
        Dim Matches as System.Text.RegularExpressions.MatchCollection = _
        Dim Match as System.Text.RegularExpressions.Match
        'loop through all matches filling out the table as you go
        for Each Match in Matches
            rowNewsItem = getNews.NewRow()
            rowNewsItem("strURL") = Match.Groups("strURL").Value
            rowNewsItem("strHeadline") = Match.Groups("strHeadline").Value
            rowNewsItem("strSummary") = Match.Groups("strSummary").Value
            'add the completed row to the table
            getNews.Rows.Add(rowNewsItem)
        Next
    End function
(Line Continuation Characters Above are for Display Only)
Lines 31-34 deal with the creation of a DataTable. DataTables are a useful feature of .NET: they allow you to pass data between functions easily, without losing any clarity.
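For reference, setting up such a table might look like the sketch below. The column names are taken from the row assignments in getNews; the helper name CreateNewsTable, the table name "News" and the String column type are assumptions for illustration.

```vbnet
Imports System
Imports System.Data

Module DataTableSketch
    ' Hypothetical helper: builds an empty table with the three
    ' columns that the getNews loop fills in.
    Function CreateNewsTable() As DataTable
        Dim table As New DataTable("News")
        table.Columns.Add("strURL", GetType(String))
        table.Columns.Add("strHeadline", GetType(String))
        table.Columns.Add("strSummary", GetType(String))
        Return table
    End Function

    Sub Main()
        Dim table As DataTable = CreateNewsTable()
        Console.WriteLine(table.Columns.Count) ' prints 3
    End Sub
End Module
```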
Take a look at line 38; you will see the strRegex String. Now that is a monster of an expression but one hell of a powerful one.
Let us take a closer look at the construction of strRegex. The String can be broken down into four basic parts.
The first part is the literal text. These characters will be matched exactly against the string, helping to locate the text that you are interested in.
The contents of the square brackets ([]) are the set of characters that you wish to discard. In the above example, \s represents any white-space character, \w represents any word character, and \W represents any non-word character.
The *? indicates that the preceding character or set is repeated zero or more times, but as few times as possible; this differs from the * on its own in that the * will try to match the longest possible string, whereas the *? matches the shortest. What is the difference, you may ask? Well, let us imagine you want to get the first cell in a table. "<TD>[\s\w\W]*</TD>" looks like it should work, but what it would actually match runs from the <TD> of the first cell to the </TD> of the last. With "<TD>[\s\w\W]*?</TD>" the match stops at the first </TD>.
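The difference between the two quantifiers is easy to see on a small input. The sketch below uses a made-up two-cell string and a hypothetical FirstMatch helper; it is only meant to illustrate the behaviour described above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyQuantifierSketch
    ' Hypothetical helper: returns the text of the first match, or "" if none.
    Function FirstMatch(ByVal pattern As String, ByVal input As String) As String
        Return Regex.Match(input, pattern).Value
    End Function

    Sub Main()
        Dim html As String = "<TD>one</TD><TD>two</TD>"

        ' Greedy: runs on to the last </TD> in the input.
        Console.WriteLine(FirstMatch("<TD>[\s\w\W]*</TD>", html))
        ' prints <TD>one</TD><TD>two</TD>

        ' Lazy: stops at the first </TD>.
        Console.WriteLine(FirstMatch("<TD>[\s\w\W]*?</TD>", html))
        ' prints <TD>one</TD>
    End Sub
End Module
```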
Now this is where the magic happens. Grouping constructs enable us to extract the information we are after without all that messing about with IndexOf or Substring and all the validation that goes along with it. "(?<strURL>[^']+)" takes the value from the current position up to, but not including, the next ' and assigns it to the strURL group, meaning you can now refer to the data by name.
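A small sketch of a named group in isolation may help. The anchor markup and the ExtractUrl helper are invented for this example and are not part of the article's strRegex; only the (?<name>...) syntax and the Groups("name") lookup are the point.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Hypothetical helper: pulls the href value out of a single-quoted link.
    Function ExtractUrl(ByVal html As String) As String
        ' (?<strURL>[^']+) captures everything up to the closing quote
        ' and stores it under the name strURL.
        Dim m As Match = Regex.Match(html, "<a href='(?<strURL>[^']+)'")
        If m.Success Then Return m.Groups("strURL").Value
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(ExtractUrl("<a href='/news/story1.html'>Headline</a>"))
        ' prints /news/story1.html
    End Sub
End Module
```

Because the group is looked up by name, the pattern can be rearranged or extended with further groups without having to renumber anything in the calling code.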
Once you have matched the data, all that is left to do is loop through the collection and fill out the table.
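Putting the pieces together, the loop might be sketched as follows. The pattern and sample HTML here are deliberately simplified stand-ins, not the article's strRegex, and BuildNewsTable is a hypothetical name used only for this example.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module FillTableSketch
    ' Hypothetical helper: one row per match, one column per named group.
    Function BuildNewsTable(ByVal html As String) As DataTable
        Dim table As New DataTable()
        table.Columns.Add("strURL", GetType(String))
        table.Columns.Add("strHeadline", GetType(String))

        ' Illustrative pattern only; the real news page needs a longer one.
        Dim pattern As String = "<a href='(?<strURL>[^']+)'>(?<strHeadline>[^<]+)</a>"

        For Each m As Match In Regex.Matches(html, pattern)
            Dim row As DataRow = table.NewRow()
            row("strURL") = m.Groups("strURL").Value
            row("strHeadline") = m.Groups("strHeadline").Value
            table.Rows.Add(row)
        Next
        Return table
    End Function

    Sub Main()
        Dim html As String = "<a href='/a.html'>First</a> <a href='/b.html'>Second</a>"
        Dim table As DataTable = BuildNewsTable(html)
        Console.WriteLine(table.Rows.Count)              ' prints 2
        Console.WriteLine(table.Rows(1)("strHeadline"))  ' prints Second
    End Sub
End Module
```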
So there you have it, a function to retrieve the news as a DataTable.
As an example, the DataList has been bound to a custom control, making a nice little control that can add content to any page.