Microsoft's guidance is to always use the managed HttpWebRequest and HttpWebResponse classes rather than the unmanaged WinInet.dll. As I understand it, WinInet.dll could stop being usable from managed code, let alone from an ASPX page, in any upcoming service pack, and it is expected to be gone with Longhorn except for legacy applications. I agreed fully with this guidance for network programming until I had an unusual experience with one large company's server.
We were given permission by this company to hack their website so that an agent of theirs (who was a client of ours) could separate sales by dealer. The company has a contract with the agent, who has 100 dealers. The website handles sales for a single dealer but cannot pull an agent's dealers together while still keeping each dealer's sales separate.
This is the capability we needed to add. Because a sale is a very complex relationship with the company, the agent must track each sale as it moves through the website, which checks credit and many other things before the sale can actually occur. There are also chargebacks involved.
These and other reasons make dependence on the company website crucial. A good solution would let the agent win dealers away from other agents, who cannot pay commissions as promptly as an agent with immediate feedback from the company website. In fact, the agent almost tripled their number of dealers almost immediately.
So again, always use HttpWebRequest, not WinInet.dll, and for another reason as well: Microsoft does not support the use of WinInet.dll from an ASPX page, for many reasons having to do with very different platforms. In the past, while trying to automate creation of an Outlook task from an ASPX page, I found their warnings about unsupported scenarios to be true, but WinInet.dll does seem to work. What happened is that I ran into a situation that required WinInet rather than the managed alternative. I'll show you how the HttpWebRequest attempt got into trouble, and you might also learn something about programming the Windows API from .NET.
The first step was to change one line in the source of the HTML pages (which contained a lot of JavaScript) that requested the pages we wanted to screen scrape with regular expressions (more on this later). We changed each page's form tag to submit to a simple submit.aspx file with no interface (no controls) on our server instead of the usual server, so that we could list the details of the request and use them to fill our HttpWebRequest object.
In Submit.aspx's code-behind, we used code like the following to see what their website was sending to their server to reach the valuable screens. Note that we also had to examine the cookies that the big company's server put on our computers when we used their website, and we had to learn how to install a client certificate so that it could be used by the ASP.NET account configured in the machine.config file.
<form name="myform" method="post" action="/mysubfolder/myjavascript.jsp">
becomes
<form name="myform" method="post" action="submit.aspx">
Imports System.Collections.Specialized
Imports System.Diagnostics
Public Class submit : Inherits System.Web.UI.Page
Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
Dim loop1 as integer
Dim loop2 As Integer
Dim arr1() as String
Dim arr2() As String
Dim coll As NameValueCollection
' Load ServerVariable collection into NameValueCollection object.
coll = Request.ServerVariables
' Get names of all keys into a string array.
arr1 = coll.AllKeys
For loop1 = 0 To arr1.GetUpperBound(0)
Response.Write("Key: " & arr1(loop1) & "<br>")
' Get all values under this key.
arr2 = coll.GetValues(loop1)
For loop2 = 0 To arr2.GetUpperBound(0)
Response.Write("Value " & CStr(loop2) & ": " & arr2(loop2) & "<br>")
Next loop2
Next loop1
End Sub
End Class
I began programming against this server, which I knew was more complex than any I had programmed against before, with code like the following. I can't show every detail because the agent's competitors would love to know some of them. One thing not shown in full is an important consideration for client certificates. HttpWebRequest has a read-only ClientCertificates property that gets the collection of client certificates associated with the request.
An important consideration is that just because a .NET application has added an existing certificate to this collection does not mean that the application has permission to access the certificate. The application must have the same access rights as the entity that issued the certificate. X.509 is an important standard certificate format for servers, and HttpWebRequest supports it well. (Note: If you just want to learn about WinInet.dll access from .NET, skip ahead now.)
Imports System.Net
Imports System.IO
Imports System.Security.Cryptography.X509Certificates
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic.ControlChars
Imports System.Web.HttpContext
Public Class BigCompanyServer
Dim result, reqHeader, resHeader As String
Dim mode As String = XmlSetting.Read("appsettings", "mode")
Dim domain As String = IIf(mode = "T", XmlSetting.Read("appsettings", "domaint"), _
XmlSetting.Read("appsettings", "domainp"))
Dim loginPath As String = XmlSetting.Read("appsettings", "loginpath")
Dim myCookies As New CookieContainer()
Dim cookies As New CookieCollection()
Dim cookie As New cookie()
Public Function SignIn() As Boolean
Dim loginParameters As String
If mode = "T" Then
loginParameters = "?ACTION=LOGIN"
loginParameters &= "&CHPWD="
loginParameters &= "&WN_VIEW_FLAG=false"
loginParameters &= "&USERS_COOKIE=CODE1047055773000"
loginParameters &= "&USERID=NOCAL16"
loginParameters &= "&PASSWD=WINTER"
loginParameters &= "&RESERVEID=SAINT"
Else
loginParameters = "?ACTION=LOGIN"
loginParameters &= "&CHPWD="
loginParameters &= "&WN_VIEW_FLAG=false"
loginParameters &= "&USERS_COOKIE=CODX20030604074443"
loginParameters &= "&USERID=XXX29293"
loginParameters &= "&PASSWD=FINGERPRINT923"
loginParameters &= "&RESERVEID=9234"
End If
Dim url As String = domain + loginPath + loginParameters
Dim uri As New Uri(url)
Dim req As HttpWebRequest = WebRequest.Create(uri)
Dim myCookies As New CookieContainer
req.Method = "GET"
req.Accept = "*/*"
' next line eg. might let a server know you are not the browser it was expecting
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
req.Headers.Add("Accept-Language", "en-us")
req.CookieContainer = myCookies
Dim cert As IntPtr = CType(xmlsettings.read("parent", "childsetting"), IntPtr)
Dim certX509 As New X509Certificate(cert)
req.ClientCertificates.Add(certX509)
' this was used when I had trouble being accepted by the company server
'reqHeader = req.Headers.ToString + myCookies.GetCookieHeader(req.RequestUri).ToString
Dim res As HttpWebResponse = req.GetResponse()
Dim success As Boolean = res.Cookies.Count > 0
If success Then
Dim cookieHeader1 As String = String.Format("{0} = {1}", "SESSIONID", res.Cookies("SESSIONID").Value)
myCookies.SetCookies(New Uri(String.Format("{0}://{1}", uri.Scheme, uri.Host)), cookieHeader1)
Dim cookieHeader2 As String = String.Format("{0} = {1}", "HANDLEID", res.Cookies("HANDLEID").Value)
myCookies.SetCookies(New Uri(String.Format("{0}://{1}", uri.Scheme, uri.Host)), cookieHeader2)
End If
' note that I can get both request and response headers
'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString()
Dim sr As New StreamReader(res.GetResponseStream())
result = sr.ReadToEnd()
sr.Close()
Current.Session("cookies") = myCookies
'Current.Session("machine") = whichMachine ' development test servers versus production
Return success
End Function
Public Function GetHtml(ByVal currentPath As String, ByVal parameters As String) As String
Dim parameters1 As String = IIf(parameters.StartsWith("?"), parameters, "?" + parameters)
Dim url As String = domain + currentPath + parameters1
Dim uri As New Uri(url)
Dim req As HttpWebRequest = WebRequest.Create(uri)
req.Method = "GET"
req.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, " & _
"application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*"
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
req.ContentType = "application/x-www-form-urlencoded"
req.KeepAlive = True
req.Headers.Add("Accept-Language", "en-us")
Dim myCookies As CookieContainer = Current.Session("cookies")
'Dim reqCookies As String = myCookies.GetCookieHeader(req.RequestUri).ToString
'reqHeader = req.Headers.ToString + reqCookies
req.CookieContainer = myCookies
Dim res As HttpWebResponse = req.GetResponse()
'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString
myCookies.Add(res.Cookies)
Current.Session("cookies") = myCookies
Dim sr As New StreamReader(res.GetResponseStream())
result = sr.ReadToEnd()
Dim Logged As Boolean = Not result.IndexOf("function invalidURL") > -1
sr.Close()
Return result
End Function
Public Function PostHtml(ByVal currentPath As String, ByVal parameters As String) As String
Dim parameters1 As String = IIf(parameters.StartsWith("?"), parameters.Substring(1), parameters)
Dim url As String = domain + currentPath
Dim uri As New Uri(url)
Dim req As HttpWebRequest = WebRequest.Create(uri)
req.Method = "POST"
req.ContentLength = parameters1.Length
req.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, " & _
"application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*"
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.0.3705)"
req.ContentType = "application/x-www-form-urlencoded"
req.KeepAlive = True
req.Headers.Add("Accept-Language", "en-us")
Dim myCookies As CookieContainer = Current.Session("cookies")
' Attach the cookies before opening the request stream so they can go out with the request headers
req.CookieContainer = myCookies
Dim reqCookies As String = myCookies.GetCookieHeader(req.RequestUri).ToString
Dim writer As New StreamWriter(req.GetRequestStream())
writer.Write(parameters1)
writer.Close()
'reqHeader = req.Headers.ToString + reqCookies
Dim res As HttpWebResponse = req.GetResponse()
'resHeader = res.StatusCode.ToString + ":" + res.Headers.ToString
Dim sr As New StreamReader(res.GetResponseStream())
result = sr.ReadToEnd()
sr.Close()
Return result
End Function
End Class
Using regular expressions, especially the MatchCollection class, to scrape the current value or the full option lists from all the select tags is very efficient, and the results are even more reliable if you base the search on something that is not likely to change often. For example, search for JavaScript function names rather than display tags that might be changed to make the page look better. After much initial effort to get a good login, I was confident in the application going forward, with only a little concern about screen scraping as a stable method of receiving data.
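As a rough illustration of that kind of scraping, here is a minimal sketch that pulls the option values and the selected value out of a named select tag with Regex and MatchCollection. The module name SelectScraper, its function names, and the patterns are assumptions made for this example, not the code used against the company's pages, and regular expressions like these only cope with reasonably regular markup.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical helper for illustration only.
Public Module SelectScraper
    ' One <option ...> start tag; the attribute text is captured for a second pass.
    Private ReadOnly OptionTag As New Regex("<option\b(?<attrs>[^>]*)>", RegexOptions.IgnoreCase)

    ' A value attribute in double quotes, single quotes, or unquoted.
    Private ReadOnly ValueAttr As New Regex( _
        "\bvalue\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)'|(?<v>[^\s>]+))", RegexOptions.IgnoreCase)

    ' Inner HTML of the select tag with the given name, or Nothing when absent.
    Private Function GetSelectBody(ByVal html As String, ByVal selectName As String) As String
        Dim pattern As String = "<select\b[^>]*\bname\s*=\s*[""']?" & Regex.Escape(selectName) & _
                                "(?=[""'\s>])[^>]*>(?<body>.*?)</select>"
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If m.Success Then Return m.Groups("body").Value
        Return Nothing
    End Function

    ' All option values of the named select, in page order.
    Public Function GetOptionValues(ByVal html As String, ByVal selectName As String) As List(Of String)
        Dim values As New List(Of String)()
        Dim body As String = GetSelectBody(html, selectName)
        If body Is Nothing Then Return values
        Dim tags As MatchCollection = OptionTag.Matches(body)
        For Each tag As Match In tags
            ' An option without a value attribute yields an empty string here.
            values.Add(ValueAttr.Match(tag.Groups("attrs").Value).Groups("v").Value)
        Next
        Return values
    End Function

    ' Value of the first option marked selected, or Nothing when none is marked.
    Public Function GetSelectedValue(ByVal html As String, ByVal selectName As String) As String
        Dim body As String = GetSelectBody(html, selectName)
        If body Is Nothing Then Return Nothing
        For Each tag As Match In OptionTag.Matches(body)
            Dim attrs As String = tag.Groups("attrs").Value
            Dim valueMatch As Match = ValueAttr.Match(attrs)
            ' Drop the value attribute first so value="selected" is not mistaken for the flag.
            Dim rest As String = attrs
            If valueMatch.Success Then rest = attrs.Remove(valueMatch.Index, valueMatch.Length)
            If Regex.IsMatch(rest, "\bselected\b", RegexOptions.IgnoreCase) Then
                Return valueMatch.Groups("v").Value
            End If
        Next
        Return Nothing
    End Function
End Module
```

Calling SelectScraper.GetOptionValues(html, "dealer") on a scraped page would return the values of the select named dealer (a made-up field name), and GetSelectedValue would return the current one. The same approach works for anchoring on a JavaScript function name instead of a display tag.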
An Alternative
It turned out that screen scraping never became a problem, since the web pages were rarely changed. What did change constantly was the server security maintained by the large company's software development team, due to concerns about hacking. Regularly (at least once a week) they somehow found a way to exclude our requests while still accepting all browser requests. Each time, it was very difficult for me, as a non-expert network programmer, to solve.
Microsoft will tell you over and over that anything WinInet.dll does with a server can be exactly duplicated with the HttpWebRequest and HttpWebResponse classes. I believe this, but I would have to be clever enough to duplicate everything programmed into WinInet, which is very good at reacting intelligently to every communication with a server and echoing back what the server sent in the browser's very next request.
Even though the company had given us permission, their own software team had not; in fact, they were increasingly worried about hacking attempts, which is what our requests looked like to them. If I had known what they were guarding against, I probably could have programmed around it easily, but I was working blind and finally gave up after about 15 weeks of fixes.
Company management and their software team never managed to get together to help us. I next took a tried-and-true third-party Visual FoxPro tool, wwipstuff.dll by Rick Strahl, and wrapped it in a Visual FoxPro web service. Controlled from a VB.NET ASPX page, it worked immediately and ran flawlessly for two years.
My goal now was to make the difficult deployment of the web service (two hours on average) to each new server the client bought easier, and to see whether I could also speed up processing. This is why I wanted to control WinInet.dll as directly as possible from the VB.NET ASPX page.
I figured a quick Google search would supply me with some code, and in an hour I'd have a nice improvement available to the client. Wrong! I found helpful articles about FTP with WinInet but could not get conversions of the code samples to work at all for screen scraping.
Below is the final code, which works smoothly from an ASPX page. Note the A at the end of InternetOpenA and InternetOpenUrlA. The A stands for ANSI; replacing it with W gives the Unicode version, where one exists for a function. Note the important DllImport attribute before each shared function, which associates it with a DLL entry point. DllImportAttribute comes from the System.Runtime.InteropServices namespace and has optional properties that can be set. When char or string data is involved as input or output, you would usually set the CharSet property to CharSet.Auto.
InternetOpen and InternetOpenUrl each have A and W versions, which makes it more likely that the character set has to be specified explicitly. I tried the W versions to no avail (a W entry point would need a matching CharSet.Unicode), so in my experience only CharSet.Ansi works here, and CharSet.Ansi is what the declarations specify. If you don't specify SetLastError:=True, you will not be able to use the error-checking method shown here, Marshal.GetLastWin32Error, which is the only method available to .NET code.
The normal GetLastError usage does not work from .NET, as you can read about in more detail in the Managed Code and GetLastError link at the bottom of the article. Note that the buffer argument for InternetReadFile is a one-dimensional byte array; I first tried String and then StringBuilder to no avail. The byte array must be converted to a string before displaying or scraping it. Notice that the IntPtr data type is used for the handle that the InternetOpen function returns. It has a ToInt32 method, which allows compatibility with the InternetCloseHandle function.
Public Class WebForm1 : Inherits System.Web.UI.Page
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
display.Text = WinInet.GetHtml(url, 14000)
' it's a one-liner due to the shared members of the WinInet class
' note the second argument is the number of bytes you want to scrape
End Sub
End Class
'*************************************************
Imports System.text
Imports System.Runtime.InteropServices
Imports Microsoft.VisualBasic.ControlChars
Public Class WinInet
Const INTERNET_OPEN_TYPE_DIRECT = 1
Const INTERNET_OPEN_TYPE_PROXY = 3
Const INTERNET_FLAG_RELOAD = &H80000000
Const USER_AGENT = "IE"
' HTTP header lines are terminated with CrLf
Shared header As String = "Accept: */*" & CrLf
Shared context As Integer = 0
Shared flags As Integer = 0
Public Shared Function GetHtml(ByVal url As String, ByVal length As Int32) As String
' Per-call state is kept in locals so that concurrent page requests cannot overwrite each other
Dim result As String
Dim handle As IntPtr
Dim session As Int32
Dim newBuffer() As Byte
Dim bytesRead As Int32
Dim response As Int32
handle = InternetOpen(USER_AGENT, INTERNET_OPEN_TYPE_DIRECT, vbNullString, vbNullString, flags)
If handle.Equals(IntPtr.Zero) Then
Return "Error: " & Marshal.GetLastWin32Error()
End If
session = InternetOpenUrl(handle, url, header, header.Length, INTERNET_FLAG_RELOAD, context)
If session = 0 Then
result = "Error: " & Marshal.GetLastWin32Error()
Else
ReDim newBuffer(length - 1)
response = InternetReadFile(session, newBuffer, length, bytesRead)
If response = 0 Then
result = "Error Reading File: " & Marshal.GetLastWin32Error()
Else
' Use appropriate Encoding here to get string from byte array;
' convert only the bytes that were actually read
result = Encoding.UTF8.GetString(newBuffer, 0, bytesRead)
End If
InternetCloseHandle(session)
End If
InternetCloseHandle(handle.ToInt32)
Return result
End Function
<DllImport("WinInet.dll", _
EntryPoint:="InternetOpenA", _
CharSet:=CharSet.Ansi, ExactSpelling:=True, SetLastError:=True)> _
Public Shared Function InternetOpen( _
ByVal agent As String, _
ByVal accessType As Int32, _
ByVal proxyName As String, _
ByVal proxyBypass As String, _
ByVal flags As Int32) As IntPtr
End Function
<DllImport("WinInet.dll", _
EntryPoint:="InternetOpenUrlA", _
CharSet:=CharSet.Ansi, ExactSpelling:=True, SetLastError:=True)> _
Public Shared Function InternetOpenUrl( _
ByVal session As IntPtr, _
ByVal url As String, _
ByVal header As String, _
ByVal headerLength As Int32, _
ByVal flags As Int32, _
ByVal context As Int32) As Int32
End Function
'InternetReadFile
<DllImport("WinInet.dll", _
EntryPoint:="InternetReadFile", _
CharSet:=CharSet.Auto, SetLastError:=True)> _
Public Shared Function InternetReadFile( _
ByVal handle As Int32, _
<MarshalAs(UnmanagedType.LPArray)> _
ByVal newBuffer() As Byte, _
ByVal bufferLength As Int32, _
ByRef bytesRead As Int32) As Int32
End Function
<DllImport("WinInet.dll", _
EntryPoint:="InternetCloseHandle", _
CharSet:=CharSet.Ansi, ExactSpelling:=True, SetLastError:=True)> _
Public Shared Function InternetCloseHandle( _
ByVal hInternet As Int32) As Int32
End Function
End Class
One could treat the length parameter of the WinInet class's GetHtml method as a chunk size and get and append chunks until the whole web page is scraped, but that was not my need. For some pages I only need to scrape the first 100 bytes to get logged in, for example. To append in chunks, use a do-loop that ends with a test like: Loop While response <> 0 AndAlso bytesRead <> 0.
Not only has the short WinInet class solved the two-hour average deployment problem of the WinInet web service, but it is also much faster than the previous WinInet.dll approach. So the code above is very valuable to me, but let's look again at the managed version versus the WinInet API version.
The lengths of the two solutions are similar. The managed code is much more illustrative of what is going on. The managed code is much more fun to program. The managed code gives more control and potentially will do anything that WinInet.dll will. The most important thing, of course, is that with WinInet the company server still cannot tell that I am not a browser, as it could with the managed application. I wish I could use the managed version, but sometimes the best standards to follow are not the most practical for a particular situation.
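The chunked approach just described can be sketched as follows. This is only an outline under assumptions: the ReadChunk delegate mirrors the shape of the InternetReadFile declaration above so the loop can be shown without a live connection, and the names ChunkedReader, ReadAll, and FakeSource are invented for the example.

```vbnet
Imports System
Imports System.IO
Imports System.Text

' Hypothetical helper for illustration only.
Public Module ChunkedReader
    ' Same shape as the InternetReadFile declaration:
    ' returns non-zero on success and reports the byte count through bytesRead.
    Public Delegate Function ReadChunk(ByVal handle As Int32, ByVal buffer() As Byte, _
                                       ByVal bufferLength As Int32, ByRef bytesRead As Int32) As Int32

    ' Reads chunkSize bytes at a time until a successful read returns zero bytes.
    Public Function ReadAll(ByVal reader As ReadChunk, ByVal handle As Int32, _
                            ByVal chunkSize As Int32) As String
        Dim chunk(chunkSize - 1) As Byte
        Dim collected As New MemoryStream()
        Dim bytesRead As Int32
        Dim response As Int32
        Do
            bytesRead = 0
            response = reader(handle, chunk, chunkSize, bytesRead)
            If response = 0 Then
                Throw New IOException("Read failed before the end of the data.")
            End If
            collected.Write(chunk, 0, bytesRead)
        Loop While bytesRead <> 0
        ' Decode once at the end so a multi-byte character split across chunks stays intact.
        Return Encoding.UTF8.GetString(collected.ToArray())
    End Function
End Module

' Stand-in for a WinInet session: serves bytes from memory so the loop can be tried offline.
Public Class FakeSource
    Private ReadOnly data() As Byte
    Private position As Integer = 0

    Public Sub New(ByVal bytes() As Byte)
        data = bytes
    End Sub

    Public Function Read(ByVal handle As Int32, ByVal buffer() As Byte, _
                         ByVal bufferLength As Int32, ByRef bytesRead As Int32) As Int32
        bytesRead = Math.Min(bufferLength, data.Length - position)
        Array.Copy(data, position, buffer, 0, bytesRead)
        position += bytesRead
        Return 1
    End Function
End Class
```

For example, ChunkedReader.ReadAll(AddressOf source.Read, 0, 4) with a FakeSource named source reads the whole payload four bytes at a time. Against a real page, the reader would be a small wrapper function that forwards to InternetReadFile with the session value as the handle; that wiring is not tested here.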
Good luck, and send your questions to tvoss@computer-consulting.com or, better yet, to aspnet@aspadvice.com, where I watch for questions.
Links:
A good discussion of programming to the API from .NET in general (c#)
Managed Code and GetLastError
Replacing API Calls with .NET Framework Classes
The MarshalAsAttribute Class
Books: (I own the two books below and can therefore attest to their quality on network programming, although neither covers WinInet.dll)
1. Network Programming in .NET: C# & Visual Basic .NET, by Fiach Reid (paperback)
2. Network Programming for the Microsoft .NET Framework, by Jim Ohlund, et al. (paperback)