Extending an Existing Service: Post Mortem
Published: 13 Oct 2003
Unedited - Community Contributed
Abstract
Screen scraping can be fun, but sometimes it is serious business. This article describes how to extend an existing service via screen scraping.
by Terry Voss

A web site as a service layer.

 

This project started differently from most that this broker has given me. He had me fly down to learn about the project so that I could get started right away. The task was to connect to a server and duplicate an existing service that activates a new service for a customer. The existing service worked only for an agent and not for each of the agent's dealers, so all of an agent's dealers had to phone and fax in the information that the agent then used to do the activation. The service was a web application of about 12 html pages (call it x). It worked with all the customers of all the agent's dealers and could not separate one dealer from another. Dealers cannot have access to other dealers' data, because activation is highly competitive: the agent pays the dealer a commission on each activation.
 
There was no time for use cases or design, and no one to really talk to about any of this. They needed proof that the connection could be made, and we would go from there.
 
The company providing the service was motivated to let the agent extend the service to its dealers. The agent was losing dealers to the competition because its activation process was slow and because other agents used telnet and other means of getting fast access to activation. But the service provider was a large company that could not afford to have its IT staff interrupted by our project, so it declared very clearly that it would give us no information to help us. The more I listened, the more I feared the connection might never work, so I asked the broker whether I would be paid for my time if it never got going. He said yes. One 17-year-old girl was the only person who knew anything about program x. She showed me around it, although I could not really understand how it worked or what the goal was because it was so full of industry terms. There was not much use in getting to know the app until I found out whether I could manage the basic signin and connection. After that my whole concern was whether I could get the first page to go, and so it remained a page-by-page process.
 
Each time I got past some hurdle on the way to a successful login, my hopes were dashed as I slowly discovered that the service provider's security included SSL, a proxy, cookie sets with expiring passwords, a main password that changed, and more. When I finally got logged in and received my first html page I was in ecstasy. What follows is abstracted code for handling these basic security items, which most larger companies would use. I cannot show more detail because competing companies want to do the same thing.
 
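Imports System.Data
Imports System.IO
Imports System.Net
Imports System.Security.Cryptography.X509Certificates
 
Public Class Server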
Dim strConn As String = XmlSetting.Read("identity", "connectionstring")
Dim sqlData As SqlService = New SqlService()
Dim ds As DataSet
Dim dt As DataTable
Dim dr As DataRow
Dim dv As DataView
Dim strSql As String
Dim appAssembly As System.Reflection.Assembly
Dim htmpath As String
Dim smarket As String = XmlSetting.Read("identity", "smarket")
Dim sdistrict As String = XmlSetting.Read("identity", "sdistrict")
Dim username As String = XmlSetting.Read("identity", "username")
Dim soffice As String
Dim sforce As String = XmlSetting.Read("identity", "sforce")
Dim srepid As String = XmlSetting.Read("identity", "srepid")
Dim password As String = XmlSetting.Read("appsettings", "password")
Dim certpath As String = XmlSetting.Read("identity", "certpath")
Dim vdomain As String = XmlSetting.Read("identity", "vdomain")
Dim servpath As String = XmlSetting.Read("identity", "servpath")
Dim nearfullpath As String = "https://" + vdomain + servpath
Dim mainpath As String = nearfullpath + "sss/"
Dim proxypath As String = XmlSetting.Read("identity", "proxypath")
Dim proxyport As Integer = CType(XmlSetting.Read("identity", "proxyport"), Integer)
Dim sCookie1 As Cookie = New Cookie("XXX")
Dim sCookie2 As Cookie = New Cookie("XXX")
Dim sCookie3 As Cookie = New Cookie("XXX")
Dim sCookie4 As Cookie
Dim sCookie5 As Cookie = New Cookie("awacs_u", username, servpath, vdomain)
Dim sCookie6 As Cookie
Dim sCookie7 As Cookie = New Cookie("awacs_r", srepid, servpath, vdomain)
Dim sCookieContainer As CookieContainer = New CookieContainer()
 
Sub New()
  MyBase.new()
  appAssembly = System.Reflection.Assembly.GetCallingAssembly
  htmpath = appAssembly.CodeBase
  htmpath = htmpath.Substring(8)
  htmpath = Path.GetDirectoryName(htmpath).Replace("bin", "")
End Sub
 
Public Function HttpGetHtml(ByVal url As String, ByVal office As String) As String
  Dim strID As String = ""
  Dim sr As StreamReader
  Dim req As HttpWebRequest
  Dim res As HttpWebResponse
  Dim cer As X509Certificate  ' I received a different kind of certificate, so I had to import it and then export it as a type that .NET supports
  ' A proxy is a computer with its own address that sits in front of the actual server you want to reach.
  ' It can check the certificate. The address of the actual server may not exist on the internet at all, as in this case.
  Dim proxy As WebProxy
  Dim strError As String
  Dim strCookies As String
  proxy = New WebProxy(proxypath, proxyport)           ' the proxy and the webrequest work together to get to the real url
  req = CType(WebRequest.Create(url), HttpWebRequest)  ' this is the server providing the service
  sCookieContainer.Add(sCookie1)
  sCookieContainer.Add(sCookie2)
  sCookieContainer.Add(sCookie3)
  soffice = office
  sCookie4 = New Cookie("XXX", soffice, servpath, vdomain)
  sCookieContainer.Add(sCookie4)
  sCookieContainer.Add(sCookie5)
  strID = TryReadFile(htmpath + "id.txt")
  sCookie6 = New Cookie("awacs_password", strID, servpath, vdomain)
  sCookieContainer.Add(sCookie6)
  sCookieContainer.Add(sCookie7)
  req.CookieContainer = sCookieContainer
  cer = X509Certificate.CreateFromCertFile(certpath)
  req.ClientCertificates.Add(cer)
  req.Proxy = proxy
  req.Method = "GET"          ' certain pages require a POST instead; the signin and post methods are slight variations of this code
  req.ContentType = "application/x-www-form-urlencoded" ' these four settings (ContentType through Accept) may or may not be important in your case
  req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)"
  req.KeepAlive = True
  req.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*"
  res = CType(req.GetResponse(), HttpWebResponse) ' this call makes the actual connection to the other server
  sr = New StreamReader(res.GetResponseStream())
  Dim strHtmlx As String = sr.ReadToEnd()
  strID = Me.Getstring(res.Headers.ToString(), "awacs_password", 15, 16) ' get the new password out of the header
  TryWriteFile(strID, htmpath + "id.txt") ' keep trying to write until it succeeds, since many users may be hitting this at once
  sr.Close()
  res.Close()
  Return strHtmlx ' an entire html page, which I screen scrape to build my own pages and keep my own control of the service
End Function
End Class
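 
The listing calls TryReadFile and TryWriteFile without showing them. What follows is only a minimal sketch of how such helpers might retry while another request holds id.txt open. The module name FileRetry, the maxAttempts and delayMs parameters, and the retry policy itself are assumptions for illustration, not the author's code.

```vbnet
Imports System
Imports System.IO
Imports System.Threading

Public Module FileRetry
    ' Hypothetical stand-in for TryWriteFile: retries while another
    ' request has the file locked, and gives up after maxAttempts.
    Public Function TryWriteFile(ByVal text As String, ByVal filePath As String,
                                 Optional ByVal maxAttempts As Integer = 20,
                                 Optional ByVal delayMs As Integer = 50) As Boolean
        For attempt As Integer = 1 To maxAttempts
            Try
                ' FileShare.None makes a competing open fail with IOException.
                Using fs As New FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.None)
                    Using sw As New StreamWriter(fs)
                        sw.Write(text)
                    End Using
                End Using
                Return True
            Catch ex As IOException
                Thread.Sleep(delayMs)   ' file is busy, wait and try again
            End Try
        Next
        Return False
    End Function

    ' Hypothetical stand-in for TryReadFile: returns "" when the file
    ' does not exist yet or stays locked for every attempt.
    Public Function TryReadFile(ByVal filePath As String,
                                Optional ByVal maxAttempts As Integer = 20,
                                Optional ByVal delayMs As Integer = 50) As String
        For attempt As Integer = 1 To maxAttempts
            Try
                Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
                    Using sr As New StreamReader(fs)
                        Return sr.ReadToEnd()
                    End Using
                End Using
            Catch ex As FileNotFoundException
                Return ""
            Catch ex As IOException
                Thread.Sleep(delayMs)   ' a writer has the file, wait and try again
            End Try
        Next
        Return ""
    End Function
End Module
```

A bounded retry count is a design choice in this sketch: a request that can never get the file returns False or an empty string instead of blocking a server thread indefinitely.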

 
The http classes of .NET handle this very well, as the few lines of code needed to solve the connection problem show. Now it seemed simple.
A dealer employee would send a request with some data to my program, and I would then make the same request of the service provider's server. It would return an html page to me, and I would scrape it for the info needed to build my next aspx page, which I presented to the user to show whether they could go on or not.
 
Each of the 12 pages was another level of success toward the activation, depending on the customer's credit and willingness to comply. I used the same pattern on every page: present my aspx page and, once it was filled out and validated, submit a request to the remote server with the parameter list it required. The server returned an html page, which could be an error page or a meaningful page. I handled the error, or scraped the page for the data the next page needed and put it into an arraylist, which seemed handy since it could hold single pieces of data or a whole collection. The arraylist went into the session just long enough to reach the next page load, where I took it back out of the session and used its contents to build my page. Nearly all the captured info was also saved to our own Sql database for administrative control, reporting, and commissions.
How did I know what to send to the service provider's server, x? I submitted the pages of their app x one at a time, editing the form's action target in their source so that the form get went to my own aspx page with trace turned on. The trace showed me everything their server was expecting.
 
My next main concern was that the screen scraping would stop working when the service provider changed its pages in response to customer complaints or requests for new features. Regular expressions helped create some stability here. It is hard for x to have a combo box of options without having some option elements in it, so those elements make a dependable target. Regex matching into a MatchCollection worked great here.
 
In the code that follows, the first few lines show how I got a single element out of the html. I wrote a GetDelString method for this: Html is the whole html page I received, Srch gets me close to what I need, and del1 and del2 are the exact delimiters around what I want. This way I could pick something I felt was fairly stable to get close, and my delimiters only had to be unique from that point on. Over 8 months of usage we've had no changes required in the screen scraping. This was a pleasant surprise.
 
Srch = "name=" + Quote + "sso"
del1 = "value=" + Quote
del2 = Quote + " >"
rv.sso = curServer.GetDelString(Html, Srch, del1, del2)
Session("phonchoi") = rv
Dim al1 As ArrayList = New ArrayList()
Dim r1 As Regex = New Regex("", RegexOptions.IgnoreCase)
Dim mc As MatchCollection
Dim m As Match
Dim strSubHtml As String
strSubHtml = curServer.GetDelString(strHtml, "
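 
GetDelString itself is not shown in the article, and the regular expression pattern in the fragment did not survive publication. As a rough sketch only, the two techniques described above (search near a stable anchor and then cut between two delimiters, and collect option elements with a regular expression) could look like the following. The module name ScrapeHelpers, the function GetOptionValues, and the option pattern are assumptions, not the author's originals.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module ScrapeHelpers
    ' Sketch of a GetDelString-style helper: jump to srch, then return the
    ' text between the next del1 and the del2 that follows it.
    ' Returns "" when any of the three markers is missing.
    Public Function GetDelString(ByVal html As String, ByVal srch As String,
                                 ByVal del1 As String, ByVal del2 As String) As String
        Dim anchor As Integer = html.IndexOf(srch, StringComparison.OrdinalIgnoreCase)
        If anchor < 0 Then Return ""
        Dim startDel As Integer = html.IndexOf(del1, anchor + srch.Length, StringComparison.OrdinalIgnoreCase)
        If startDel < 0 Then Return ""
        Dim valueStart As Integer = startDel + del1.Length
        Dim valueEnd As Integer = html.IndexOf(del2, valueStart, StringComparison.OrdinalIgnoreCase)
        If valueEnd < 0 Then Return ""
        Return html.Substring(valueStart, valueEnd - valueStart)
    End Function

    ' Hypothetical helper: collects the value attribute of every option
    ' element in a piece of html. The pattern is an assumed example.
    Public Function GetOptionValues(ByVal selectHtml As String) As List(Of String)
        Dim values As New List(Of String)()
        Dim r As New Regex("<option[^>]*\svalue\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase)
        For Each m As Match In r.Matches(selectHtml)
            values.Add(m.Groups(1).Value)
        Next
        Return values
    End Function
End Module
```

With the delimiters from the fragment above (Srch of name="sso, del1 of value=", del2 of " >), the helper would return whatever sits in the value attribute of the sso field. Searching from the anchor onward is what lets short, common delimiters stay unambiguous.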

By the time I got to the 12th page and had some beta activations going, new feature requests were being listed each day, and I had a hard time keeping up with them and the few daily bug finds. Someone else was doing all the hardware setup, which helped a lot. After this stage calmed down a bit I started to notice code complexity as a problem. I couldn't find anything fast enough when I needed to fix a bug, and the program was now being used by many dealers, with new ones being put onto the system daily by a fulltime field rep. I had learned the overall structure only page by page, so I had just thrown all the methods into a businesslayer class. After the first beta testing we put in two servers for hits and one for the database, hoping that would hold us for a while. Because of the two hit servers, and because each server had four processors, we had to go to out-of-process sessions, so I made the classes that would go into sessions serializable. During my first monitoring of our app I was surprised to see that we were almost maxed out already (see Monitoring Your Web App).
 
The one path through 12 pages became 12 paths of 12 pages each, depending on the features selected. Every time we added a feature, all 12 paths had to be retested before the new feature went into production. I created an NUnit test for the main path, which let me test it in a few seconds. I also wrote a custom FTP program that listed all our project files. It worked two ways. The client demanded that all source code exist, with backups, on all 7 servers being used; that was no problem, since we were behind a good hardware firewall. I could copy a few source files and dlls, or all files with or without dlls, always excluding the configuration files. This saved tons of time and made the process flawless as far as never copying over another machine's configuration. The 7 machines were: the 2 hit servers; the database server; my development machine, which ran in a training mode that allowed fake activations; a testing machine at the agent's customer service site, where I trained a person to test the other 11 paths; a backup machine; and a production server at my site, used solely to reproduce problem examples so that I could trace a problem to a line of code and fix it fast.
 
The businesslayer alone was now many hundreds of thousands of bytes. I split it up along a structure that by now was becoming obvious: server, agent, dealer, customer, account, products.
 
One server had many agents, one agent had many dealers, one dealer had many customers, one customer had many accounts, and one account had many products that needed activation. I created the six classes and gave each method to its proper owner. Server had the http connection methods and the screen scraping. Agent owned all the administration and reports except the dealer reports. Dealer had its own reports and the responsibilities of login, which had necessarily become complex because people tried to run the program from their homes, after midnight when x did its housekeeping, and so on. The Customer class had lots of methods, Account had a few, and Products quite a few.
 
Then representatives of the service provider began saying, "Your program doesn't do everything that the x program does," which meant more features and more bugs. By now the parameters posted to x had to be perfect, and debugging them was difficult: the arraylists depended on integer indexes, so I had to look back and forth from page to page to see what was being passed and whether it was going into the proper place.
 
Suddenly it dawned on me that if I passed customer and product objects filled with all the data, the data would be self-commenting. That solved lots of problems, once I had rewritten tons of code. Instead of passing generic arraylist collections, I passed the structures that the app was about, and of course this was a great simplification.
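 
To make the contrast concrete, here is a small sketch of the same two pieces of data packed first into an arraylist and then into an object. Every class, field, and function name in it is hypothetical; the article does not show the real Customer and Product classes. The Serializable attribute is included because classes placed in out-of-process sessions had to be serializable.

```vbnet
Imports System
Imports System.Collections

' All names below are hypothetical illustrations, not the project's classes.
<Serializable()>
Public Class Product
    Public Name As String = ""
    Public Activated As Boolean
End Class

<Serializable()>
Public Class Customer
    Public LastName As String = ""
    Public CreditApproved As Boolean
    Public Products As New ArrayList()   ' holds Product objects
End Class

Public Module PageData
    ' Before: the meaning of each slot is known only from its index,
    ' so the reading page must agree with the writing page on positions.
    Public Function PackAsList(ByVal lastName As String, ByVal creditApproved As Boolean) As ArrayList
        Dim al As New ArrayList()
        al.Add(lastName)        ' index 0
        al.Add(creditApproved)  ' index 1
        Return al
    End Function

    ' After: the field names say what is being passed.
    Public Function PackAsCustomer(ByVal lastName As String, ByVal creditApproved As Boolean) As Customer
        Dim c As New Customer()
        c.LastName = lastName
        c.CreditApproved = creditApproved
        Return c
    End Function
End Module
```

On the next page load, code reading c.CreditApproved needs no lookup back to the previous page, whereas al(1) has to be cast and its position remembered.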
Once the feature requests died down a bit, I thought we should look at performance and scalability, with about a hundred dealers now using the program and more coming onto the agent's program each day. The project was deemed a success, since dealers were dropping their current agents to come on board with this fast way of activating. It increased sales 50% for many dealers, because customers had been walking away from deals in mid-activation after waiting in stores for an hour. The activation process could now take as little as 6 minutes.
One very valuable report showed hits and average processing time per page for any period. Our most used page was our slowest, and fixing it helped a lot. Remember that our app was piggybacked on another service, so both processing times had to be added to see what users saw. I put in messages showing how long the remote server took versus the local server so that users could more easily tell us about extreme cases. Each hit was averaging 1.5 seconds, a nightmare for a website. The longest running times were the administration reports, which were operating on more and more data. These could take 60 seconds, holding one server processor hostage the whole time, so all administration was shifted to a non-busy server. Garbage collection was rampant, so I went through each page looking for instantiations inside loops and making sure each needed object was instantiated only once. I rewrote stored procedures for speed. We found that people were clicking their mouse 100 times per minute to gain credit checking advantages, so I wrote a Windows service that checked the database every minute for anyone doing that and blocked their IP address from logging in until they called and explained why.
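 
The once-a-minute check for rapid clickers can be sketched as a simple count of hits per IP address over the last minute. This is an illustration under assumptions: the real service queried the database, while this sketch takes an in-memory list of (IP address, time) pairs, and the names ClickThrottle, FindRapidClickers, asOf, and maxHitsPerMinute are hypothetical. Only the figure of 100 per minute comes from the account above.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module ClickThrottle
    ' Returns the IP addresses with at least maxHitsPerMinute hits in the
    ' minute ending at asOf. Each hit is a pair of (IP address, hit time).
    Public Function FindRapidClickers(ByVal hits As IEnumerable(Of KeyValuePair(Of String, DateTime)),
                                      ByVal asOf As DateTime,
                                      Optional ByVal maxHitsPerMinute As Integer = 100) As List(Of String)
        Dim counts As New Dictionary(Of String, Integer)()
        Dim windowStart As DateTime = asOf.AddMinutes(-1)

        ' Count only the hits that fall inside the one-minute window.
        For Each hit As KeyValuePair(Of String, DateTime) In hits
            If hit.Value > windowStart AndAlso hit.Value <= asOf Then
                Dim n As Integer = 0
                counts.TryGetValue(hit.Key, n)
                counts(hit.Key) = n + 1
            End If
        Next

        ' Anyone at or over the threshold is a candidate for blocking.
        Dim offenders As New List(Of String)()
        For Each pair As KeyValuePair(Of String, Integer) In counts
            If pair.Value >= maxHitsPerMinute Then offenders.Add(pair.Key)
        Next
        Return offenders
    End Function
End Module
```

A caller running on a one-minute timer would pass the current time as asOf and add each returned address to whatever block list the login code consults.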
 
Avoiding downtime was crucial, as 175 stores were signed up, and customer service is immediately swamped if any glitch occurs. If the service provider ever went down, we knew instantly. We became their alert, and they came to rely on us to tell them about down times. About this point the service provider finally became friendly to us and shared some secrets, so the story has a happy ending. Don't we all hope for use cases up front and then design work before beginning? Sometimes we must work blind. This project was done 99% by telecommuting. I visited the site a couple of times to train people, but most of the training was done offsite over the phone.
 
Three recent observations:
1) So many dealers have signed up that accounting cannot keep up with the commissions, which depend on complex chargebacks, so I worked on a commissions program inside an accounting package that dovetails into the activation program instead of just delivering reports.
 
2) Knowing what I know now, I realize that an activation class should have been designed to handle all the page-to-page data movement. It could then deliver data to the properties of the customer and product objects, which could take the data to the tables of our database. If I could have designed beforehand, think of how much time could have been saved by using a DAL generator to get the stored procs and classes up front.
 
3) Having one tester who could act as a buffer between customers, customer service reps, and me was key.
 
In-house people have since taken over the project and I'm on to my next one, but this is the only thing like it that I've done in fifteen years of projects, and I thought it might be an interesting share for some. It has been a positive process for me to think back over these last twelve months of activity.
Last modified: October 09, 2003





©Copyright 1998-2024 ASPAlliance.com