Listing 1: WSDL File
<?xml version="1.0" encoding="utf-8"?>
<wsdl:definitions
    xmlns:s="http://www.w3.org/2001/XMLSchema"
    xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
    xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
    xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:s0="http://www.taryatechnologies.com"
    targetNamespace="http://www.taryatechnologies.com"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:types/>
  <wsdl:message name="msgHttpGetIn" />
  <wsdl:message name="msgHttpGetOut" />
  <wsdl:portType name="ptypeHttpGet">
    <wsdl:operation name="GetTaryaServices">
      <wsdl:input message="s0:msgHttpGetIn"/>
      <wsdl:output message="s0:msgHttpGetOut"/>
    </wsdl:operation>
  </wsdl:portType>
  <wsdl:binding name="bindHttpGet"
      type="s0:ptypeHttpGet">
    <http:binding verb="GET"/>
    <wsdl:operation name="GetTaryaServices">
      <http:operation location="/aboutus.asp"/>
      <wsdl:input>
        <http:urlEncoded/>
      </wsdl:input>
      <wsdl:output>
        <tm:text>
          <tm:match
              name='myServices'
              pattern='&lt;ul&gt;(.*?)&lt;/ul&gt;'
              ignoreCase='true'
              repeats='100' />
        </tm:text>
      </wsdl:output>
    </wsdl:operation>
  </wsdl:binding>
  <wsdl:service name="TaryaService">
    <wsdl:port
        name="ptypeHttpGet"
        binding="s0:bindHttpGet">
      <http:address location="http://www.taryatechnologies.com" />
    </wsdl:port>
  </wsdl:service>
</wsdl:definitions>
I will not explain the WSDL format in detail, as it is outside the scope of this article. You can find that information at http://www.w3.org/TR/wsdl
In the file above, the values repeated in Listings 2 to 5 are the variables that you have to change when you create your own WSDL file. I explain them below.
Listing 2:
xmlns:s0="http://www.taryatechnologies.com"
targetNamespace="http://www.taryatechnologies.com"
You have to specify the URL of the website from which you extract information.
Listing 3:
<wsdl:operation name="GetTaryaServices">
This is the method name. It can be anything you like, as long as the same name is used in both the portType and the binding. You use it later in the code that consumes the web service.
Listing 4:
<http:operation location="/aboutus.asp"/>
This is the path, relative to the website address, of the specific page from which you extract information.
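As a rough illustration of how the two location values fit together, the following Python sketch joins the site address from the http:address element with the relative path from the http:operation element. It assumes that the client simply resolves the path against the address. The variable names are made up for this example, and Python is used only for illustration.

```python
from urllib.parse import urljoin

# Values copied from Listing 1; the variable names are hypothetical.
site_address = "http://www.taryatechnologies.com"  # <http:address location="...">
operation_path = "/aboutus.asp"                    # <http:operation location="...">

# Resolve the relative path against the site address.
request_url = urljoin(site_address, operation_path)
print(request_url)
```

The printed URL is the page whose HTML the match pattern is applied to.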
Listing 5:
<tm:match
    name='myServices'
    pattern='&lt;ul&gt;(.*?)&lt;/ul&gt;'
    ignoreCase='true'
    repeats='100' />
This is the most important part of the WSDL file. The name attribute of the match element can be anything. The pattern attribute is a regular expression that selects the actual content from the web page, so you need to write it with care. There are good resources on the Net for learning pattern matching. One of them is available at http://www.evolt.org/node/22700.
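To make the roles of the match attributes more concrete, here is a small Python sketch that imitates them outside of .NET. It assumes that ignoreCase switches on case-insensitive matching and that repeats caps the number of values returned. The function name extract_matches and the sample HTML string are invented for the illustration, and Python's re module stands in for the .NET regular expression engine.

```python
import re
from itertools import islice

def extract_matches(html, pattern, ignore_case=True, repeats=100):
    """Return group 1 of up to `repeats` matches (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [m.group(1) for m in islice(regex.finditer(html), repeats)]

# Invented single-line sample, not the real page.
sample = "<UL>first</UL><ul>second</ul><ul>third</ul>"
pattern = r"<ul>(.*?)</ul>"

all_values = extract_matches(sample, pattern)                          # every list
capped = extract_matches(sample, pattern, repeats=2)                   # at most two values
case_sensitive = extract_matches(sample, pattern, ignore_case=False)   # skips <UL>

print(all_values)
print(capped)
print(case_sensitive)
```

With ignoreCase turned off the upper-case list is skipped, and a smaller repeats value cuts the result short, which is why the listing sets both attributes explicitly.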
Before writing the expression, you need to decide exactly what you want from the website, so look at the HTML source of the web page from which you want to extract information. In our example, I want to extract the services offered by Taryatechnologies. View the HTML source of www.taryatechnologies.com/aboutus.asp. The piece of information we want is as follows (in HTML):
<ul>
<li>Web Site Development</li>
<li>Web Applications</li>
<li>Web services</li>
<li>Graphical Designs</li>
<li>Mobile Applications</li>
<li>Digital Signage solutions</li>
</ul>
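We want the information between the <ul> and </ul> tags, so our regular expression is: <ul>(.*?)</ul>. Because the WSDL file is XML, the < and > characters of the expression are written there as &lt; and &gt;.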
The same result can be obtained by different expressions.
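The symbols used in the expression have the following meanings:
( ) Groups part of the pattern and captures the text it matches.
. Matches any character except a newline.
* Matches the preceding element zero or more times.
? Matches the preceding element zero or one time. Placed after *, as in .*?, it makes the match lazy, so that it stops at the first closing tag.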
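The effect of these symbols can be tried out with a short Python sketch. It applies the expression to a shortened copy of the list shown earlier, followed by an invented second list, and contrasts the lazy .*? with the greedy .* form. Because the dot does not match a newline by default, the sketch passes re.DOTALL so that the multi-line list is captured. Python's re module is used here only for illustration, and whether the .NET text-matching engine needs an equivalent option is not covered in this article.

```python
import re

# Shortened copy of the list from aboutus.asp, plus an invented second list.
html = """<ul>
<li>Web Site Development</li>
<li>Web Applications</li>
</ul>
<ul>
<li>Another list</li>
</ul>"""

# Lazy quantifier: each match stops at the first closing tag.
lazy = re.findall(r"<ul>(.*?)</ul>", html, re.DOTALL)
# Greedy quantifier: one match runs on to the last closing tag.
greedy = re.findall(r"<ul>(.*)</ul>", html, re.DOTALL)
# Without DOTALL the dot cannot cross the newline after <ul>.
no_dotall = re.findall(r"<ul>(.*?)</ul>", html)

print(len(lazy))
print(len(greedy))
print(no_dotall)
```

The lazy form yields one capture per list, while the greedy form swallows both lists in a single capture, which is the reason for the ? after the * in the expression.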