-
Nov 19th, 2011, 06:33 PM
#1
Thread Starter
Fanatic Member
scraping website that uses ajax
We have a site that scrapes a third-party site to gather all available models of a product. The third-party site recently changed so that users now select the manufacturer via AJAX, and once they have selected one, a dropdown of products is loaded via AJAX as well.
Until now I was using HttpWebRequest for all requests (see below).
Code:
Public Function fnRequest(ByVal sPOSTData As String, Optional ByVal bAutoRedirect As Boolean = False) As String
    Dim uriSite As Uri
    Dim sReturn As String
    Dim srReader As StreamReader
    Dim sTemp As String
    sReturn = String.Empty
    Try
        ' Setup request
        uriSite = New Uri(m_sURL)
        m_hwrRequest = DirectCast(WebRequest.Create(uriSite), HttpWebRequest)
        m_hwrRequest.Referer = m_sReferer
        m_hwrRequest.UserAgent = m_sUserAgent
        m_hwrRequest.AllowAutoRedirect = bAutoRedirect
        m_hwrRequest.AllowWriteStreamBuffering = True
        m_hwrRequest.KeepAlive = False
        m_hwrRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, application/x-shockwave-flash, */*"
        ' Send back any cookies collected from earlier responses
        If Not (m_ccCookies Is Nothing) Then
            If m_ccCookies.Count > 0 Then
                m_hwrRequest.CookieContainer = New CookieContainer()
                m_hwrRequest.CookieContainer.Add(m_ccCookies)
            End If
        End If
        ' A non-empty body turns the request into a form POST
        If Not (sPOSTData Is Nothing) AndAlso sPOSTData.Length > 0 Then
            Dim stWS As Stream
            Dim aeEnc As ASCIIEncoding
            Dim baBuf As Byte()
            aeEnc = New ASCIIEncoding()
            baBuf = aeEnc.GetBytes(sPOSTData)
            m_hwrRequest.Method = "POST"
            m_hwrRequest.ContentLength = baBuf.Length
            m_hwrRequest.ContentType = "application/x-www-form-urlencoded"
            stWS = m_hwrRequest.GetRequestStream()
            stWS.Write(baBuf, 0, baBuf.Length)
            stWS.Close()
            'm_hwrRequest.AllowAutoRedirect = True
        End If
        m_hwrResponse = DirectCast(m_hwrRequest.GetResponse(), HttpWebResponse)
        srReader = New StreamReader(m_hwrResponse.GetResponseStream())
        sReturn = srReader.ReadToEnd()
        srReader.Close()
        '------------------------------------------------------------------
        ' capture the redirect location from the header
        '------------------------------------------------------------------
        Try
            Dim wbHCol As WebHeaderCollection = m_hwrResponse.Headers
            Dim i As Integer
            For i = 0 To wbHCol.Count - 1
                Dim header As String = wbHCol.GetKey(i)
                Dim values As String() = wbHCol.GetValues(header)
                If values.Length > 0 AndAlso header.ToLower = "location" Then
                    ' Assign rather than append, so an earlier location is not carried over
                    Location1 = values(0)
                End If
            Next
        Catch
            Location1 = String.Empty
        End Try
        '------------------------------------------------------------------
        ' Merge any new cookies into the shared collection
        If Not m_hwrResponse.Headers("Set-Cookie") Is Nothing Then
            Dim ccContainer As New CookieContainer()
            ccContainer.SetCookies(m_hwrResponse.ResponseUri, m_hwrResponse.Headers("Set-Cookie"))
            sTemp = m_hwrResponse.Headers("Set-Cookie").ToString
            m_ccCookies.Add(ccContainer.GetCookies(m_hwrResponse.ResponseUri))
        End If
        Me.Referer = m_hwrResponse.ResponseUri.AbsoluteUri
        'close response connection
        m_hwrResponse.Close()
    Catch ex As Exception
        lblError.Text &= ex.Message
    End Try
    Return sReturn
End Function
In Fiddler, the POST now appears to be made via AJAX. I tried sending the POST data the normal way, but the site rejected it.
Does anyone have an example of how to do this? To get an idea, go to http://www.chiefmfg.com/, find the Mount Finder, and select Projector.
Thanks in advance for any info.
-
Nov 20th, 2011, 01:02 PM
#2
Re: scraping website that uses ajax
When you say that it didn't like it, what exactly do you mean? Can you elaborate?
Gary
-
Nov 21st, 2011, 07:05 AM
#3
Thread Starter
Fanatic Member
Re: scraping website that uses ajax
This is the response I get:
1|#||4|58|pageRedirect||%2fApplicationError.aspx%3faspxerrorpath%3d%2fDefault.aspx|
I can PM the code if you like, or post it online somewhere.
Thanks.
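For reference, the response above looks like the ASP.NET AJAX partial-rendering (delta) format, which, as far as I understand it, is a series of `length|type|id|content|` records. Below is a minimal sketch of a parser under that assumption; `DeltaPart` and `ParseDelta` are hypothetical names used only for illustration and are not part of any library.

```vbnet
Imports System.Collections.Generic

Module DeltaSketch
    ' One record of the assumed "length|type|id|content|" layout.
    Public Structure DeltaPart
        Public Type As String
        Public Id As String
        Public Content As String
    End Structure

    ' Splits a delta response into its records. Parsing stops quietly
    ' at the first record that does not fit the assumed layout.
    Public Function ParseDelta(ByVal sDelta As String) As List(Of DeltaPart)
        Dim parts As New List(Of DeltaPart)()
        Dim pos As Integer = 0
        While pos < sDelta.Length
            ' The length field runs up to the first pipe.
            Dim p1 As Integer = sDelta.IndexOf("|"c, pos)
            If p1 < 0 Then Exit While
            Dim nLen As Integer
            If Not Integer.TryParse(sDelta.Substring(pos, p1 - pos), nLen) Then Exit While
            ' The type and id fields are each terminated by a pipe.
            Dim p2 As Integer = sDelta.IndexOf("|"c, p1 + 1)
            If p2 < 0 Then Exit While
            Dim p3 As Integer = sDelta.IndexOf("|"c, p2 + 1)
            If p3 < 0 OrElse p3 + 1 + nLen > sDelta.Length Then Exit While
            Dim part As New DeltaPart()
            part.Type = sDelta.Substring(p1 + 1, p2 - p1 - 1)
            part.Id = sDelta.Substring(p2 + 1, p3 - p2 - 1)
            ' The content is read by length, so it may itself contain pipes.
            part.Content = sDelta.Substring(p3 + 1, nLen)
            parts.Add(part)
            ' Skip the content and its trailing pipe.
            pos = p3 + 1 + nLen + 1
        End While
        Return parts
    End Function
End Module
```

Read this way, the response contains a `pageRedirect` record whose content, once URL-decoded (for example with `Uri.UnescapeDataString`), is `/ApplicationError.aspx?aspxerrorpath=/Default.aspx`. That suggests the server hit an error while processing the POST and is redirecting to its error page, rather than returning the dropdown data.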
-
Nov 21st, 2011, 07:51 AM
#4
Thread Starter
Fanatic Member
Re: scraping website that uses ajax
Here is the code I have so far.
Code:
Imports System.IO
Imports System.Net
Imports System.Data
Imports System.Text
Imports System.Text.RegularExpressions
Partial Class chief_ss
Inherits System.Web.UI.Page
#Region " Instance variables "
    Private m_sUserAgent As String
    Private m_hwrRequest As HttpWebRequest
    Private m_hwrResponse As HttpWebResponse
    Private m_ccCookies As CookieCollection
    Private m_sReferer As String
    Private m_sURL As String
    Private m_sLocation As String
#End Region
#Region " Properties "
    Public ReadOnly Property Cookies() As CookieCollection
        Get
            Return m_ccCookies
        End Get
    End Property
    Public Property URL() As String
        Get
            Return m_sURL
        End Get
        Set(ByVal Value As String)
            m_sURL = Value
        End Set
    End Property
    Public Property Referer() As String
        Get
            Return m_sReferer
        End Get
        Set(ByVal Value As String)
            m_sReferer = Value
        End Set
    End Property
    Public Property Location1() As String
        Get
            Return m_sLocation
        End Get
        Set(ByVal Value As String)
            m_sLocation = Value
        End Set
    End Property
#End Region
#Region " Constants "
    Private Const CS_URL_LOGIN As String = _
        "http://www.chiefmfg.com"
Private Const CS_URL_POST_GET_PROJECTORS As String = _
"ctl00%24ctl00%24ScriptManager1=ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24upMountFinderCascading%7Cctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType%241&__EVENTTARGET=ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType%241&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE={0}&__EVENTVALIDATION={1}&ctl00%24ctl00%24ctrlNavBar%24txtSearchBox=&ctl00%24ctl00%24ctrlNavBar%24hfKeywordValue=&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType=Projector&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24ddlManufacturers=&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24ddlModels=&__ASYNCPOST=true&"
#End Region
Public clsE As clsSendEmail = New clsSendEmail
    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        m_sURL = String.Empty
        m_sReferer = String.Empty
        m_sUserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
        m_hwrRequest = Nothing
        m_hwrResponse = Nothing
        m_ccCookies = New CookieCollection()
        ' m_sLocation is a String, so start it empty rather than assigning 0
        m_sLocation = String.Empty
    End Sub
    Protected Sub btnRun_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles btnRun.Click
        m_sURL = "http://www.chiefmfg.com"
        ' First request: plain GET to pick up cookies and the hidden form fields
        Dim sHTML As String = fnRequest("")
        Dim sViewState As String = fnExtract(sHTML, "__VIEWSTATE")
        Dim sEventValidation As String = fnExtract(sHTML, "__EVENTVALIDATION")
        ' The Base64 values can contain +, / and =, so URL-encode them before placing them in the form body
        sHTML = fnRequest(String.Format(CS_URL_POST_GET_PROJECTORS, Server.UrlEncode(sViewState), Server.UrlEncode(sEventValidation)))
        txtOutput.Text = sHTML
    End Sub
#Region " Public Request Functions/Subroutines "
    Public Function fnRequest(ByVal sPOSTData As String, Optional ByVal bAutoRedirect As Boolean = False) As String
        Dim uriSite As Uri
        Dim sReturn As String
        Dim srReader As StreamReader
        Dim sTemp As String
        sReturn = String.Empty
        Try
            ' Setup request
            uriSite = New Uri(m_sURL)
            m_hwrRequest = DirectCast(WebRequest.Create(uriSite), HttpWebRequest)
            m_hwrRequest.Referer = m_sReferer
            m_hwrRequest.UserAgent = m_sUserAgent
            m_hwrRequest.AllowAutoRedirect = bAutoRedirect
            m_hwrRequest.AllowWriteStreamBuffering = True
            m_hwrRequest.KeepAlive = False
            m_hwrRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, application/x-shockwave-flash, */*"
            ' Send back any cookies collected from earlier responses
            If Not (m_ccCookies Is Nothing) Then
                If m_ccCookies.Count > 0 Then
                    m_hwrRequest.CookieContainer = New CookieContainer()
                    m_hwrRequest.CookieContainer.Add(m_ccCookies)
                End If
            End If
            ' A non-empty body turns the request into a form POST
            If Not (sPOSTData Is Nothing) AndAlso sPOSTData.Length > 0 Then
                Dim stWS As Stream
                Dim aeEnc As ASCIIEncoding
                Dim baBuf As Byte()
                aeEnc = New ASCIIEncoding()
                baBuf = aeEnc.GetBytes(sPOSTData)
                m_hwrRequest.Method = "POST"
                m_hwrRequest.ContentLength = baBuf.Length
                m_hwrRequest.ContentType = "application/x-www-form-urlencoded"
                stWS = m_hwrRequest.GetRequestStream()
                stWS.Write(baBuf, 0, baBuf.Length)
                stWS.Close()
                'm_hwrRequest.AllowAutoRedirect = True
            End If
            m_hwrResponse = DirectCast(m_hwrRequest.GetResponse(), HttpWebResponse)
            srReader = New StreamReader(m_hwrResponse.GetResponseStream())
            sReturn = srReader.ReadToEnd()
            srReader.Close()
            '------------------------------------------------------------------
            ' capture the redirect location from the header
            '------------------------------------------------------------------
            Try
                Dim wbHCol As WebHeaderCollection = m_hwrResponse.Headers
                Dim i As Integer
                For i = 0 To wbHCol.Count - 1
                    Dim header As String = wbHCol.GetKey(i)
                    Dim values As String() = wbHCol.GetValues(header)
                    If values.Length > 0 AndAlso header.ToLower = "location" Then
                        ' Assign rather than append, so an earlier location is not carried over
                        Location1 = values(0)
                    End If
                Next
            Catch
                Location1 = String.Empty
            End Try
            '------------------------------------------------------------------
            ' Merge any new cookies into the shared collection
            If Not m_hwrResponse.Headers("Set-Cookie") Is Nothing Then
                Dim ccContainer As New CookieContainer()
                ccContainer.SetCookies(m_hwrResponse.ResponseUri, m_hwrResponse.Headers("Set-Cookie"))
                sTemp = m_hwrResponse.Headers("Set-Cookie").ToString
                m_ccCookies.Add(ccContainer.GetCookies(m_hwrResponse.ResponseUri))
            End If
            Me.Referer = m_hwrResponse.ResponseUri.AbsoluteUri
            'close response connection
            m_hwrResponse.Close()
        Catch ex As Exception
            lblError.Text &= "ERROR in fnRequest:" & ex.Message & vbCrLf
        End Try
        Return sReturn
    End Function
#End Region
    ' Returns the value of a hidden <input> field (e.g. __VIEWSTATE), or "" if it is not found.
    Function fnExtract(ByVal sHTML As String, ByVal sVariable As String) As String
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Multiline
        Dim sRE As String = ""
        Dim sReturnVal As String = ""
        Try
            ' [^"]* stops at the closing quote, so the match cannot run past the value
            sRE = "<input type=""hidden"" name=""" & sVariable & """ id=""" & sVariable & """ value=""(?<qval>[^""]*)"" />"
            Dim rx As Regex = New Regex(sRE, options)
            Dim mMatch As Match = rx.Match(sHTML)
            If Not mMatch.Success Then
                Return ""
            End If
            'to view the whole string: mMatch.Value
            sReturnVal = mMatch.Groups("qval").Value
        Catch ex As Exception
            lblError.Text = "ERROR in fnExtract: " & ex.ToString
        End Try
        Return sReturnVal
    End Function
End Class
-
Nov 22nd, 2011, 11:21 PM
#5
Thread Starter
Fanatic Member
Re: scraping website that uses ajax
Here is the full POST header taken from Fiddler:
Code:
POST / HTTP/1.1
Accept: */*
Accept-Language: en-us
Referer: http://www.chiefmfg.com/
x-requested-with: XMLHttpRequest
x-microsoftajax: Delta=true
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Cache-Control: no-cache
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; .NET4.0E; .NET4.0C)
Host: www.chiefmfg.com
Content-Length: 47622
Connection: Keep-Alive
Pragma: no-cache
Cookie: ChiefClientLocation=USA; ASP.NET_SessionId=shyamptgqrjapqrf5dni3g01; Coyote-2-c0a81e9b=c0a81e83:0; _mkto_trk=id:095-PKU-280&token:_mch-chiefmfg.com-1321195982578-59922; __utma=69283138.1293993264.1321195983.1321739035.1322020166.4; __utmz=69283138.1321195983.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=69283138.1.10.1322020166; __utmc=69283138
How would I set the x-requested-with and x-microsoftajax headers with my current HttpWebRequest approach?
x-requested-with: XMLHttpRequest
x-microsoftajax: Delta=true
-
Nov 24th, 2011, 01:32 AM
#6
Re: scraping website that uses ajax
Hi,
I managed to solve this in two ways. The first option is to use the WebBrowser class, which inherits the behavior of the IE browser. The drawback is that it's a bit slower than a traditional WebRequest.
The other solution is to get the correct POST data and pass it to the WebRequest object. Have you checked the POST data shown by Firebug?
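A minimal sketch of that second option, assuming the two extra headers in the Fiddler capture above are what the server uses to recognise an AJAX postback. `CreateAjaxRequest` is a hypothetical helper name; the header names and values are copied from the capture. Neither header is restricted, so they can go through the request's `Headers` collection:

```vbnet
Imports System.Net

Module AjaxHeaderSketch
    ' Builds a POST request carrying the two ASP.NET AJAX headers
    ' seen in the Fiddler capture. Hypothetical helper, for illustration only.
    Public Function CreateAjaxRequest(ByVal sUrl As String) As HttpWebRequest
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(sUrl), HttpWebRequest)
        req.Method = "POST"
        req.ContentType = "application/x-www-form-urlencoded; charset=utf-8"
        ' Custom headers are added by name and value.
        req.Headers.Add("X-Requested-With", "XMLHttpRequest")
        req.Headers.Add("X-MicrosoftAjax", "Delta=true")
        Return req
    End Function
End Module
```

In the fnRequest function posted earlier, the same two `Headers.Add` calls could be made on m_hwrRequest any time before GetRequestStream() is called. The form values themselves (such as __VIEWSTATE) also need to be URL-encoded when the POST body is built, otherwise the server may not be able to decode them.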
Best Regards,
Greg