
Thread: scraping website that uses ajax

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2000
    Location
    Minnesota
    Posts
    830

    scraping website that uses ajax

    We have a site that scrapes a third-party site to gather all available models of a product. The third-party site recently changed so that it now uses AJAX: the user selects a manufacturer, and once they have done that, a dropdown of products is loaded via AJAX.

    Until now I have been using HttpWebRequest for all requests (see below).
    Code:
    Public Function fnRequest(ByVal sPOSTData As String, Optional ByVal bAutoRedirect As Boolean = False) As String
            Dim uriSite As Uri
            Dim sReturn As String
            Dim srReader As StreamReader
            Dim sTemp As String
    
            sReturn = String.Empty
            Try
                ' Setup request
                uriSite = New Uri(m_sURL)
                m_hwrRequest = DirectCast(WebRequest.Create(uriSite), HttpWebRequest)
                m_hwrRequest.Referer = m_sReferer
                m_hwrRequest.UserAgent = m_sUserAgent
                m_hwrRequest.AllowAutoRedirect = bAutoRedirect
                m_hwrRequest.AllowWriteStreamBuffering = True
                m_hwrRequest.KeepAlive = False
                m_hwrRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, application/x-shockwave-flash, */*"
    
                If Not (m_ccCookies Is Nothing) Then
                    If m_ccCookies.Count > 0 Then
                        m_hwrRequest.CookieContainer = New CookieContainer()
                        m_hwrRequest.CookieContainer.Add(m_ccCookies)
                    End If
                End If
                
                If Not (sPOSTData Is Nothing) AndAlso sPOSTData.Length > 0 Then
                    Dim stWS As Stream
                    Dim aeEnc As ASCIIEncoding
                    Dim baBuf As Byte()
    
                    aeEnc = New ASCIIEncoding()
                    baBuf = aeEnc.GetBytes(sPOSTData)
    
                    m_hwrRequest.Method = "POST"
                    m_hwrRequest.ContentLength = baBuf.Length
                    m_hwrRequest.ContentType = "application/x-www-form-urlencoded"
    
                    stWS = m_hwrRequest.GetRequestStream()
                    stWS.Write(baBuf, 0, baBuf.Length)
                    stWS.Close()
                    'm_hwrRequest.AllowAutoRedirect = True
                End If
    
                m_hwrResponse = DirectCast(m_hwrRequest.GetResponse(), HttpWebResponse)
    
                srReader = New StreamReader(m_hwrResponse.GetResponseStream())
                sReturn = srReader.ReadToEnd()
                srReader.Close()
    
                '------------------------------------------------------------------
                ' capture the redirect location from the header
                '------------------------------------------------------------------
                Try
                    Dim wbHCol As WebHeaderCollection = m_hwrResponse.Headers
                    Dim i As Integer
                    For i = 0 To wbHCol.Count - 1
    
                        Dim header As String = wbHCol.GetKey(i)
                        Dim values As String() = wbHCol.GetValues(header)
    
                        If values.Length > 0 AndAlso header.ToLower = "location" Then
                            Location1 &= values(0)
                        End If
                    Next
                Catch
                    Location1 = String.Empty
                End Try
                '------------------------------------------------------------------
    
                If Not m_hwrResponse.Headers("Set-Cookie") Is Nothing Then
                    Dim ccContainer As New CookieContainer()
    
                    ccContainer = New CookieContainer()
                    ccContainer.SetCookies(m_hwrResponse.ResponseUri, m_hwrResponse.Headers("Set-Cookie"))
                    sTemp = m_hwrResponse.Headers("Set-Cookie").ToString
                    m_ccCookies.Add(ccContainer.GetCookies(m_hwrResponse.ResponseUri))
                End If
    
                Me.Referer = m_hwrResponse.ResponseUri.AbsoluteUri
    
                'close response connection
                m_hwrResponse.Close()
    
            Catch ex As Exception
                lblError.Text &= ex.Message.ToString
            End Try
    
            Return sReturn
        End Function
    In Fiddler, the POST now appears to be sent via AJAX. I tried sending the post data as a normal form POST, but the site did not accept it.

    Does anyone have an example of how to do this? To see what I mean, go to http://www.chiefmfg.com/, find the Mount Finder, and select Projector.

    Thanks in advance for any info.

  2. #2
    PowerPoster gep13's Avatar
    Join Date
    Nov 2004
    Location
    The Granite City
    Posts
    21,963

    Re: scraping website that uses ajax

    When you say that it didn't like it, what exactly do you mean? Can you elaborate?

    Gary

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2000
    Location
    Minnesota
    Posts
    830

    Re: scraping website that uses ajax

    This is the response I get:

    1|#||4|58|pageRedirect||%2fApplicationError.aspx%3faspxerrorpath%3d%2fDefault.aspx|

    I can PM the code if you like, or post it online somewhere.

    Thanks.
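
    For reference, the response quoted above looks like an ASP.NET AJAX partial-rendering ("delta") reply, which is a sequence of `length|type|id|content|` segments. Read that way, it contains a `pageRedirect` segment pointing at the site's error page, which suggests the server treated the request as an async postback but failed while processing it. Below is a minimal sketch of a parser for that format, assuming the segment layout just described; `DeltaPart` and `ParseDelta` are hypothetical names, not part of the original code.

    ```vb
    Imports System
    Imports System.Collections.Generic

    Module DeltaParser
        ' One segment of a delta response (hypothetical helper type).
        Public Structure DeltaPart
            Public Type As String
            Public Id As String
            Public Content As String
        End Structure

        ' Splits a delta response into its length|type|id|content| segments.
        ' Stops quietly at the first segment that does not fit the assumed layout.
        Public Function ParseDelta(ByVal body As String) As List(Of DeltaPart)
            Dim parts As New List(Of DeltaPart)()
            Dim pos As Integer = 0

            While pos < body.Length
                ' Locate the three delimiters that precede the content.
                Dim p1 As Integer = body.IndexOf("|"c, pos)
                If p1 < 0 Then Exit While
                Dim p2 As Integer = body.IndexOf("|"c, p1 + 1)
                If p2 < 0 Then Exit While
                Dim p3 As Integer = body.IndexOf("|"c, p2 + 1)
                If p3 < 0 Then Exit While

                ' The first field is the content length in characters.
                Dim length As Integer
                If Not Integer.TryParse(body.Substring(pos, p1 - pos), length) Then Exit While
                If p3 + 1 + length > body.Length Then Exit While

                Dim part As DeltaPart
                part.Type = body.Substring(p1 + 1, p2 - p1 - 1)
                part.Id = body.Substring(p2 + 1, p3 - p2 - 1)
                part.Content = body.Substring(p3 + 1, length)
                parts.Add(part)

                ' Move past the content and its trailing "|".
                pos = p3 + 1 + length + 1
            End While

            Return parts
        End Function

        Sub Main()
            Dim sample As String = "1|#||4|58|pageRedirect||%2fApplicationError.aspx%3faspxerrorpath%3d%2fDefault.aspx|"
            For Each part As DeltaPart In ParseDelta(sample)
                Console.WriteLine(part.Type & " -> " & Uri.UnescapeDataString(part.Content))
            Next
        End Sub
    End Module
    ```

    Run against the response above, this should list two segments: a `#` segment with content `4`, and a `pageRedirect` segment that decodes to `/ApplicationError.aspx?aspxerrorpath=/Default.aspx`. Checking for a `pageRedirect` segment is a cheap way to detect a failed postback before trying to pull the dropdown HTML out of the reply.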

  4. #4

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2000
    Location
    Minnesota
    Posts
    830

    Re: scraping website that uses ajax

    Here is the code I have so far.

    Code:
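    Imports System.IO
    Imports System.Net
    Imports System.Data
    Imports System.Text                     ' ASCIIEncoding
    Imports System.Text.RegularExpressions  ' Regex, Match, RegexOptions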
    
    Partial Class chief_ss
        Inherits System.Web.UI.Page
    
    #Region " Instance variables "
        Private m_sUserAgent As String
        Private m_hwrRequest As HttpWebRequest
        Private m_hwrResponse As HttpWebResponse
        Private m_ccCookies As CookieCollection
        Private m_sReferer As String
        Private m_sURL As String
        Private m_sLocation As String
    #End Region
    
    #Region " Properties "
        Public ReadOnly Property Cookies() As CookieCollection
            Get
                Return m_ccCookies
            End Get
        End Property
    
        Public Property URL() As String
            Get
                Return m_sURL
            End Get
            Set(ByVal Value As String)
                m_sURL = Value
            End Set
        End Property
    
        Public Property Referer() As String
            Get
                Return m_sReferer
            End Get
            Set(ByVal Value As String)
                m_sReferer = Value
            End Set
        End Property
    
    
        Public Property Location1() As String
            Get
                Return m_sLocation
            End Get
            Set(ByVal Value As String)
                m_sLocation = Value
            End Set
        End Property
    #End Region
    
    #Region " Constants "
        Private Const CS_URL_LOGIN As String = _
        "http://www.chiefmfg.com"
    
        Private Const CS_URL_POST_GET_PROJECTORS As String = _
            "ctl00%24ctl00%24ScriptManager1=ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24upMountFinderCascading%7Cctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType%241&__EVENTTARGET=ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType%241&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE={0}&__EVENTVALIDATION={1}&ctl00%24ctl00%24ctrlNavBar%24txtSearchBox=&ctl00%24ctl00%24ctrlNavBar%24hfKeywordValue=&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24radProductType=Projector&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24ddlManufacturers=&ctl00%24ctl00%24Content%24Content%24ctrlMountFinder%24ddlModels=&__ASYNCPOST=true&"
    #End Region
        
        Public clsE As clsSendEmail = New clsSendEmail
    
        Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
            m_sURL = String.Empty
            m_sReferer = String.Empty
            m_sUserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
            m_hwrRequest = Nothing
            m_hwrResponse = Nothing
            m_ccCookies = New CookieCollection()
            m_sLocation = String.Empty
    
        End Sub
    
        Protected Sub btnRun_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles btnRun.Click
            m_sURL = "http://www.chiefmfg.com"
            Dim sHTML As String = fnRequest("")
            Dim sViewState As String = fnExtract(sHTML, "__VIEWSTATE")
            Dim sEventValidation As String = fnExtract(sHTML, "__EVENTVALIDATION")
    
            sHTML = fnRequest(String.Format(CS_URL_POST_GET_PROJECTORS, sViewState, sEventValidation))
    
            txtOutput.Text = sHTML
        End Sub
    
    #Region " Public Request Functions/Subroutines "
        Public Function fnRequest(ByVal sPOSTData As String, Optional ByVal bAutoRedirect As Boolean = False) As String
            Dim uriSite As Uri
            Dim sReturn As String
            Dim srReader As StreamReader
            Dim sTemp As String
    
            sReturn = String.Empty
            Try
                ' Setup request
                uriSite = New Uri(m_sURL)
                m_hwrRequest = DirectCast(WebRequest.Create(uriSite), HttpWebRequest)
                m_hwrRequest.Referer = m_sReferer
                m_hwrRequest.UserAgent = m_sUserAgent
                m_hwrRequest.AllowAutoRedirect = bAutoRedirect
                m_hwrRequest.AllowWriteStreamBuffering = True
                m_hwrRequest.KeepAlive = False
                m_hwrRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, application/x-shockwave-flash, */*"
    
                If Not (m_ccCookies Is Nothing) Then
                    If m_ccCookies.Count > 0 Then
                        m_hwrRequest.CookieContainer = New CookieContainer()
                        m_hwrRequest.CookieContainer.Add(m_ccCookies)
                    End If
                End If
                
                If Not (sPOSTData Is Nothing) AndAlso sPOSTData.Length > 0 Then
                    Dim stWS As Stream
                    Dim aeEnc As ASCIIEncoding
                    Dim baBuf As Byte()
    
                    aeEnc = New ASCIIEncoding()
                    baBuf = aeEnc.GetBytes(sPOSTData)
    
                    m_hwrRequest.Method = "POST"
                    m_hwrRequest.ContentLength = baBuf.Length
                    m_hwrRequest.ContentType = "application/x-www-form-urlencoded"
    
                    stWS = m_hwrRequest.GetRequestStream()
                    stWS.Write(baBuf, 0, baBuf.Length)
                    stWS.Close()
                    'm_hwrRequest.AllowAutoRedirect = True
                End If
    
                m_hwrResponse = DirectCast(m_hwrRequest.GetResponse(), HttpWebResponse)
    
                srReader = New StreamReader(m_hwrResponse.GetResponseStream())
                sReturn = srReader.ReadToEnd()
                srReader.Close()
    
                '------------------------------------------------------------------
                ' capture the redirect location from the header
                '------------------------------------------------------------------
                Try
                    Dim wbHCol As WebHeaderCollection = m_hwrResponse.Headers
                    Dim i As Integer
                    For i = 0 To wbHCol.Count - 1
    
                        Dim header As String = wbHCol.GetKey(i)
                        Dim values As String() = wbHCol.GetValues(header)
    
                        If values.Length > 0 AndAlso header.ToLower = "location" Then
                            Location1 &= values(0)
                        End If
                    Next
                Catch
                    Location1 = String.Empty
                End Try
                '------------------------------------------------------------------
    
                If Not m_hwrResponse.Headers("Set-Cookie") Is Nothing Then
                    Dim ccContainer As New CookieContainer()
    
                    ccContainer = New CookieContainer()
                    ccContainer.SetCookies(m_hwrResponse.ResponseUri, m_hwrResponse.Headers("Set-Cookie"))
                    sTemp = m_hwrResponse.Headers("Set-Cookie").ToString
                    m_ccCookies.Add(ccContainer.GetCookies(m_hwrResponse.ResponseUri))
                End If
    
                Me.Referer = m_hwrResponse.ResponseUri.AbsoluteUri
    
                'close response connection
                m_hwrResponse.Close()
    
            Catch ex As Exception
                lblError.Text &= "ERROR in fnRequest:" & ex.Message.ToString & vbCrLf
            End Try
    
            Return sReturn
        End Function
    #End Region
    
        Function fnExtract(ByVal sHTML As String, ByVal sVariable As String) As String
            Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Multiline
            Dim sRE As String = ""
            Dim sReturnVal As String = ""
    
            Try
                ' [^"]* stops at the closing quote instead of greedily matching to the end of the line
                sRE = "<input type=""hidden"" name=""" & sVariable & """ id=""" & sVariable & """ value=""(?<qval>[^""]*)"" />"
    
                Dim rx As Regex = New Regex(sRE, options)
                Dim mMatch As Match = rx.Match(sHTML)
                If Not mMatch.Success Then
                    Return ""
                End If
    
                'to view the whole string: mMatch.Value
                sReturnVal = mMatch.Groups("qval").Value
    
            Catch ex As Exception
                lblError.Text = "ERROR in fnExtract: " & ex.ToString
            End Try
    
            Return sReturnVal
    
        End Function
    
    
    End Class

  5. #5

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2000
    Location
    Minnesota
    Posts
    830

    Re: scraping website that uses ajax

    Here is the full POST header taken from Fiddler:
    Code:
    POST / HTTP/1.1
    Accept: */*
    Accept-Language: en-us
    Referer: http://www.chiefmfg.com/
    x-requested-with: XMLHttpRequest
    x-microsoftajax: Delta=true
    Content-Type: application/x-www-form-urlencoded; charset=utf-8
    Cache-Control: no-cache
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; .NET4.0E; .NET4.0C)
    Host: www.chiefmfg.com
    Content-Length: 47622
    Connection: Keep-Alive
    Pragma: no-cache
    Cookie: ChiefClientLocation=USA; ASP.NET_SessionId=shyamptgqrjapqrf5dni3g01; Coyote-2-c0a81e9b=c0a81e83:0; _mkto_trk=id:095-PKU-280&token:_mch-chiefmfg.com-1321195982578-59922; __utma=69283138.1293993264.1321195983.1321739035.1322020166.4; __utmz=69283138.1321195983.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=69283138.1.10.1322020166; __utmc=69283138
    How would I set the x-requested-with and x-microsoftajax headers with the HttpWebRequest approach I am using?
    x-requested-with: XMLHttpRequest
    x-microsoftajax: Delta=true

  6. #6
    Frenzied Member KGComputers's Avatar
    Join Date
    Dec 2005
    Location
    Cebu, PH
    Posts
    2,024

    Re: scraping website that uses ajax

    Hi,

    I managed to solve this in two ways. The first option is to use the WebBrowser class, which inherits the behavior of the IE browser. The drawback is that it's a bit slower than a traditional WebRequest.

    The other solution is to get the correct post data and pass it to the WebRequest object. Have you checked the post data shown by Firebug?
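
    As a rough illustration of the second option, here is a minimal sketch of a POST shaped like the Fiddler capture earlier in the thread. It makes two assumptions that are not confirmed anywhere in this thread: that the server wants the `X-Requested-With` and `X-MicrosoftAjax` headers, and that the `__VIEWSTATE` and `__EVENTVALIDATION` values must be URL-encoded before they are placed in the body (Base64 text can contain "+", "/" and "=", which have special meaning in form data). `PostAjax` and its parameters are hypothetical names, and `HttpUtility` requires a reference to System.Web.

    ```vb
    Imports System.IO
    Imports System.Net
    Imports System.Text
    Imports System.Web

    Public Module AjaxPostSketch
        ' Hypothetical helper: sends an UpdatePanel-style async postback and returns the raw response body.
        ' template is expected to contain {0} for __VIEWSTATE and {1} for __EVENTVALIDATION.
        Public Function PostAjax(ByVal url As String, ByVal template As String, _
                                 ByVal viewState As String, ByVal eventValidation As String, _
                                 ByVal cookies As CookieContainer) As String
            ' Assumption: the hidden-field values must be URL-encoded before going into the form body.
            Dim postData As String = String.Format(template, _
                HttpUtility.UrlEncode(viewState), HttpUtility.UrlEncode(eventValidation))
            Dim body As Byte() = Encoding.UTF8.GetBytes(postData)

            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            request.Method = "POST"
            request.Accept = "*/*"
            request.Referer = url
            request.ContentType = "application/x-www-form-urlencoded; charset=utf-8"
            request.CookieContainer = cookies ' reuse the container that received the session cookie

            ' Custom headers are added through the Headers collection.
            request.Headers.Add("X-Requested-With", "XMLHttpRequest")
            request.Headers.Add("X-MicrosoftAjax", "Delta=true")

            request.ContentLength = body.Length
            Using requestStream As Stream = request.GetRequestStream()
                requestStream.Write(body, 0, body.Length)
            End Using

            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(response.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Function
    End Module
    ```

    With the code posted earlier, the call would look something like `PostAjax("http://www.chiefmfg.com", CS_URL_POST_GET_PROJECTORS, sViewState, sEventValidation, container)`, where `container` is a single CookieContainer that was also attached to the initial GET. Passing one container to every request lets .NET carry ASP.NET_SessionId forward without parsing Set-Cookie by hand.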

    Best Regards,


    Greg
