Thread: [02/03] Download and Parse HTML

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2006
    Location
    Florida, USA
    Posts
    565

    [02/03] Download and Parse HTML

    I am attempting to follow this tutorial, but it seems to be for ASP instead of VB .NET 2003.
    http://www.pscode.com/vb/scripts/Sho...=339&lngWId=10

    I am having a lot of trouble getting it to work. Does anyone know of a good tutorial, or a way to make this code work? I have attached my work so far. Thanks.
    I use VB .NET 2022. Currently developing StudyX educational software, PlazSales POS system and Yargis a space ship shooter game.

  2. #2
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,290

    Re: [02/03] Download and Parse HTML

    Very few people would download a zip file just to have a look at your code... You'll probably get more response if you just post your code directly on the thread instead of having it as a download.

  3. #3

    Thread Starter
    Fanatic Member

    Re: [02/03] Download and Parse HTML

    Quote Originally Posted by stanav
    Very few people would download a zip file just to have a look at your code... You'll probably get more response if you just post your code directly on the thread instead of having it as a download.

    Sure, here is my code:
    Code:
    Imports System.IO
    Imports System.Net
    Imports System
    Imports System.Text
    Imports System.Text.RegularExpressions
    Public Class Form1
    
        Inherits System.Windows.Forms.Form
    
    #Region " Windows Form Designer generated code "
    
        Public Sub New()
            MyBase.New()
    
            'This call is required by the Windows Form Designer.
            InitializeComponent()
    
            'Add any initialization after the InitializeComponent() call
    
        End Sub
    
        'Form overrides dispose to clean up the component list.
        Protected Overloads Overrides Sub Dispose(ByVal disposing As Boolean)
            If disposing Then
                If Not (components Is Nothing) Then
                    components.Dispose()
                End If
            End If
            MyBase.Dispose(disposing)
        End Sub
    
        'Required by the Windows Form Designer
        Private components As System.ComponentModel.IContainer
    
        'NOTE: The following procedure is required by the Windows Form Designer
        'It can be modified using the Windows Form Designer.  
        'Do not modify it using the code editor.
        Friend WithEvents txtHTMLContent As System.Windows.Forms.TextBox
        <System.Diagnostics.DebuggerStepThrough()> Private Sub InitializeComponent()
            Me.txtHTMLContent = New System.Windows.Forms.TextBox
            Me.SuspendLayout()
            '
            'txtHTMLContent
            '
            Me.txtHTMLContent.Location = New System.Drawing.Point(112, 64)
            Me.txtHTMLContent.Name = "txtHTMLContent"
            Me.txtHTMLContent.TabIndex = 0
            Me.txtHTMLContent.Text = "TextBox1"
            '
            'Form1
            '
            Me.AutoScaleBaseSize = New System.Drawing.Size(5, 13)
            Me.ClientSize = New System.Drawing.Size(292, 266)
            Me.Controls.Add(Me.txtHTMLContent)
            Me.Name = "Form1"
            Me.Text = "Form1"
            Me.ResumeLayout(False)
    
        End Sub
    
    #End Region
        '////////////////////////////////////////////////////////////////////////////// 
        Private objParser As HTMLContentParser
        Private Sub cmdGetHTML_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdGetHTML.ServerClick
            Dim sURL As String = "http://" & txtURL.Value
            txtHTMLContent.EnableViewState = False
            txtHTMLContent.Value = objParser.Return_HTMLContent(sURL)
        End Sub
        Private Sub cmdParse_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdParse.ServerClick
            Call PopulatetblParsedContent()
        End Sub
        Private Sub PopulatetblParsedContent() 'Populate Links Table
            Dim sURL As String = "http://" & txtURL.Value
            Dim myAnchor As HtmlAnchor
            Dim intRows As Integer
            Dim intRowCount As Integer
            Dim objRow As HtmlTableRow
            Dim objCell As HtmlTableCell
            Dim sLinks As String
            Dim sImage As String
            Dim lstLinks As ArrayList = objParser.ParseHTMLLinks(txtHTMLContent.Value, sURL)
            Dim lstImages As ArrayList = objParser.ParseHTMLImages(txtHTMLContent.Value, sURL)
            tblParsedContent = Me.tblParsedContent
            tblParsedContent.EnableViewState = False
            For Each sLinks In lstLinks
                objRow = New HtmlTableRow
                objCell = New HtmlTableCell
                myAnchor = New HtmlAnchor
                myAnchor.Target = "_blank"
                myAnchor.InnerText = "Link: " & sLinks.ToString
                myAnchor.HRef = sLinks.ToString
                objCell.NoWrap = False
                objCell.Controls.Add(myAnchor)
                objRow.Cells.Add(objCell)
                tblParsedContent.Rows.Add(objRow)
            Next
            For Each sImage In lstImages
                objRow = New HtmlTableRow
                objCell = New HtmlTableCell
                myAnchor = New HtmlAnchor
                myAnchor.Target = "_blank"
                myAnchor.InnerText = "Img: " & sImage.ToString
                myAnchor.HRef = sImage.ToString
                objCell.NoWrap = False
                objCell.Controls.Add(myAnchor)
                objRow.Cells.Add(objCell)
                tblParsedContent.Rows.Add(objRow)
            Next
        End Sub
        '/////////////////////////////////////////////////////////////////////////////////
        Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    
        End Sub
    End Class
    '''''''''''''''''''''''''''''''''''''''''''''''''''''
    Imports System.IO
    Imports System.Net
    Imports System
    Imports System.Text
    Imports System.Text.RegularExpressions
    Public Class HTMLContentParser
      Function Return_HTMLContent(ByVal sURL As String)
        Dim sStream As Stream
        Dim URLReq As HttpWebRequest
        Dim URLRes As HttpWebResponse
    
        Try
    
          URLReq = WebRequest.Create(sURL)
          URLRes = URLReq.GetResponse()
    
          sStream = URLRes.GetResponseStream()
          Return New StreamReader(sStream).ReadToEnd()
    
        Catch ex As Exception
    
          Return ex.Message
    
        End Try
      End Function
    
      Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
    
        rRegEx = New Regex("a.*href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)
    
        mMatch = rRegEx.Match(sHTMLContent)
    
        While mMatch.Success
          Dim sMatch As String
          sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
          aMatch.Add(sMatch)
          mMatch = mMatch.NextMatch()
        End While
    
        Return aMatch
    
      End Function
    
    Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
    
        rRegEx = New Regex("img.*src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)
    
        mMatch = rRegEx.Match(sHTMLContent)
    
        While mMatch.Success
          Dim sMatch As String
          sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
          aMatch.Add(sMatch)
          mMatch = mMatch.NextMatch()
        End While
    
        Return aMatch
    
      End Function
    
      Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String)
    
        'Find out if the sURL has a "/" after the Domain Name
        'If not, give a "/" at the end
        'First, check out for any slash after the
        'Double Dashes of the http://
        'If there is NO slash, then end the sURL string with a SLASH
        If InStr(8, sURL, "/") = 0 Then
          sURL += "/"
        End If
    
        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
          If Mid(sURL, iCount, 1) = "/" Then
            sURL = Left(sURL, iCount)
            Exit For
          End If
        Next
        'Filter out the ">" from the Left
        For iCount = 1 To sInput.Length
          If Mid(sInput, iCount, 4) = "&gt;" Then
            sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
            Exit For
          End If
        Next
    
        'Filter out unnecessary Characters
        sInput = sInput.Replace("&lt;", Chr(39))
        sInput = sInput.Replace("&gt;", Chr(39))
        sInput = sInput.Replace("&quot;", "")
        sInput = sInput.Replace("'", "")
    
        If (sInput.IndexOf("http://") < 0) Then
          If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
            Return sURL & "/" & sInput
          Else
            If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
              Return sURL.Substring(0, sURL.Length - 1) + sInput
            Else
              Return sURL + sInput
            End If
          End If
        Else
          Return sInput
        End If
      End Function
    End Class

  4. #4
    PowerPoster stanav's Avatar

    Re: [02/03] Download and Parse HTML

    What exactly are you trying to parse out from a web page's source code?

  5. #5

    Thread Starter
    Fanatic Member

    Re: [02/03] Download and Parse HTML

    Well, lots of things. I would like to start with links and work my way up to images and portions of the text. I have many uses for this and am very excited to make it work, but I am pretty new to VB .NET and am having a lot of trouble finding help compared to regular VB.
    Quote Originally Posted by stanav
    What exactly are you trying to parse out from a web page's source code?

  6. #6
    PowerPoster stanav's Avatar

    Re: [02/03] Download and Parse HTML

    Quote Originally Posted by rex64
    Well, lots of things. I would like to start with links and work my way up to images and portions of the text. I have many uses for this and am very excited to make it work, but I am pretty new to VB .NET and am having a lot of trouble finding help compared to regular VB.
    If so, then you should try using a WebBrowser object to navigate to your desired URL and handle the WebBrowser.DocumentCompleted event. In this event handler, you read WB.Document and work with it to get what you need. My suggestion is to go to the MSDN library and read the documentation for the WebBrowser class, especially its members.
    This is an overly simplified example of using a WebBrowser to parse all the links and append them to a RichTextBox. Add a TextBox, a Button and a RichTextBox to your form, then paste the code below into your code page:
    vb Code:
    Private WithEvents WB As New WebBrowser

    Private Sub WB_DocumentCompleted(ByVal sender As Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WB.DocumentCompleted
        Dim lnkCollection As HtmlElementCollection = WB.Document.Links
        For Each lnk As HtmlElement In lnkCollection
            Me.RichTextBox1.AppendText(lnk.GetAttribute("href") & ControlChars.NewLine)
        Next
    End Sub

    Private Sub Button1_Click_1(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        WB.Navigate(Me.TextBox1.Text)
    End Sub
    Last edited by stanav; Sep 11th, 2007 at 08:49 AM.

  7. #7

    Thread Starter
    Fanatic Member

    Re: [02/03] Download and Parse HTML

    That looks very cool. I did a bunch of research and found out that I need to add the "Microsoft Web Browser" from the COM objects. I did this; however, I am still getting this error:
    C:\Reports\Jeff\VB\HTMLreasearcher\Form1.vb(88): Type 'WebBrowser' is not defined.


    Here is my code:
    Code:
    Private WithEvents WB As New WebBrowser

  8. #8

    Thread Starter
    Fanatic Member

    Re: [02/03] Download and Parse HTML

    (Please see my last post; I still need this answered.)

    I found this code, but I am not sure how to use it. I tried putting it under the imports and I tried putting it inside the class. Also, it keeps telling me I need another '>'.
    http://msdn2.microsoft.com/en-us/lib...er(VS.80).aspx

    Also, should I just upgrade to 2005, or whatever the newest version is? It seems to support more, and more people use it.
    Code:
    <ComVisibleAttribute(True)> _
    <ClassInterfaceAttribute(ClassInterfaceType.AutoDispatch)> _
    Public Class WebBrowser
        Inherits WebBrowserBase
    Last edited by rex64; Sep 12th, 2007 at 09:50 PM.

  9. #9
    PowerPoster stanav's Avatar

    Re: [02/03] Download and Parse HTML

    If you can upgrade to 2005, I highly recommend it. In the meantime, you can download VB.Net 2005 Express for free and try it out.
    That aside, for .NET 1.1, you can use a WebClient to download the web page, then load it into an IHTMLDocument2 object. Once this is done, the rest is the same.
    Here's what to do:
    1. Add a reference to "Microsoft HTML Object Library" (it's in the COM tab)
    2. Add a textbox, a richtextbox and a button to your form
    3. Paste the following code
    Code:
    Imports mshtml
    Imports System.Net
    
    Private WithEvents WC As New WebClient
        Private htmlDoc As IHTMLDocument2 = New mshtml.HTMLDocument
    
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            'First use the web client to download data and convert it to html string
            Dim url As String = Me.TextBox1.Text
            Dim B() As Byte = WC.DownloadData(url)
            Dim html As String = System.Text.Encoding.Default.GetString(B)
    
            'Next we use the COM IHTMLDocument2 interface to load the html string
            htmlDoc.clear()
            htmlDoc.write(html)
            htmlDoc.close()
    
            'Make sure that the document is fully loaded
            While (htmlDoc.readyState <> "complete")
                System.Threading.Thread.Sleep(1000)
                Application.DoEvents()
            End While
    
            'Now we're ready to parse the document to get the links
            Me.RichTextBox1.Clear()
            Dim lnkCollection As IHTMLElementCollection = htmlDoc.links
            For Each lnk As IHTMLElement In lnkCollection
                Me.RichTextBox1.AppendText(CStr(lnk.getAttribute("href")) & ControlChars.NewLine)
            Next
    
        End Sub

  10. #10

    Thread Starter
    Fanatic Member
    Join Date
    Dec 2006
    Location
    Florida, USA
    Posts
    565

    Re: [02/03] Download and Parse HTML

    Ok, I got that working. I am using Visual Studio 2005 now. My problem now is that I would also like to be able to post forms and get data back. For example, I would like to see the results from this form:
    http://www.studyx.com/contactform.html

  11. #11
    PowerPoster stanav's Avatar

    Re: [02/03] Download and Parse HTML

    Take a look at this thread by Kleinma in the Code Bank
    http://www.vbforums.com/showthread.php?t=416275
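
    To go with that thread, here is a minimal sketch (not Kleinma's code) of one common way to post form fields and read the reply from VB 2005, using WebClient.UploadValues. The field names "name", "email" and "message" and the http://www.example.com/submit address are made up for illustration; the real names would come from the name attributes of the inputs on the page, and the real address from the form's action attribute. It also assumes the server answers with UTF-8 text.

```vb
Imports System
Imports System.Collections.Specialized
Imports System.Net
Imports System.Text

Module FormPostSketch

    ' Builds the name/value pairs a browser would send for the form.
    ' The field names here are hypothetical placeholders.
    Function BuildFormFields(ByVal name As String, ByVal email As String, ByVal message As String) As NameValueCollection
        Dim fields As New NameValueCollection()
        fields.Add("name", name)
        fields.Add("email", email)
        fields.Add("message", message)
        Return fields
    End Function

    ' Posts the fields to actionUrl and returns the response body as text.
    Function PostForm(ByVal actionUrl As String, ByVal fields As NameValueCollection) As String
        Using client As New WebClient()
            ' UploadValues URL-encodes the pairs and sends them as a form post.
            Dim responseBytes() As Byte = client.UploadValues(actionUrl, "POST", fields)
            ' Assumption: the response is UTF-8 encoded.
            Return Encoding.UTF8.GetString(responseBytes)
        End Using
    End Function

    Sub Main()
        Dim fields As NameValueCollection = BuildFormFields("Test User", "test@example.com", "Hello")
        Try
            ' Hypothetical address; replace it with the form's real action URL.
            Console.WriteLine(PostForm("http://www.example.com/submit", fields))
        Catch ex As WebException
            Console.WriteLine("Request failed: " & ex.Message)
        End Try
    End Sub

End Module
```

    The string that PostForm returns is the HTML of the results page, so it can be written into the same IHTMLDocument2 object used earlier in this thread and parsed the same way.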
