[RESOLVED] Using AxWebBrowser and DOM to extract data
I am attempting to extract some data from various tables in a web page. I've decided to use the WebBrowser control and the DOM to get to the data. I'm struggling to find the correct data types for accessing the results of the DOM methods.
The following code snippet is intended to simply print the number of tables on a web page, and then the size of each table. The first print works, but none of the following ones do (the exception gets thrown).
I believe the problem is that objTable is simply an object, not a collection of objects.
VB Code:
Private Sub DumpTables()
Dim x As Integer
Dim i, j, k As Integer
Dim s As String
Dim objTable As Object
Dim ObjDoc As Object
' wbrBrowser is the WebBrowser Control on the main form.
ObjDoc = wbrBrowser.Document
Try
objTable = wbrBrowser.Document.getElementsByTagName("TABLE")
x = objTable.length
Debug.WriteLine("Number of Tables " & x)
For i = 0 To x - 1
j = objTable(0).length
Debug.WriteLine("Table: all: " & j _
& " Rows: " & objTable(i).rows.length _
& " Cols: " & objTable(i).cols)
Next
Catch ex As Exception
MessageBox.Show("It is likely that your submit does not exist or has no name attribute. Check the HTML source.", "No name att. or no submit available", MessageBoxButtons.OK, MessageBoxIcon.Exclamation)
End Try
End Sub
Here is a link to the MSDN document that describes the DOM.
It seems like I'm very close to understanding this, but can't figure out what types to use so VB can interpret the results from the DOM calls.
Any pointers or ideas would be appreciated.
Re: Using AxWebBrowser and DOM to extract data
Ok, so I wasn't even close with the attempt above, but based on kleinma's excellent post here I was able to make some progress.
The following code iterates over every cell in every table on a web page. More importantly (for me anyway), it shows how to directly access any of those cells using a single VB.NET statement. Note that the VB source code depends on the contents of the web page, but that is OK for the application I'm working on.
I'm posting this here as it might help others, and others may have suggestions for better ways to do this.
Here is the code:
VB Code:
' Requires a project reference to Microsoft.mshtml and "Imports mshtml" at the top of the file.
Private Sub DumpTables()
Dim t, c As Integer ' Used to count tables and cells.
Dim IWebDocument As HTMLDocument
Dim IWebElements As IHTMLElementCollection
Dim ITableElement As HTMLTable
Dim ICellElement As HTMLTableCell
ListBox1.Items.Clear()
'GET DOCUMENT
IWebDocument = CType(wb.Document, HTMLDocument)
'GET TABLES
IWebElements = IWebDocument.getElementsByTagName("TABLE")
ListBox1.Items.Add("Length = " & IWebElements.length)
' Iterate through all the Tables on the web page.
t = 0
For Each ITableElement In IWebElements
ListBox1.Items.Add("Table = " & t & " Rows - " & ITableElement.rows.length & " Cells - " & ITableElement.cells.length)
' Iterate through all the cells within a table.
c = 0
For Each ICellElement In ITableElement.cells
ListBox1.Items.Add("Cell (" & t & "," & c & ") -->" & ICellElement.innerText & "<--")
c = c + 1
Next
t = t + 1
Next
' Test directly accessing a few of the table elements. These are hard coded and depend on the page you
' are parsing.
Try
' Extract the innertext from the second cell of the first table.
ListBox1.Items.Add("IWebElements(0,1) = " & IWebElements.item(0).cells.item(1).innerText)
' Extract the innertext from the seventh cell of the fourth table.
ListBox1.Items.Add("IWebElements(3,6) = " & IWebElements.item(3).cells.item(6).innerText)
Catch ex As Exception
MessageBox.Show("One of the items you tried to reference is invalid. Check the indexes vs. the HTML source.", "Parsing Error", MessageBoxButtons.OK, MessageBoxIcon.Exclamation)
End Try
End Sub
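One possible variation, as a sketch only: the direct-access lines above work through late binding (item() returns Object), so a bad index only shows up as a run-time exception. The helper below does the same lookup with explicit casts and range checks, and returns Nothing instead of throwing. GetCellText and TableHelpers are names I made up for illustration, not part of mshtml. It assumes the same Microsoft.mshtml reference and that the page has finished loading.

```vbnet
Imports mshtml

Module TableHelpers
    ' Hypothetical helper (not part of mshtml): returns the innerText of
    ' cell number cellIndex (counted across the whole table, the same way
    ' the cells collection is indexed above) in table number tableIndex.
    ' Returns Nothing if either index is out of range.
    Public Function GetCellText(ByVal doc As HTMLDocument, _
                                ByVal tableIndex As Integer, _
                                ByVal cellIndex As Integer) As String
        If doc Is Nothing Then Return Nothing

        Dim tables As IHTMLElementCollection = doc.getElementsByTagName("TABLE")
        If tableIndex < 0 OrElse tableIndex >= tables.length Then Return Nothing

        ' item() returns Object, so cast to the table type explicitly.
        Dim table As HTMLTable = TryCast(tables.item(tableIndex), HTMLTable)
        If table Is Nothing Then Return Nothing

        Dim cells As IHTMLElementCollection = table.cells
        If cellIndex < 0 OrElse cellIndex >= cells.length Then Return Nothing

        Dim cell As HTMLTableCell = TryCast(cells.item(cellIndex), HTMLTableCell)
        If cell Is Nothing Then Return Nothing

        Return cell.innerText
    End Function
End Module
```

With that in place, the second cell of the first table could be read with something like GetCellText(CType(wb.Document, HTMLDocument), 0, 1), and the Try/Catch around the hard-coded lookups becomes a simple check for Nothing. The indexes still depend on the page being parsed, same as before.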