Instr vs DOM vs MSHTML vs CSS vs ???
I'm wondering what, if anything, I'm missing.
VB does a great job of finding a character or string with Instr.
Since I'm not well versed in Web page authoring (HTML), I was wondering if there is a better way, other than Instr, to extract (parse) web data. I've read most of this forum's posts on web parsing as well as Google search results.
Using the attached HTML file (saved in text format) as an example, I want to extract the table near the bottom of the file. A search for "yfnc_tablehead1" will get to the beginning of the table.
QUESTIONS:
1) The web page appears to list a CSS format. Is this of any value with MSHTML?
2) Is there any way to get a list of all tag codes and deal with this as XML?
3) What about an HTML page that is not XML compliant -- no ending tags?
Thanks
David
Re: Instr vs DOM vs MSHTML vs CSS vs ???
IMHO VB is very versatile when it comes to strings, but unfortunately very slow as a consequence. I think there are some HTML controls you can reference that will give you some editing options. However, if speed is of the essence I would try dealing with your Web page as a byte array of ANSI chars. Quick to parse and manipulate, but more code to write.
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Treating and parsing the data as a byte array is much faster. If you're just parsing small parts of HTML files then the native string manipulation functions (InStr, Mid$, Left$, etc.) should work without a real noticeable difference in speed.
If you are processing large files or a large number of files over and over then that's when you would notice a significant speed difference by using byte arrays.
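To make the byte-array idea concrete, here is a minimal sketch of a byte search in VB6. FindBytes and Demo are hypothetical names made up for illustration (they are not the functions from the linked thread), and the sketch assumes both arrays are zero-based and non-empty, as returned by StrConv(..., vbFromUnicode).

```vb
' Hypothetical helper: find the first occurrence of the bytes in needle
' inside haystack. Returns the zero-based index of the match, or -1.
Public Function FindBytes(haystack() As Byte, needle() As Byte, _
                          Optional ByVal startAt As Long = 0) As Long
    Dim i As Long, j As Long
    Dim lastStart As Long

    FindBytes = -1
    lastStart = UBound(haystack) - UBound(needle)

    For i = startAt To lastStart
        ' Compare byte by byte; leave the inner loop on the first mismatch.
        For j = 0 To UBound(needle)
            If haystack(i + j) <> needle(j) Then Exit For
        Next j
        ' If the inner loop ran to completion, j is one past the last index.
        If j > UBound(needle) Then
            FindBytes = i
            Exit Function
        End If
    Next i
End Function

Public Sub Demo()
    Dim page() As Byte, key() As Byte
    ' StrConv with vbFromUnicode yields one ANSI byte per character.
    page = StrConv("<td class=""yfnc_tablehead1"">", vbFromUnicode)
    key = StrConv("yfnc_tablehead1", vbFromUnicode)
    Debug.Print FindBytes(page, key)   ' prints 11
End Sub
```

This is a plain linear scan; whether it beats InStr in practice depends on file size and how often it is called, as noted above.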
I think there is a library you can reference in VB (MSXML or something) to parse HTML/XML documents, but I'm not sure, and it's probably pretty slow.
I wrote some string functions for byte arrays (InStr, Left$, Mid$, etc.) if you want to use them. Then you can treat the byte array as you would a string.
http://www.vbforums.com/showthread.php?t=438760
Maybe they will be of some help, or maybe not.:p
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Thanks for the responses, guys.
I had figured that since any file is just a series of bytes, using Instr or a byte search would be fastest. Since I'm not very familiar with all the Web options (HTML, CSS, XML as related to the web, etc.), I didn't know whether there was some quick means to load an array with tags and associated text, and then search the array(s) accordingly.
So far I don't see the applicability of the above (HTML, CSS, XML, etc.) for parsing a web page, only for creating one. Any discussion or comments regarding the pros and cons would be appreciated.
Re: Instr vs DOM vs MSHTML vs CSS vs ???
I think trying to use a char array is a case of premature and unwarranted optimisation. You'd have to write tons of code to get it to handle all types of web pages. Why bother when the MSHTML library already does all of that for you?
Using the object model approach, parsing is just one step; all of the other manipulation is done on in-memory objects. If you'd use strings, every manipulation would consist of re-parsing from the beginning again. It would be horridly slow in comparison.
Also, the W3C DOM is the universally accepted standard for interfacing with HTML/XML documents; using it will make your algorithm portable to different platforms.
As for the others:
- CSS has nothing to do with document structure;
- There is also the MSXML library, but you're not dealing with XML.
Re: Instr vs DOM vs MSHTML vs CSS vs ???
i usually use split for parsing data; it's much easier, and usually faster than a million instr calls.
example:
x = "now is the time for all good men"
pos1 = InStr(x, "the")
pos2 = InStr(x, " for")
y = Mid$(x, pos1 + 3, pos2 - pos1 - 3)
--try:
y = Split(Split(x, "the")(1), " for")(0)
both give " time". remember that the string given as the 2nd argument of split gets discarded. this is great for when you know the data around the data you want, like in an html page.
to get all tables in a page:
x = Split(html, "<table")
there are many ways to collect HTML elements via DOM.
you can search by an id, get all tags of a certain type, and there's even something called the treewalker object which lets you treat the html elements as an array.
instr is ok for small jobs with one or two calls, but if you want to move to the next level i would go DOM.
if you don't want to learn all of that (understandable) you should at least do a google search for "regex vb6", and look into the vbscript regular expression library.
think of it as instr on steroids.
not only can you look for a certain string, you can look for unknown strings as well.
a very basic form of regular expressions is already in vb6, via the LIKE operator. it can, however, only match and return yes/no, whereas a regular expression can return the position of a hit, the text of a hit, or a boolean as LIKE does.
it's a little more code upfront, but if you are doing even moderate string parsing, you will find it saves a lot of coding time.
one thing i should fairly note: depending on how you use regex, it can be slower than optimized traditional string handling, by 1.5-3X. if you need absolute processing speed, replace vb6's string functions with the ones from http://www.xbeat.net/vbspeed/
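As a rough illustration of the Like-versus-regex difference described above, here is a small sketch. CellTexts and HasCell are hypothetical names, the pattern is only an assumption about what simple table cells look like, and the RegExp object is created late-bound through its "VBScript.RegExp" ProgID so no project reference is needed. It will not cope with nested tables or badly malformed markup.

```vb
' Hypothetical helper: return the raw contents of every <td>...</td> pair.
Public Function CellTexts(ByVal html As String) As Collection
    Dim re As Object, m As Object

    Set CellTexts = New Collection
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "<td[^>]*>([\s\S]*?)</td>"   ' lazy match up to the next </td>
    re.IgnoreCase = True
    re.Global = True                          ' find every hit, not just the first

    For Each m In re.Execute(html)
        ' m.FirstIndex holds the zero-based position of the hit,
        ' m.Value the whole hit, m.SubMatches(0) the captured cell body.
        CellTexts.Add m.SubMatches(0)
    Next m
End Function

' Like can only answer yes/no.
Public Function HasCell(ByVal html As String) As Boolean
    HasCell = (html Like "*<td*>*")
End Function
```

For example, CellTexts("<tr><td>1</td><td class=""x"">two</td></tr>") should return a collection holding "1" and "two".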
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Thanks for responding Penegate and rnd-me.
Penegate:
I'm not very familiar with the W3C standard, so I'm trying to decide whether it's worth my effort for the few times I need to extract data from a web page.
The first issue as I see it with W3C DOM and an unknown web page is:
1) Determining whether the page can be easily handled. This would be whether there are beginning and ending pairs <, </ for every element on the page. Without this my guess is handling the page would be problematic.
2) Getting a treeview of the all elements and their nesting to determine where the information of interest resides.
3) Once 1 and 2 are determined, then using MSHTML to get the data.
So, the question is how many manhours are required to do 1 and 2 considering that the time expended writing MSHTML code would be approximately equal to writing straight VB code (Instr or byte array).
rnd_me
Like MSHTML, regular expressions is another option. Thanks for pointing it out.
=====================
What I'm trying to accomplish with this post is determine:
1) what options are available for getting web page data from an unknown page and
2) what's the best option for extracting this data
IMHO knowing one thing thoroughly is better than knowing several half-as?.
David
Re: Instr vs DOM vs MSHTML vs CSS vs ???
i thought of an as-yet undiscussed possibility.
Since you are talking about raking methods, i assume you just need plain text. do you know about the below function?
It gives you just the text content of the webpage. i use it often when i don't need structured data, but rather am looking for just a small piece of info like a departure time or a temperature. Try running it on your pages, and see if there is static text alongside the dynamic text so that you can pick through much less data. Sometimes it makes things night and day easier, sometimes it makes them harder by destroying the hierarchy; it really depends on your needs.
Let me know if it helps, or if not, explain more about what you are trying to fetch.
Function CleanHTML(TheHTML As String) As String
    ' Requires a project reference to the Microsoft HTML Object Library.
    Dim HtmlDOC As New HTMLDocument
    HtmlDOC.body.innerHTML = TheHTML
    ' innerText returns the text with all tags stripped.
    CleanHTML = HtmlDOC.body.innerText
    Set HtmlDOC = Nothing
End Function
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Quote:
Originally Posted by dw85745
Penegate:
I'm not very familiar with the W3C standard, so I'm trying to decide whether it's worth my effort for the few times I need to extract data from a web page.
The first issue as I see it with W3C DOM and an unknown web page is:
1) Determining whether the page can be easily handled. This would be whether there are beginning and ending pairs <, </ for every element on the page. Without this my guess is handling the page would be problematic.
Correct. To make things worse, many many pages on the WWW are malformed and invalid HTML. You'd have to handle all of these cases manually if you were to create your own parser. MSHTML (the same parse engine IE uses) is possibly the most forgiving HTML parser that exists. If the page shows up in IE, you'll end up with a result. Since most web pages are designed with IE as their baseline standard, you should have no problems in this regard.
Quote:
Originally Posted by dw85745
2) Getting a treeview of the all elements and their nesting to determine where the information of interest resides.
Piece of cake: simple recursive function to walk the DOM tree.
Quote:
Originally Posted by dw85745
3) Once 1 and 2 are determined, then using MSHTML to get the data.
You can skip 2, if you want. The DOM offers several methods for querying the tree. You'd probably use getElementsByTagName.
Quote:
Originally Posted by dw85745
So, the question is how many manhours are required to do 1 and 2 considering that the time expended writing MSHTML code would be approximately equal to writing straight VB code (Instr or byte array).
Not even close. You can load up a DOM structure and grab a list of elements with a few lines of code.
Re: Instr vs DOM vs MSHTML vs CSS vs ???
As an example, for the document attached in the first post, this'll grab the table element containing the first cell in the document with class "yfnc_tablehead1":
Code:
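' Not tested. Requires a project reference to the Microsoft HTML Object Library.
' Read the saved page into a string, then let MSHTML parse it.
Dim htmlCode As String, f As Integer
f = FreeFile
Open "filename.html" For Binary Access Read As #f
htmlCode = Space$(LOF(f))
Get #f, , htmlCode
Close #f

Dim doc As HTMLDocument
Set doc = New HTMLDocument
doc.body.innerHTML = htmlCode

Dim cells As IHTMLElementCollection
Set cells = doc.getElementsByTagName("td")
Dim cell As IHTMLElement
Dim table As IHTMLElement
For Each cell In cells
    If cell.className = "yfnc_tablehead1" Then
        ' Walk up from the cell until the enclosing table is reached
        ' (MSHTML may insert a tbody between tr and table).
        Set table = cell.parentElement
        Do Until table Is Nothing
            If UCase$(table.tagName) = "TABLE" Then Exit Do
            Set table = table.parentElement
        Loop
        ' You can now manipulate the table as you wish.
        Exit For
    End If
Next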
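Once a table element has been located, pulling the cell text out might look like the sketch below. DumpTable is a hypothetical helper name, and tbl is assumed to be a TABLE element obtained from MSHTML; it is declared As Object so the rows and cells collections are resolved late-bound.

```vb
' Hypothetical helper: print each table row as tab-separated cell text.
Public Sub DumpTable(ByVal tbl As Object)
    Dim r As Object, c As Object
    Dim rowText As String

    For Each r In tbl.rows            ' every TR in the table
        rowText = ""
        For Each c In r.cells         ' every TD/TH in the row
            rowText = rowText & Trim$(c.innerText) & vbTab
        Next c
        Debug.Print rowText
    Next r
End Sub
```

Instead of printing, the same loops could fill an array or a Collection if the values need further processing.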
Re: Instr vs DOM vs MSHTML vs CSS vs ???
rnd_me
Familiar with InnerHTML. This is part of my point. That it is easy to just grab webpage text and then do a byte or string search.
Penegate.
With your example the tag of interest is known. Which is ultimately what you want when you need to extract specific data.
But stepping back a level:
Quote:
You can load up a DOM structure and grab a list of elements with a few lines of code.
How do you get the DOM structure from an unknown web page?
My guess is the
Quote:
simple recursive function to walk the DOM tree
Do you have an example you can post?
Thanks
David
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Quote:
How do you get the DOM structure from an unknown web page?
Either download the page and assign its text to a new document or use the createDocumentFromUrl method which I cannot find an example of.
Code:
' htmlCode is a String that already holds the downloaded page source.
Dim doc As HTMLDocument
Set doc = New HTMLDocument
doc.body.innerHTML = htmlCode
' You should now be able to query the DOM tree using methods like getElementsByTagName:
Dim tables As IHTMLElementCollection
Set tables = doc.getElementsByTagName("table")
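For the "download the page" half of that answer, one possible sketch uses the MSXML2.XMLHTTP object, created late-bound so no extra reference is needed. FetchPage is a hypothetical name, the request is synchronous, and error handling (timeouts, redirects, character-set issues) is deliberately left out.

```vb
' Hypothetical helper: return the source of a page, or "" if the
' server did not answer with HTTP status 200.
Public Function FetchPage(ByVal url As String) As String
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False       ' False = wait for the response
    http.send
    If http.Status = 200 Then
        FetchPage = http.responseText
    End If
End Function
```

The returned string can then be assigned to the document as in the code above, e.g. htmlCode = FetchPage(someUrl), where someUrl is whatever address you are scraping.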
Walking the tree is something like this:
Code:
Sub PrintElement(element As IHTMLElement, ByVal indent_level As Long)
    ' Print this element's tag name, indented by its depth in the tree.
    Debug.Print Space$(indent_level) & element.tagName
    ' children holds only element nodes; it is empty for leaf elements.
    Dim child As IHTMLElement
    For Each child In element.children
        PrintElement child, indent_level + 1
    Next
End Sub
' Usage:
PrintElement doc.body, 0
But bear in mind you don't need to do that unless you want a graphical representation of it, as it's all in memory anyway.
(I don't have VB6 any more, so I can't test this. Maybe someone who actually has it would be able to correct my code.)
Re: Instr vs DOM vs MSHTML vs CSS vs ???
Penegate:
Thanks for the input. So you've given up VB6 to go with VB.NET?
If you went with VB.NET what program did you use to port your Apps over?
Re: Instr vs DOM vs MSHTML vs CSS vs ???
I don't really use any one particular language predominantly any more. Work gets me to do various things; if it's .NET I use C#. My VB6 apps were hobbyish and I didn't bother porting any of them seriously.
I did try the Visual Studio conversion wizard once, but it did a terrible job. I really don't recommend it. If you're going to port something to .NET, it's best to restructure it to take advantage of the greater OO capabilities, and leverage more of the framework. Otherwise, you'd just end up stuck with something that was trying hard to be VB6 and not succeeding at being anything.