Thread: [VB6] HTML Parsing? Tidy it up first

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    [VB6] HTML Parsing? Tidy it up first

    Introduction

    Parsing HTML is a common requirement of a lot of projects. Sadly, HTML can have lots of quirks. One of the things people would like to do is just turn the job over to some existing XML parser. Ah, if only life were so simple!

    So because most HTML is so ratty people go to all sorts of lengths to try to write "clean up" code to straighten it out. This seems clever when they first get it working, only to find some time later the HTML page format has changed slightly and their cleanup code doesn't work anymore. Can you say broken program?

    There isn't any cure for this, since Web-scraping is a sucker's game at best. HTML is just too fluid.

    You can improve your program's lifetime though by eschewing hackish, hand-rolled cleanup code and using the Gold Standard for HTML regularization: HTML Tidy.


    Rant

    This is sort of laughable in a way.

    Like so much open source software, HTML Tidy is a horrid trashbag of code. I don't even want to look at the source itself, comments about it all over the Web are strong enough warning! And the TidyLib API is a typical C hacker horror in itself. And don't get me started on its "documentation," much of which consists of links back into C header files full of useless comments.

    Now I'm not doubting that HTML Tidy works most of the time. After all, the task it takes on is horrendously complex (which is why cleanup hacks break so much). Bash on a rat's nest of code long enough and hard enough with persistence enough... and eventually you'll get it to work.

    Question is, who knows how it works anymore or has any confidence in maintaining it? Found on the Web:
    I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge amount of bugs and fixing them directly in the source is my worst nightmare. Tidy source code is an unreadable abomination. Thousand+ line functions, poor variable naming, spaghetti code etc. It's truly horrible.
    The API and documentation make anything Microsoft ever produced look like a thousand miles of smooth pavement.

    So why hasn't a Microsoft, Sun, IBM, et al. taken on the task to provide a better (or at least cleaned up) alternative? You've got me there. Surely the guts inside IE do a better job and the code has to be at least 1000% cleaner and the internal API that much better. It couldn't possibly be as bad, let alone worse.

    Who can say? It is what it is, and people who deal with HTML all over rely on HTML Tidy quite often. I'm not offering to do a better job!


    HTML Tidy/TidyLib

    So that rant aside, let's take a look at this.

    The HTML Tidy Library Project has the documentation as well as links to the standalone EXE version of the product, the TidyLib itself, and many ports and wrappers.

    We could probably Shell the EXE version, but there is a link leading to Charles Reitzel's TidyATL.dll COM Wrapper as well.

    Dave Raggett was the pioneer behind HTML Tidy and TidyLib themselves.

    Note that Gold Standard doesn't necessarily mean best. It is just the standard that most other HTML cleaners compare themselves to.


    TidyATLTest

    That's what I explore in this VB6 testbed Project:
    A testbed for use of the HTML Tidy TidyLib COM wrapper "TidyATL.dll" by Charles Reitzel.

    This program is for exploring the use of TidyATL to clean HTML up so it can be processed as XML.

    In real programs you would fetch the HTML, Tidy it into XML (handling any errors and warnings inline), and then, if the result is OK or "good enough," parse it as XML using an XML DOM or SAX technique.
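    The last step of that pipeline, parsing the Tidy output as XML, might look roughly like this. It is only a sketch: it assumes MSXML 6.0 is installed, that you already have the tidied text in a String, and LoadTidiedXml is a made-up helper name, not part of TidyATL or of the attached project.

```vb
Option Explicit

' Hypothetical helper: load Tidy's XML output into an MSXML DOM.
' Late-bound, so no project reference is needed; assumes MSXML 6.0 is installed.
Private Function LoadTidiedXml(ByVal XmlText As String) As Object
    Dim Doc As Object

    Set Doc = CreateObject("MSXML2.DOMDocument.6.0")
    Doc.async = False                     ' Parse synchronously.
    Doc.validateOnParse = False           ' Well-formedness is all we ask for.
    Doc.resolveExternals = False          ' Never fetch external DTDs.
    Doc.setProperty "ProhibitDTD", False  ' Tolerate a DOCTYPE if one is present.

    If Doc.loadXML(XmlText) Then
        Set LoadTidiedXml = Doc
    Else
        ' Report where the parser gave up so the Tidy options can be adjusted.
        Err.Raise vbObjectError + 1000, "LoadTidiedXml", _
                  "XML parse failed at line " & Doc.parseError.Line & _
                  ", position " & Doc.parseError.linepos & ": " & _
                  Doc.parseError.reason
    End If
End Function
```

    From there the usual DOM calls apply, for example Doc.selectNodes("//a") to pull out the links. If the load fails, the line and position in the error are usually a good hint about which Tidy option needs changing.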

    Note that options can be set via the DefaultConfig.txt file that we load here as well as by setting them via calls (also shown here for the TidyXMLOut option, also known as output-xml in config files).


    In many cases you might have to set other options to get a given HTML page to "Tidy" without errors. For example some pages now use tags that are not built into HTML Tidy as valid by default. Even Google does this sort of thing. Through the options it is possible to add block or inline tags as "valid."
    Because HTML is such a fluid thing anyway, the definition of "cleaned up" HTML is open to a lot of interpretation. HTML Tidy offers a ton of option settings to help you define what "clean" is as well as defining what your "cleaned output" ought to look like.
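
    As a rough illustration, a config file along these lines is the sort of thing that could be loaded. The option names are standard HTML Tidy options, but the tag lists are only examples I picked for illustration and this is not the DefaultConfig.txt from the attached project.

```text
// Illustrative Tidy config fragment, not the attached DefaultConfig.txt
// Emit well-formed XML instead of HTML
output-xml: yes
// Write entities in numeric form so an XML parser doesn't choke on named ones
numeric-entities: yes
// Example tags to accept as block-level; adjust to whatever the page uses
new-blocklevel-tags: article, section, nav
// Example tag to accept as inline
new-inline-tags: time
```

    Which tags you need to declare depends entirely on the pages you are cleaning, so expect to revisit these lists when a page changes.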

    Here I use the output-xml option to try to get a result that an XML parser can easily digest.

    Before you can compile and test this you'll need to obtain and install TidyATL.dll though!


    TidyATL.dll

    Charles did a fine job getting this together as far as it goes.

    However, the API reflects most of the worst of the TidyLib C API. It has a set of Class members that is a bit of a nightmare to work with on the one hand, yet leaves out many of the detailed calls of the C API on the other hand.

    Method parameters are often typed Long where they should be an Enum type, or even Boolean by rights.

    It takes a hatful of method calls to load, clean, and extract one HTML page.

    Most of this is a reflection of the C API and the effort it takes to make a normal COM DLL with a more conventional object model. Again, I'm not ready to take the task on myself, and TidyATL.dll still saves a ton of clunkiness over calling a Windows build of TidyLib itself.

    One thing Charles didn't bother with is a TidyATL.DEP file, a heinous oversight to me based on how much trouble I see people have with application packaging as it is. I created one and have included it in the attached archive.

    To deploy TidyATL.dll to a development machine:
    • Download Charles' TidyATL package and unzip it.
    • Create a folder C:\Program Files\Common Files\TidyATL (or your machine's equivalent) with elevated rights.
    • Copy TidyATL.dll, TidyATL.dep, and perhaps Charles' readme.txt file into that folder with elevated rights.
    • Start an elevated command prompt.
    • CD to the new TidyATL folder.
    • Run regsvr32 TidyATL.dll


    So Why Bother With HTML Tidy?

    Probably the same reason everyone else uses it: there are few decent alternatives!

    I hope this gets you started. I'm sure others can refine the process, perhaps using the one event TidyATL.TidyDocument exposes (OnMessage) to capture error messages if you need them instead of getting the error file output as I do here.

    Charles Reitzel's archive download also has a VB6 sample program which may help you figure out how to apply his library.
    Last edited by dilettante; Mar 12th, 2011 at 12:28 PM.

  2. #2
    Fanatic Member
    Join Date
    Mar 2009
    Posts
    804

    Re: [VB6] HTML Parsing? Tidy it up first

    Many thanks for sharing. The example provided with the COM download was poorly written, but yours is much better and easy to follow.

    Sort of makes one want to take a shot at writing one. Worlds of string manipulation!!

  3. #3

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: [VB6] HTML Parsing? Tidy it up first

    My sample was simplified by only trying to do basically one thing, and by using the output report instead of the error/warning event the TidyDocument object raises.


    Feel free to create one of your own. There are very few tools for this kind of thing that are VB-friendly.

    There are tons of HTML cleaners out there, though most are done in slow and sloppy scripting languages. From what I've seen written about them, they get points for being less messy code than HTML Tidy; however, a lot of them don't do as good a job.

    Lots of them only accept input that is carefully prepared according to a set of constraints. To me that defeats the purpose: I'm starting with messy HTML gleaned from the wild that I need to parse and extract data from.

  4. #4
    New Member
    Join Date
    Apr 2011
    Posts
    13

    Re: [VB6] HTML Parsing? Tidy it up first

    Wow! That's great, mate!
