Extracting text from a web page <Resolved>-VBForums

Thread: Extracting text from a web page <Resolved>

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    May 2002
    Posts
    434

    Extracting text from a web page <Resolved>

    Hello

    I have an app that uses text on a web page to guide the populating of a database. The company just changed the format of the page, so the old code no longer works. The old format had the information in a more compact form that kept the basic layout intact and was easily parsed. The new format uses a lot more HTML to make it more "robust", but in the process it has picked up a lot more clutter, and the basic layout is gone when I extract the source.

    So my question is this: before I write a bunch of code to resurrect the layout and remove all the clutter, is there a way in VB to extract text from a web page with the layout intact, or at least without all the excess HTML?

    What I use now is:
    Code:
    Dim hElm As IHTMLElement
    Dim intStart As Integer, intEnd As Integer, TheLength As Integer
    
    ' Grab the root <html> element of the page currently loaded
    Set hElm = brwWebBrowser.Document.All.tags("html").Item(0)
    ' innerHTML returns the markup, tags included
    txtWebPagePull = hElm.innerHTML
    This code sits behind a WebBrowser control: the user navigates to the appropriate page and then runs it to extract the text.

    Thanks

    David
    Last edited by David RH; Mar 3rd, 2008 at 12:59 AM. Reason: resolved

  2. #2
    Addicted Member
    Join Date
    Jan 2006
    Posts
    248

    Re: Extracting text from a web page

    WebBrowser1.Document.body.innerHTML
    Will give you the page's HTML.

    WebBrowser1.Document.body.innerText
    Will give the text only.

    I don't know what you mean by "layout". The html controls the layout. So basically you can't just grab text and expect it to remain in table layout and such since the HTML controls the tables.

    If it is in frames then you will need to access the frame you want the text from.
    WebBrowser1.Document.frames(FRAME_INDEX).Document.body.innerText
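
    The two cases above (plain page versus frames) could be combined roughly as follows. This is only a sketch: it assumes a form with a WebBrowser control named WebBrowser1 and a text box named txtWebPagePull (the name from the first post), and cmdExtract is a hypothetical button. It also assumes that frames served from another domain may refuse access, which is why errors are skipped inside the loop.

    ```vb
    ' Sketch only: cmdExtract is a hypothetical button on the form.
    Private Sub cmdExtract_Click()
        Dim doc As Object
        Dim i As Long
        Dim strText As String

        Set doc = WebBrowser1.Document
        If doc Is Nothing Then Exit Sub      ' nothing has been loaded yet

        If doc.frames.length = 0 Then
            ' Ordinary page: take the text of the body, tags stripped
            strText = doc.body.innerText
        Else
            ' Framed page: collect the text of every frame in turn
            On Error Resume Next             ' skip frames that deny access
            For i = 0 To doc.frames.length - 1
                strText = strText & doc.frames(i).Document.body.innerText & vbCrLf
            Next i
            On Error GoTo 0
        End If

        txtWebPagePull.Text = strText
    End Sub
    ```

    Run it only after the page has finished loading, for example once the control's DocumentComplete event has fired.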

    Mike
    Last edited by MikeJoel; Mar 2nd, 2008 at 07:03 PM.

  3. #3
    PowerPoster
    Join Date
    Nov 2002
    Location
    Manila
    Posts
    7,629

    Re: Extracting text from a web page

    Unless the HTML elements have names or ids, you'll have to know the DOM structure and navigate the hierarchy accordingly.
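
    As a rough illustration of both routes, the sketch below first tries an id and then falls back to position in the DOM. The id "schedule" and the function name GetScheduleText are hypothetical, and the fallback assumes the data sits in the first table on the page, which may not hold for the real site.

    ```vb
    ' Sketch only: "schedule" is a made-up id, not taken from the real page.
    Private Function GetScheduleText() As String
        Dim doc As Object, elm As Object
        Dim row As Object, cell As Object
        Dim strOut As String

        Set doc = WebBrowser1.Document

        ' Route 1: the element has an id, so ask for it directly
        Set elm = doc.getElementById("schedule")

        ' Route 2: no id, so rely on position (here, the first table)
        If elm Is Nothing Then
            If doc.getElementsByTagName("table").length = 0 Then Exit Function
            Set elm = doc.getElementsByTagName("table").Item(0)
        End If

        ' Only tables expose rows and cells
        If UCase$(elm.tagName) <> "TABLE" Then
            GetScheduleText = elm.innerText
            Exit Function
        End If

        ' Walk rows and cells, tab-separating the columns
        For Each row In elm.rows
            For Each cell In row.cells
                strOut = strOut & cell.innerText & vbTab
            Next cell
            strOut = strOut & vbCrLf
        Next row

        GetScheduleText = strOut
    End Function
    ```

    Walking the cells one at a time keeps each column separate, which is easier to parse than the text of the whole page.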

  4. #4

    Thread Starter
    Hyperactive Member
    Join Date
    May 2002
    Posts
    434

    Re: Extracting text from a web page

    Quote Originally Posted by MikeJoel
    WebBrowser1.Document.body.innerText
    Will give the text only.

    I don't know what you mean by "layout". The html controls the layout. So basically you can't just grab text and expect it to remain in table layout and such since the HTML controls the tables.

    Mike
    Thanks Mike

    That was what I was looking for.

    By layout I mean the spacing and alignment of the characters. True, some of the layout is a little off, but I can work with that. It's odd that the layout is preserved perfectly on most lines and not so well on others.
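
    One way to cope with spacing that varies from line to line is to normalize the text before parsing it. The helper below is a sketch, not the code the app actually uses: NormalizeLines is a hypothetical name, and it assumes the fields on each line can be told apart once runs of spaces are collapsed to one.

    ```vb
    ' Sketch only: splits extracted text into lines and tidies the spacing.
    Private Function NormalizeLines(ByVal strText As String) As String()
        Dim arrLines() As String
        Dim i As Long

        ' Non-breaking spaces may show up in extracted text; treat them as spaces
        strText = Replace(strText, Chr$(160), " ")

        ' Treat CRLF, lone CR and lone LF all as one kind of line break
        strText = Replace(strText, vbCrLf, vbLf)
        strText = Replace(strText, vbCr, vbLf)
        arrLines = Split(strText, vbLf)

        For i = LBound(arrLines) To UBound(arrLines)
            ' Collapse repeated spaces so fields are one space apart
            Do While InStr(arrLines(i), "  ") > 0
                arrLines(i) = Replace(arrLines(i), "  ", " ")
            Loop
            arrLines(i) = Trim$(arrLines(i))
        Next i

        NormalizeLines = arrLines
    End Function
    ```

    It could be called as arrLines = NormalizeLines(WebBrowser1.Document.body.innerText), after which each line can be split on single spaces.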

    These are schedules for the month and the app extracts the schedule and helps me and my fellow employees track our pay. Not that we don't trust the man but they just don't seem to have a good grasp of basic math skills when it comes to calculating our pay. Really odd how the mistakes only seem to go one way.

    David
