-
Mar 2nd, 2008, 07:22 PM
#1
Thread Starter
Hyperactive Member
Extracting text from a web page <Resolved>
Hello
I have an app that uses text on a web page to guide the populating of a database. The company just changed the format of the page, so the old code no longer works. The old format presented the information compactly, kept the basic layout intact, and was easy to parse. The new format uses a lot more HTML to make it more "robust", but in the process it has picked up a lot more clutter, and the basic layout is gone when I extract the source.
So my question is this: before I write a bunch of code to resurrect the layout and remove all the clutter, is there a way in VB to extract the text from a web page with the layout intact, or at least without all the excess HTML?
What I use now is:
Code:
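' Grab the root <html> element of the page currently loaded in the WebBrowser control
Dim hElm As IHTMLElement
Dim intStart As Integer, intEnd As Integer, TheLength As Integer
Set hElm = brwWebBrowser.Document.All.tags("html").Item(0)
' innerHTML returns the raw markup, tags included
txtWebPagePull = hElm.innerHTML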
This code sits behind a WebBrowser control: the user navigates to the appropriate page and then runs the code to extract the text.
Thanks
David
Last edited by David RH; Mar 3rd, 2008 at 01:59 AM.
Reason: resolved
-
Mar 2nd, 2008, 07:55 PM
#2
Addicted Member
Re: Extracting text from a web page
WebBrowser1.Document.body.innerHTML
Will give you the page's html.
WebBrowser1.Document.body.innerText
Will give the text only.
I don't know what you mean by "layout". The html controls the layout. So basically you can't just grab text and expect it to remain in table layout and such since the HTML controls the tables.
If it is in frames then you will need to access the frame you want the text from.
WebBrowser1.Document.frames(FRAME_INDEX).Document.body.innerText
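If you don't know ahead of time which frame holds the text, something like the following might help. It is only a rough sketch: GetAllFrameText is a made-up helper name, it assumes the page has finished loading, and it simply skips any frame it cannot read (frames from another domain usually raise "Access is denied").

```vb
' Sketch only: GetAllFrameText is a hypothetical helper, not part of the WebBrowser control.
Private Function GetAllFrameText(ByVal doc As Object) As String
    Dim i As Long
    Dim frameCount As Long
    Dim result As String

    ' Frames served from another domain raise "Access is denied"; skip them
    On Error Resume Next

    ' Text of the top-level document (may be empty on a pure frameset page)
    result = doc.body.innerText

    ' Append the text of each frame the script is allowed to read
    frameCount = doc.frames.length
    For i = 0 To frameCount - 1
        result = result & vbCrLf & doc.frames(i).Document.body.innerText
    Next i

    GetAllFrameText = result
End Function
```

You would call it with the control's document, for example txtWebPagePull = GetAllFrameText(WebBrowser1.Document).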
Mike
Last edited by MikeJoel; Mar 2nd, 2008 at 08:03 PM.
-
Mar 3rd, 2008, 01:58 AM
#3
Thread Starter
Hyperactive Member
Re: Extracting text from a web page
Originally Posted by MikeJoel
WebBrowser1.Document.body.innerText
Will give the text only.
I don't know what you mean by "layout". The html controls the layout. So basically you can't just grab text and expect it to remain in table layout and such since the HTML controls the tables.
Mike
Thanks Mike
That was what I was looking for.
By layout I mean the spacing and alignment of the characters. True, some of the layout is a little off, but I can work with that. It's odd that the layout is preserved perfectly on most lines and not so well on others.
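For anyone else parsing the result line by line, here is a minimal sketch of the idea, not my actual code. It assumes the control is named brwWebBrowser as in my first post and that the page has finished loading; DumpPageLines is just an illustrative name. Line endings are normalized first because I am not certain every page comes back with the same line break characters.

```vb
' Sketch only: assumes a WebBrowser control named brwWebBrowser with a fully loaded page.
Private Sub DumpPageLines()
    Dim pageText As String
    Dim pageLines() As String
    Dim i As Long

    ' innerText drops the tags but keeps line breaks; normalize them to vbLf first
    pageText = Replace(brwWebBrowser.Document.body.innerText, vbCrLf, vbLf)
    pageLines = Split(pageText, vbLf)

    For i = LBound(pageLines) To UBound(pageLines)
        ' Skip blank lines; RTrim$ keeps leading spaces so column alignment survives
        If Len(Trim$(pageLines(i))) > 0 Then
            Debug.Print RTrim$(pageLines(i))
        End If
    Next i
End Sub
```

Each non-blank line goes to the Immediate window; in a real app the Debug.Print would be replaced by whatever parsing each line needs.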
These are schedules for the month and the app extracts the schedule and helps me and my fellow employees track our pay. Not that we don't trust the man but they just don't seem to have a good grasp of basic math skills when it comes to calculating our pay. Really odd how the mistakes only seem to go one way.
David
-
Mar 2nd, 2008, 07:59 PM
#4
Re: Extracting text from a web page
Unless the HTML elements have names or ids, you'll have to know the DOM structure and navigate the hierarchy accordingly.
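When the element does have an id, the lookup can be direct. A rough sketch, assuming the data sits in a table whose id is "schedule" (a made-up id, so check the real page source) and a control named brwWebBrowser; DumpTableById is an illustrative name only:

```vb
' Sketch only: "schedule" is a hypothetical id; substitute the id used by the real page.
Private Sub DumpTableById()
    Dim tbl As Object
    Dim r As Long, c As Long
    Dim rowText As String

    Set tbl = brwWebBrowser.Document.getElementById("schedule")
    If tbl Is Nothing Then Exit Sub   ' no element with that id on this page

    ' Walk the table row by row, cell by cell, joining cell text with tabs
    For r = 0 To tbl.rows.length - 1
        rowText = ""
        For c = 0 To tbl.rows.Item(r).cells.length - 1
            rowText = rowText & tbl.rows.Item(r).cells.Item(c).innerText & vbTab
        Next c
        Debug.Print rowText
    Next r
End Sub
```

Reading cell by cell keeps the columns separate, which a flat text dump of the whole page cannot guarantee. Without an id you would have to reach the same table by walking down from the body through its parent elements.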