Getting Values from HTML page
Hello!
I'm writing a program that should analyze data from HTML pages and put it into one XML file. The pages are similar and display information about companies. I tried working with the InnerText of the body (htmldocument.body.innertext), but I need something better: I can't locate the right information in the InnerText string because the content moves. The position of the data to extract differs from page to page as a consequence of changes in the page header (for example, the company name is not the same length as on the first page... you know what I mean).
I have an idea now: to solve it with OuterText by counting HTML tags. But is there some way to search the htmldocument.body.children collection, using a child's value as the search key, and get back some identifying string?
There must be some way to search an HTML document in a structured way...
What I need is an identifier for the required child that points to it, as if they were all in the same layer, children of the same parent node.
Re: Getting Values from HTML page
I'm not sure about searching the object model, as I haven't had much experience with it, but if you can read the HTML file into a string, you can use Regex to match the tags you want...
Below is an example that matches everything between <p> tags and returns just the text inside the <p>...</p> tags.
VB Code:
Dim mystring As String = "<p>this is a paragraph</p> <p> and this is another paragraph </p>"
Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<p>).*?(?=</p>)")
Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(mystring)
For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
    MsgBox(FoundMatch.Value)
Next
Regex can be pretty powerful if you know the right syntax. If you have a more specific example, and wish to pursue this option, we could give you pointers on how to modify the above to try to get what you wish...
Re: Getting Values from HTML page
This seems like a very good solution!!!
Thanks a lot!
I'll use this method, but if someone knows a better one, please post it here; people searching later may find it useful.
Re: Getting Values from HTML page
It seems that my values are in <td></td>, so I tried:
VB Code:
Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<td>).*?(?=</td>)")
But there are some tags in the HTML like:
<td> Telefon</td><td colSpan="2">022/532-05</td>
</tr>
So I need a way to match this text regardless of the colSpan attribute or any other HTML tag attribute.
From this HTML sample, I want to get only " Telefon" and "022/532-05".
I tried to figure out this regex syntax on my own, but it is quite hard!
Can you please give me a snippet for this?
Thanks!
Re: Getting Values from HTML page
VB Code:
Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<td[^/]+>).*?(?=</td>)")
This one works, but it only matches tags that have a space or attributes;
plain tags are not matched.
...
Re: Getting Values from HTML page
I was having the same issue until you posted that example :) Put "[^/]+" inside parentheses, with a "?" after it, like below:
VB Code:
Dim mystring As String = "<p colspan='2'>this is a paragraph</p> <p> and this is another paragraph </p>"
Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<p([^/]+)?>).*?(?=</p>)")
Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(mystring)
For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
    MsgBox(FoundMatch.Value)
Next
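One caveat with "[^/]+": it stops at any slash, so a tag whose attribute value contains one (href="a/b", for example) would not be matched. A variation that might hold up better, sketched here against the <td> sample from earlier in the thread, uses "[^>]*" for the attributes and a capture group instead of the lookarounds. The names TdExtractor and ExtractCellTexts are made up for illustration, and this assumes the cells are not nested inside each other; any markup inside a cell is returned as-is.
VB Code:
```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TdExtractor
    ' Hypothetical helper: returns the raw contents of every <td>...</td> pair.
    Function ExtractCellTexts(html As String) As List(Of String)
        ' <td\b    the tag name, followed by a word boundary so <tdx> is not matched
        ' [^>]*    any attributes (or none), up to the closing ">"
        ' (.*?)    the cell contents, captured lazily as group 1
        Dim pattern As String = "<td\b[^>]*>(.*?)</td>"
        ' IgnoreCase also accepts <TD>; Singleline lets "." cross line breaks.
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, options)
            results.Add(m.Groups(1).Value)
        Next
        Return results
    End Function

    Sub Main()
        Dim html As String = "<td> Telefon</td><td colSpan=""2"">022/532-05</td>"
        ' Brackets make the leading space in the first cell visible.
        For Each cell As String In ExtractCellTexts(html)
            Console.WriteLine("[" & cell & "]")
        Next
    End Sub
End Module
```
With that sample this should print "[ Telefon]" and "[022/532-05]". Call Trim() on each value if the leading space is not wanted.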