Thread: Getting Values from HTML page

  1. #1

    Thread Starter
    Member hardyvoje's Avatar
    Join Date
    Feb 2006
    Location
    Serbia
    Posts
    38

    Getting Values from HTML page

    Hello!

    I'm writing a program that should analyze data from HTML pages and put it into a single XML file. The pages are similar and display information about companies. I tried working with the InnerText of the body (htmldocument.body.innertext), but I need something better: I can't locate the right information in the InnerText string because the content shifts. The position of the data to extract differs on every page as a consequence of changes in the page header; for example, the company name doesn't have the same length on every page.

    My current idea is to use OuterText and count HTML tags, but is there a way to search the htmldocument.body.children collection using a child's value as the search key and get back some identifying string?

    There must be some way to search an HTML document in a structured way...

    What I need is a way to identify the required child, treating the candidates as if they were all in the same layer, children of the same parent node.
    Free tutorials, web templates, images, 3D models and other design&developing resources at: http://www.omnetwork.net | Open Source Gaming Portal: www.osgamer.org

  2. #2
    PowerPoster
    Join Date
    Aug 2005
    Location
    College Station, TX
    Posts
    4,521

    Re: Getting Values from HTML page

    I'm not sure about searching the object model, as I haven't had much experience with it, but if you can read the HTML file into a string, you can use Regex to match the tags you want...

    Below is an example that matches everything between <p> tags and returns just the text inside the <p>...</p> tags:
    VB Code:
    Dim mystring As String = "<p>this is a paragraph</p> <p> and this is another paragraph </p>"
    Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<p>).*?(?=</p>)")
    Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(mystring)
    For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
        MsgBox(FoundMatch.Value)
    Next
    Regex can be pretty powerful if you know the right syntax. If you have a more specific example, and wish to pursue this option, we could give you pointers on how to modify the above to try to get what you wish...
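
    To unpack the pattern above: (?<=<p>) is a lookbehind, .*? is a lazy "match as little as possible", and (?=</p>) is a lookahead, so the tags themselves never end up in the match. The sketch below is a small console program (the module and variable names are made up for illustration) that contrasts the lazy quantifier with a greedy one on a shortened sample string:

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookaroundDemo
    Sub Main()
        ' Shortened, made-up sample input
        Dim html As String = "<p>first</p> <p>second</p>"

        ' (?<=<p>)  lookbehind: the match must be preceded by "<p>"
        ' .*?       lazy: take as few characters as possible
        ' (?=</p>)  lookahead: the match must be followed by "</p>"
        Dim lazyPattern As New Regex("(?<=<p>).*?(?=</p>)")
        For Each m As Match In lazyPattern.Matches(html)
            Console.WriteLine("[" & m.Value & "]")
        Next
        ' prints [first] and then [second]

        ' Greedy .* runs on to the LAST </p>, swallowing the tags in between
        Dim greedyPattern As New Regex("(?<=<p>).*(?=</p>)")
        Console.WriteLine("[" & greedyPattern.Match(html).Value & "]")
        ' prints [first</p> <p>second]
    End Sub
End Module
```

    Note that "." does not match a line break by default, so a paragraph that spans several lines would be skipped unless RegexOptions.Singleline is passed to the Regex constructor.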

  3. #3

    Thread Starter
    Member hardyvoje's Avatar
    Join Date
    Feb 2006
    Location
    Serbia
    Posts
    38

    Re: Getting Values from HTML page

    This seems like a very good solution!
    Thanks a lot!

    I'll use this method, but if anyone knows a better one, please post it here; people searching later may find it useful.

  4. #4

    Thread Starter
    Member hardyvoje's Avatar
    Join Date
    Feb 2006
    Location
    Serbia
    Posts
    38

    Re: Getting Values from HTML page

    It seems that my values are in <td></td>, so I tried:
    VB Code:
    Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<td>).*?(?=</td>)")

    But the HTML contains tags like:
    <td> Telefon</td><td colSpan="2">022/532-05</td>
    </tr>

    So I need a way to match this text regardless of the colSpan attribute, or any other HTML tag attribute.

    From this HTML sample, I want to get only " Telefon" and "022/532-05".
    I tried to figure out the regex syntax on my own, but it is quite hard!

    Can you please give me a snippet for this?
    Thanks!

  5. #5

    Thread Starter
    Member hardyvoje's Avatar
    Join Date
    Feb 2006
    Location
    Serbia
    Posts
    38

    Re: Getting Values from HTML page

    VB Code:
    Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<td[^/]+>).*?(?=</td>)")

    This one works, but it only matches tags that contain a space or attributes;
    plain <td> tags are not matched.

    ...
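
    The reason seems to be the "+" quantifier: [^/]+ demands at least one character between "<td" and ">", so a bare <td> can never satisfy the lookbehind. A quick check, as a sketch with made-up sample strings and names:

```vb
Imports System
Imports System.Text.RegularExpressions

Module PlusQuantifierCheck
    Sub Main()
        ' Same pattern as in the post above
        Dim plusPattern As New Regex("(?<=<td[^/]+>).*?(?=</td>)")

        ' Bare tag: nothing sits between "<td" and ">", so "+" cannot be satisfied
        Console.WriteLine(plusPattern.IsMatch("<td>plain</td>"))            ' False

        ' A single space after "td" is already enough for "+"
        Console.WriteLine(plusPattern.IsMatch("<td >spaced</td>"))          ' True

        ' An attribute works too, as long as it contains no "/"
        Console.WriteLine(plusPattern.IsMatch("<td colSpan=""2"">x</td>"))  ' True
    End Sub
End Module
```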

  6. #6
    PowerPoster
    Join Date
    Aug 2005
    Location
    College Station, TX
    Posts
    4,521

    Re: Getting Values from HTML page

    I was having the same issues until you posted that example. Put "[^/]+" inside parentheses, with a "?" after it, like below:
    VB Code:
    Dim mystring As String = "<p colspan='2'>this is a paragraph</p> <p> and this is another paragraph </p>"
    Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<p([^/]+)?>).*?(?=</p>)")
    Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(mystring)
    For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
        MsgBox(FoundMatch.Value)
    Next
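
    One caveat with [^/]+: an attribute value that itself contains a "/" (a URL, for example) would stop the match, and "<p" followed by anything also matches tags such as <pre>. A possible variation, not the pattern above but a sketch under my own assumptions, is to allow anything except ">" inside the tag and pull the cell text out with a capture group instead of lookarounds. ExtractCells is a hypothetical helper name, and the input is the <td> row from post #4:

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CellExtractor
    ' Hypothetical helper: returns the trimmed text of every <td>...</td> cell
    Function ExtractCells(ByVal html As String) As List(Of String)
        Dim cells As New List(Of String)
        ' <td\b    the tag name, ending at a word boundary (so <tdx> is not matched)
        ' [^>]*    any attributes, including ones whose values contain "/"
        ' (.*?)    group 1: the cell content, as short as possible
        ' Singleline lets "." cross line breaks; IgnoreCase also accepts <TD>
        Dim cellPattern As New Regex("<td\b[^>]*>(.*?)</td>", _
                                     RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        For Each m As Match In cellPattern.Matches(html)
            cells.Add(m.Groups(1).Value.Trim())
        Next
        Return cells
    End Function

    Sub Main()
        Dim row As String = "<td> Telefon</td><td colSpan=""2"">022/532-05</td>" & _
                            Environment.NewLine & "</tr>"
        For Each cell As String In ExtractCells(row)
            Console.WriteLine(cell)
        Next
        ' prints Telefon and then 022/532-05
    End Sub
End Module
```

    Trim() drops the leading space in " Telefon"; remove that call if the whitespace matters. This is still plain text matching, so nested tables or a ">" inside an attribute value would trip it up.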
