Thread: [Resolved] Any help with this formula [HTMLAgility-Pack]

  1. #1

    Thread Starter
    Member
    Join Date
    Apr 2011
    Posts
    35

    [Resolved] Any help with this formula [HTMLAgility-Pack]

    Hi VB people

    I've got a bit of a pickle. I have been working with HTML Agility Pack for a month or so and am loving it, learning as I go along. I am currently scraping different sites (for work) and have hit a snag. I'm not completely stuck, as I have worked around it using the code below, but is there a way to get the text that I actually need?

    Site-Link

    I am after the product name, which is located in an H2 tag, but whenever I try to pull the InnerText from that tag it pulls everything from the content column, and it's got me baffled.

    I am currently getting the product name from the breadcrumb link, but that pulls in some unwanted characters, which is why the Replace call is in the code below.

    Code:
            'get Product Name
            Dim getProductName = doc.DocumentNode.SelectSingleNode("//*[@class='breadcrumbs']//li[5]").InnerText
            getProductName = getProductName.Replace(">", "")
            Console.WriteLine(getProductName)
    The above works fine and I can carry on with what I am doing, but I'd like to know if there is a way of getting the product name directly from the H2 tag.
    Last edited by lynx2011; Apr 30th, 2017 at 08:56 AM.

  2. #2
    Super Moderator jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    110,344

    Re: Any help with this formula [HTMLAgility-Pack]

    I've never used HTML AP so I don't know exactly what it can do and what it can't but I notice that the chevron in that <li> tag is itself inside a <span> tag, so perhaps you can get the InnerHtml rather than the InnerText and then exclude that span from that to get just the plain text content.
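    A rough sketch of that idea, for anyone wanting to try it. The breadcrumb markup here is an assumption (the real page's HTML is not shown in this thread), and the li index is chosen to fit the sample rather than the real site. It removes the span elements from the li and then reads what is left:

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module BreadcrumbSketch
    Sub Main()
        ' Assumed sample markup: each breadcrumb's chevron sits in its own <span>.
        Dim html = "<ul class='breadcrumbs'>" &
                   "<li><a href='/'>Home</a> <span>&gt;</span></li>" &
                   "<li>Rigel Shirt <span>&gt;</span></li>" &
                   "</ul>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' li[2] matches the sample above; the original post used li[5] on the real page.
        Dim li = doc.DocumentNode.SelectSingleNode("//*[@class='breadcrumbs']//li[2]")
        If li IsNot Nothing Then
            ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
            Dim spans = li.SelectNodes(".//span")
            If spans IsNot Nothing Then
                ' Copy to a list first so removal cannot disturb the enumeration.
                For Each span In spans.ToList()
                    span.Remove()
                Next
            End If

            ' Decode any remaining entities and trim surrounding whitespace.
            Console.WriteLine(HtmlEntity.DeEntitize(li.InnerText).Trim())
        End If
    End Sub
End Module
```

    This avoids the string Replace, but it still reads the breadcrumb rather than the H2 heading.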

  3. #3

    Thread Starter
    Member
    Join Date
    Apr 2011
    Posts
    35

    Re: Any help with this formula [HTMLAgility-Pack]

    Quote Originally Posted by jmcilhinney View Post
    I've never used HTML AP so I don't know exactly what it can do and what it can't but I notice that the chevron in that <li> tag is itself inside a <span> tag, so perhaps you can get the InnerHtml rather than the InnerText and then exclude that span from that to get just the plain text content.
    Sorry, I am not too sure where you are looking?

    Code:
    <h2 class="heading-alpha">Rigel Shirt</h2>
    That's the current markup as shown by Inspect Element.

  4. #4
    Super Moderator jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    110,344

    Re: Any help with this formula [HTMLAgility-Pack]

    Quote Originally Posted by lynx2011 View Post
    Sorry, I am not too sure where you are looking?
    I'm looking at the part of the HTML that your code is retrieving, which is NOT the part that you just quoted. Maybe you should go back and look at it again. You are specifying that you want the element with a class of 'breadcrumbs'. The breadcrumbs are separated by '>' characters, which is why you're getting those characters. If you don't want the breadcrumbs then don't retrieve the breadcrumbs. If the element whose contents you want has a class of 'heading-alpha' then that's probably the class you should use in your retrieval code.

  5. #5

    Thread Starter
    Member
    Join Date
    Apr 2011
    Posts
    35

    Re: Any help with this formula [HTMLAgility-Pack]

    Quote Originally Posted by jmcilhinney View Post
    I'm looking at the part of the HTML that your code is retrieving, which is NOT the part that you just quoted. Maybe you should go back and look at it again. You are specifying that you want the element with a class of 'breadcrumbs'. The breadcrumbs are separated by '>' characters, which is why you're getting those characters. If you don't want the breadcrumbs then don't retrieve the breadcrumbs. If the element whose contents you want has a class of 'heading-alpha' then that's probably the class you should use in your retrieval code.
    Yeah, I think I may have confused things by mentioning two things at once.

    I am after the product name, which is located in an H2 tag, but whenever I try to pull the InnerText from that tag it pulls everything from the content column, and it's got me baffled.

    I am currently getting the product name from the breadcrumb link, but that pulls in some unwanted characters, which is why the Replace call is in the code below.
    Basically, I am trying to get the product name from the H2 tag. As a workaround I took it from the breadcrumb link and removed the extra characters with the Replace call (see the code in my first post), but what I would like to know is how to get the product name from the H2 tag itself.

    I think I should have put it that way instead of explaining both, but then again, after reading it over, the original still sounds correct to me.

  6. #6
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,335

    Re: Any help with this formula [HTMLAgility-Pack]

    Quote Originally Posted by lynx2011 View Post
    Sorry, I am not too sure where you are looking?

    Code:
    <h2 class="heading-alpha">Rigel Shirt</h2>
    That's the current markup as shown by Inspect Element.
    It's interesting that what's returned by Inspect Element is a fixed version of the actual broken HTML. The real HTML closes the h2 tag with an h1 closing tag:
    HTML Code:
    <h2 class="heading-alpha">Rigel Shirt</h1>
    The agility pack also "fixes" the problem by removing the /h1 closing tag and inserting a /h2 closing tag, but it inserts the tag after the h2 header node's sibling nodes. This effectively turns those sibling nodes into child nodes of the h2 header node. That's why you see all the extra text when retrieving h2's inner text.
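    One way to check this for yourself is to list the h2 node's children and the parse errors the agility pack recorded. This is only a sketch: the HTML string is a made-up stand-in for the real page, so the exact nodes you see may differ from the live site.

```vbnet
Imports System
Imports HtmlAgilityPack

Module InspectH2Sketch
    Sub Main()
        Dim doc As New HtmlDocument()
        ' Assumed sample: an h2 wrongly closed with </h1>, followed by a sibling paragraph.
        doc.LoadHtml("<div><h2 class=""heading-alpha"">Rigel Shirt</h1><p>Description text</p></div>")

        Dim h2 = doc.DocumentNode.SelectSingleNode("//h2")
        If h2 IsNot Nothing Then
            ' Show what the parser placed inside the h2 after its own repair.
            For Each child As HtmlNode In h2.ChildNodes
                Console.WriteLine($"{child.NodeType}: {child.Name}")
            Next
        End If

        ' ParseErrors lists the markup problems the parser noticed while loading.
        For Each parseError In doc.ParseErrors
            Console.WriteLine($"Line {parseError.Line}: {parseError.Reason}")
        Next
    End Sub
End Module
```

    If elements other than the heading text show up as children of the h2, that matches the behaviour described above.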


    I'm not all that familiar with the agility pack, but you could try something like:
    Code:
    Dim h2Node As HtmlNode = doc.DocumentNode.SelectSingleNode("//h2")
    Dim productName As String = h2Node.FirstChild.InnerText
    or:
    Code:
    Dim productName As String = doc.DocumentNode.SelectSingleNode("//h2/text()").InnerText
    Both the above seem to work in this particular case.
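    A slightly more defensive variant of the second approach, as a sketch only. SelectSingleNode returns Nothing when there is no match, so calling .InnerText directly would throw on a page without the heading. GetProductName is a hypothetical helper name, the 'heading-alpha' class is taken from the markup quoted earlier in the thread, and the HTML string is a made-up stand-in for the real page.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ProductNameSketch
    ' Hypothetical helper: returns the h2's own leading text, or Nothing if it is not found.
    Function GetProductName(doc As HtmlDocument) As String
        Dim textNode = doc.DocumentNode.SelectSingleNode("//h2[@class='heading-alpha']/text()")
        If textNode Is Nothing Then Return Nothing

        ' Decode entities such as &amp; and trim surrounding whitespace.
        Return HtmlEntity.DeEntitize(textNode.InnerText).Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Assumed sample with the mismatched closing tag described above.
        doc.LoadHtml("<div><h2 class=""heading-alpha"">Rigel Shirt</h1><p>Description text</p></div>")

        Dim name = GetProductName(doc)
        Console.WriteLine(If(name, "(h2 text not found)"))
    End Sub
End Module
```

    Matching on the class as well as the tag name keeps the query from picking up some other h2 elsewhere on the page.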

  7. #7

    Thread Starter
    Member
    Join Date
    Apr 2011
    Posts
    35

    Re: Any help with this formula [HTMLAgility-Pack]

    Quote Originally Posted by Inferrd View Post
    It's interesting that what's returned by Inspect Element is a fixed version of the actual broken HTML. The real HTML closes the h2 tag with an h1 closing tag:
    HTML Code:
    <h2 class="heading-alpha">Rigel Shirt</h1>
    The agility pack also "fixes" the problem by removing the /h1 closing tag and inserting a /h2 closing tag, but it inserts the tag after the h2 header node's sibling nodes. This effectively turns those sibling nodes into child nodes of the h2 header node. That's why you see all the extra text when retrieving h2's inner text.


    I'm not all that familiar with the agility pack, but you could try something like:
    Code:
    Dim h2Node As HtmlNode = doc.DocumentNode.SelectSingleNode("//h2")
    Dim productName As String = h2Node.FirstChild.InnerText
    or:
    Code:
    Dim productName As String = doc.DocumentNode.SelectSingleNode("//h2/text()").InnerText
    Both the above seem to work in this particular case.
    Ahh, I see. Thanks for explaining that!!
