help scraping data from website

    Jan 2005

    I need help scraping certain data from websites. The code I have now works fine for some sites:
    Dim bookmarkNodes As IEnumerable(Of HtmlNode) = htmlDocument.DocumentNode.SelectNodes("//a[@rel='bookmark' and @title and not(.//time[@class='entry-date' and not(@datetime='')])]")
    but some sites use this markup instead:
    <h3 class="entry-title"><a href="https://somewebisteref" rel="bookmark">data that i want to scrap</a></h3> <div class="entry-meta">
    <a href="https://somewebsite" title="data i need">
    The title part of my code likewise works for some sites but not all.

    I would like to keep using the existing code that works and integrate what I need into it.

    The scraped data goes into a text box that formats it the way I need.

    Many thanks
    Mar 2011
    Re: help scraping data from website

    You need to look at the overall DOM and determine a pattern that is specific enough to match only the items you want, but not so specific that it excludes some of them.

    Right now, your query is too specific: it requires a title attribute on the bookmark anchor, and the bookmark anchor in your second sample has none.

    Is it safe to say you want all anchor elements (<a>) that are direct children of an element with the class "entry-title"? Or could you be more specific and say that you want all anchor elements that are direct children of heading 3 (<h3>) elements with that class?

    Once you get that down, it's just a matter of building the query. Unfortunately, the business logic is not something we can decide for you. Once you know what you want to grab, tell us in plain English and we can help you on the code side.
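
    As a starting point, here is a rough sketch of how the existing query could be kept and widened, assuming the rule turns out to be "bookmark anchors that are direct children of an <h3> with the class entry-title". That rule is only a guess at what you want. The module name, the function name ExtractBookmarkTexts, and the fallback from the title attribute to the link text are made up for illustration. The title-bearing anchor inside entry-meta is not covered and would need its own rule once you decide how to identify it.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports HtmlAgilityPack

    Module BookmarkScraper

        ' Original query from the first post, kept unchanged.
        Private Const ExistingQuery As String =
            "//a[@rel='bookmark' and @title and not(.//time[@class='entry-date' and not(@datetime='')])]"

        ' Assumed extra rule: bookmark anchors that are direct children of an
        ' <h3> whose class list contains "entry-title". No title attribute is required.
        Private Const HeadingQuery As String =
            "//h3[contains(concat(' ', normalize-space(@class), ' '), ' entry-title ')]/a[@rel='bookmark']"

        ' Returns one string per matching anchor: the title attribute when it is
        ' present and non-empty, otherwise the anchor's visible text.
        Public Function ExtractBookmarkTexts(html As String) As List(Of String)
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(html)

            ' "|" is the XPath union operator, so a node matched by both
            ' queries appears only once in the result.
            Dim nodes As HtmlNodeCollection =
                doc.DocumentNode.SelectNodes(ExistingQuery & " | " & HeadingQuery)

            Dim results As New List(Of String)()

            ' SelectNodes can return Nothing rather than an empty collection.
            If nodes Is Nothing Then Return results

            For Each node As HtmlNode In nodes
                Dim value As String = node.GetAttributeValue("title", "")
                If String.IsNullOrWhiteSpace(value) Then value = node.InnerText
                results.Add(HtmlEntity.DeEntitize(value).Trim())
            Next

            Return results
        End Function

        Sub Main()
            ' Markup taken from the second sample in the first post.
            Dim sample As String =
                "<h3 class=""entry-title""><a href=""https://somewebisteref"" rel=""bookmark"">data that i want to scrap</a></h3>"

            For Each item As String In ExtractBookmarkTexts(sample)
                Console.WriteLine(item)
            Next
        End Sub

    End Module
    ```

    The union keeps your working query as-is, so sites that already scrape correctly should behave the same, and the second query only adds matches. If the heading rule turns out to be too broad for some sites, tighten HeadingQuery rather than the original.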
    Aug 2002

    Re: help scraping data from website

    What I always say is that if you can avoid scraping a website, then do so. If the site owners didn't want you to have the data, they wouldn't put it on a website. If they do want you to have it, they might already have an API for the site, or might be willing to write one. An API is going to be VASTLY easier to work with than scraping a website.

    The problem is that there is something in the nature of web developers that they have to change the HTML every few months. First it's a <p>, then it's a button, then it's some goofy <div> with CSS to make it look like a button. Active websites change and change often. APIs are a way to get the data out such that computers can talk between themselves to transfer the raw data. Those tend not to change, whereas the website is just a visual representation that can change on a whim.
