
Thread: Parsing meta data from urls in a text file.

  1. #1

    Thread Starter
    New Member
    Join Date
    Nov 2011
    Posts
    1


    Here's my idea: I want to parse the metadata (keywords and descriptions) from a list of URLs in a text file.

    The URLs are listed in the text file one per line, like this:
    http://www.dogsdogsdogs.com
    http://www.ilikedogs.com
    http://www.didyouseethatdog.com

    I'm not really worried about where to store the results; I'm deciding between a separate text file and an Excel table. My problem is the parsing. I've been having trouble coding this on my own for a while, so I decided to ask the community. Any ideas? Thanks for any input.

  2. #2
    I'm about to be a PowerPoster! Hack's Avatar
    Join Date
    Aug 2001
    Location
    Searching for mendhak
    Posts
    58,333

    Re: Parsing meta data from urls in a text file.

    Moved from the Codebank (which is for sharing code with others rather than asking questions)

  3. #3
    PowerPoster dilettante's Avatar
    Join Date
    Feb 2006
    Posts
    24,487

    Re: Parsing meta data from urls in a text file.

    You can't mean "parse from URLs", since a URL by itself contains no metadata, so I assume you mean to retrieve the default HTML page from each of those sites and parse that.
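    The retrieval step could be sketched roughly like this in Python, using only the standard library. The names read_urls and fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not anything from this thread.

```python
from urllib.request import Request, urlopen


def read_urls(path):
    """Return the non-blank lines of a text file, assuming one URL per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_html(url, timeout=10):
    """Download the page at url and return its HTML as text.

    The User-Agent header is an assumption: some servers refuse
    requests that carry the default urllib agent string.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

    Calling fetch_html on each entry returned by read_urls gives you the raw HTML to parse. A real run should also catch urllib.error.URLError so that one dead site does not stop the whole list.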

    Nothing new here: you're talking about simple Web scraping. There must be a ton of threads on that, here and elsewhere. Maybe you were just using the wrong keywords in your searches?


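    Parsing HTML is in theory just like parsing XML, but there are some extra issues. HTML pages are often "dirty": browsers display them properly even when the markup is imperfect, for example when tags are left unclosed or are mismatched, and HTML also has elements like <BR> that have no close tag at all. A strict XML parser will reject such pages, so you need a parser that tolerates them.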
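    As a minimal sketch of a tolerant approach, Python's standard html.parser module reports tags as it meets them and does not require well-formed markup. The class MetaTagParser and the function extract_meta are hypothetical names chosen for illustration, and the sketch assumes the metadata sits in ordinary <meta name="..." content="..."> tags in the served HTML.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of selected <meta name=...> tags."""

    def __init__(self, wanted=("keywords", "description")):
        super().__init__()
        self.wanted = wanted
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Keep only the first occurrence of each wanted name.
        if name in self.wanted and name not in self.found:
            self.found[name] = attributes.get("content") or ""


def extract_meta(html):
    """Return a dict of the wanted meta names found in html."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.found
```

    Because the parser is event-driven rather than tree-building, unclosed tags elsewhere in the page do not stop it from reaching the meta tags. Pages that leave out a keywords or description tag simply produce a dict without that key.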

    Also, to defeat Web-scraping pirates, more and more sites omit important blocks of data from the HTML and use embedded or external scripts to generate or fetch that data and insert it into the page. In those cases the HTML is fairly useless by itself and has to be run through a script-enabled page renderer. Even that may fail if the site uses frames and iframes cleverly.
