Parsing meta data from urls in a text file.

**jbBigCake** · Nov 23rd, 2011, 12:06 PM

Here's my idea. I want to parse the meta data (keywords and descriptions) from a list of urls in a text file.

The URLs are listed in the text file one per line, like this:
http://www.dogsdogsdogs.com
http://www.ilikedogs.com
http://www.didyouseethatdog.com

I'm not really worried about where I'm going to store the results; I'm deciding between a separate text file and an Excel table. My problem has been the parsing. I've been having trouble coding this on my own for a while, so I decided to ask the community. Any ideas? Thanks for any input.

**Hack** · Nov 23rd, 2011, 12:26 PM

Moved from the CodeBank (which is for sharing code with others rather than asking questions).

**dilettante** · Nov 23rd, 2011, 12:36 PM

You don't mean "parse from URLs" since that makes no sense, so I assume you mean to retrieve the default HTML pages from those sites and parse that.
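For illustration, the retrieval step could look roughly like this. The thread doesn't name a language, so this is a Python sketch using only the standard library. The file name `urls.txt` and the helper names `read_urls` and `fetch_html` are assumptions of mine, not anything from the original question.

```python
from urllib.request import Request, urlopen


def read_urls(path):
    """Return the non-empty lines of a text file, assuming one URL per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_html(url, timeout=10):
    """Download the default page at `url` and return it as text (hypothetical helper)."""
    # Some servers reject requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A typical loop would be `for url in read_urls("urls.txt"): html = fetch_html(url)`, ideally wrapped in a `try`/`except` so that one dead site doesn't stop the whole run.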

Nothing new here, you're talking about simple Web scraping. There must be a ton of threads here and elsewhere on that. Maybe you were just using the wrong keywords for your searches?

Parsing HTML is in theory just like parsing XML, but there are some extra issues. HTML pages are often "dirty": they display properly even when the markup is imperfect, use tags like <BR> that have no close tag, and so on.
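As a rough sketch of the parsing side, a forgiving HTML parser copes with this kind of markup better than a strict XML one. The Python example below uses the standard library's `html.parser`. The names `MetaTagParser` and `extract_meta` are made up for illustration, and it assumes the pages use the conventional `<meta name="keywords">` and `<meta name="description">` tags.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of the keywords and description meta tags."""

    WANTED = ("keywords", "description")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this,
        # so <META NAME=...> and <meta name=...> arrive in the same form.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        # Keep only the first occurrence of each wanted tag.
        if name in self.WANTED and name not in self.meta:
            self.meta[name] = (attributes.get("content") or "").strip()


def extract_meta(html):
    """Return a dict with whichever of 'keywords'/'description' the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta
```

Unclosed tags such as `<BR>` are simply passed over, and a page with no matching meta tags yields an empty dict rather than an error.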

Also, to defeat Web-scraping pirates, more and more sites are omitting important blocks of data from the HTML and using embedded or external scripts to generate or fetch and insert this data. That means the HTML is fairly useless by itself and has to be run through a script-enabled page renderer. Even that may fail if they use frames and iframes cleverly.
