-
Nov 23rd, 2011, 12:06 PM
#1
Thread Starter
New Member
Parsing meta data from urls in a text file.
Here's my idea. I want to parse the meta data (keywords and descriptions) from a list of URLs in a text file.
The URLs are listed in the text file one per line, like this:
http://www.dogsdogsdogs.com
http://www.ilikedogs.com
http://www.didyouseethatdog.com
I'm not really worried about where I'm going to store the results; I'm deciding between a separate text file and an Excel table. My problem has been the parsing. I've been having trouble coding this on my own for a while, so I decided to ask the community. Any ideas? Thanks for any input.
-
Nov 23rd, 2011, 12:26 PM
#2
Re: Parsing meta data from urls in a text file.
Moved from the Codebank (which is for sharing code with others rather than asking questions).
-
Nov 23rd, 2011, 12:36 PM
#3
Re: Parsing meta data from urls in a text file.
You don't mean "parse from URLs" since that makes no sense, so I assume you mean to retrieve the default HTML pages from those sites and parse that.
Nothing new here, you're talking about simple Web scraping. There must be a ton of threads here and elsewhere on that. Maybe you were just using the wrong keywords for your searches?
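To make the "retrieve the pages" step concrete, here is a minimal sketch in Python using only the standard library. The thread does not name a language, so Python is an assumption, as are the helper names read_urls and fetch_html, the 10-second timeout, and the User-Agent string. It reads one URL per line from the text file and downloads each page as text, returning None when a fetch fails.

```python
import http.client
import urllib.request


def read_urls(path):
    """Return the non-blank lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text, or None on failure.

    The timeout and User-Agent values are illustrative assumptions.
    """
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError, http.client.HTTPException):
        # Covers network errors, malformed URLs, and unknown charsets.
        return None
```

Usage would look something like `for url in read_urls("urls.txt"): html = fetch_html(url)`, where urls.txt is a hypothetical file name. Returning None rather than raising lets the loop skip dead sites and carry on with the rest of the list.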
Parsing HTML is in theory just like parsing XML, but there are some extra issues. HTML pages are often "dirty" yet still display properly, because browsers tolerate imperfect markup, and HTML also allows tags like <BR> that have no close tag, which a strict XML parser would reject.
Also, to defeat Web-scraping pirates, more and more sites omit important blocks of data from the HTML and use embedded or external scripts to generate or fetch and insert that data. In those cases the HTML is fairly useless by itself and has to be run through a script-enabled page renderer. Even that may fail if they use frames and iframes cleverly.
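One way to cope with "dirty" HTML is to use a lenient, event-based HTML parser instead of an XML parser. Below is a rough sketch in Python (the language is an assumption, since the thread does not name one) using the standard library's html.parser module. The names MetaTagParser and extract_meta are made up for illustration. It collects only the keywords and description meta tags and ignores everything else, so unclosed tags elsewhere on the page do not matter.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works too.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in ("keywords", "description") and content is not None:
            # Keep the first occurrence if a page repeats the tag.
            self.meta.setdefault(name, content.strip())


def extract_meta(html):
    """Return a dict with any 'keywords' and 'description' values found in the HTML."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta
```

For example, calling extract_meta on a page's HTML text gives back a dict that may have a "keywords" key, a "description" key, both, or neither. This only sees what is in the static HTML, so it will come back empty for pages that insert their meta data with scripts.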