Web Crawler

**dj4uk** · Mar 30th, 2004, 03:30 AM

Has anyone done anything like the following, or does anyone have ideas on how I could go about developing it?

I'd like to come up with a competitor analysis tool that checks for new content on specified URLs at set intervals (days rather than minutes or hours). If there are any changes, they should be sent as email attachments. The difficult bit, as far as I'm concerned, is ensuring that only changes in the main content trigger an action, and not just a different advert on a page!

I envisage that this will be a console or perhaps Windows Forms application.

Any help would be appreciated.

DJ

**JAtkinson** · Mar 30th, 2004, 01:42 PM

Not sure about the second part... but you could download the HTML and save it to a file. Then, on the next check, load the saved file into a string and see whether the new page content == the old content. There may be a much easier way to do this though.
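
A minimal sketch of that save-and-compare idea, in Python purely for illustration (the thread does not settle on a language). The function names `fetch_html` and `changed_since_last_run` and the snapshot file are made up for this example, and it compares the whole page, so a rotating advert would still count as a change.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def changed_since_last_run(html: str, snapshot: Path) -> bool:
    """Compare html with the saved copy, then overwrite the saved copy.

    Returns False on the first run, when there is nothing to compare against.
    """
    previous = None
    if snapshot.exists():
        previous = snapshot.read_bytes().decode("utf-8")
    # Store bytes so that newline translation cannot cause false differences.
    snapshot.write_bytes(html.encode("utf-8"))
    return previous is not None and previous != html
```

Usage would be something like `changed_since_last_run(fetch_html(url), Path("competitor.html"))`, run once per day from a scheduled task.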

**Redth** · Mar 31st, 2004, 01:09 PM

Not sure if you can get the file's timestamp from the server in the HTTP header somehow; first I'd look into that... If that doesn't work, which I doubt it will, download the HTML, hash it, and compare it against an older hash. If it's different, well, then it's been changed...
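
Both ideas can be sketched roughly as follows, again in Python for illustration. The timestamp a server may send is the standard `Last-Modified` response header, but dynamically generated pages often leave it out, so treat `None` as "unknown". The function names and the choice of SHA-256 are assumptions for this example, not something the thread specifies.

```python
import hashlib
from typing import Optional
from urllib.request import Request, urlopen


def last_modified(url: str) -> Optional[str]:
    """Return the Last-Modified header from a HEAD request, or None if absent."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def content_hash(html: bytes) -> str:
    """Return a hex digest that changes whenever any byte of the page changes."""
    return hashlib.sha256(html).hexdigest()


def has_changed(html: bytes, previous_hash: str) -> bool:
    """True when the page no longer matches the hash stored on the last check."""
    return content_hash(html) != previous_hash
```

Storing one short hash per URL avoids keeping the full old page around, though any byte-level difference, adverts included, produces a different hash.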

**MrPolite** · Apr 1st, 2004, 03:04 AM

Getting the hash code and such shouldn't work, because every time you view the page there will probably be a small change in the HTML (because of advertisements)....
Well, I would say that the updates on a site will most likely be text only. They add/remove a link or two, or add a bunch of text to the page. Images are usually advertisements, and I would say the CONTENT of a page hardly ever gets updated with images alone....
So a simple suggestion would be to read the HTML file, ignore all the image and other HTML tags (table, etc.), and measure the length (in characters) of the visible text that appears on the page. If there is an update to the page (which will again most probably be an update to the visible text), then the length of that text would change....
If you don't know what I'm saying, ask me to explain more.

It seems like a simple solution that should work. Anything more complicated than that, which would actually examine the page word by word, is out of the scope of my IQ.
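
The visible-text-length idea above could be sketched like this, in Python for illustration, using the standard library's `html.parser`. The class and function names are invented for the example, and skipping `script` and `style` contents is an added assumption about what counts as "visible".

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse every run of whitespace.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def visible_text_length(html: str) -> int:
    return len(visible_text(html))
```

Two downloads that differ only in an `<img>` advert give the same length, while added text changes it. An edit that happens to keep the same character count would slip through, so comparing the extracted text itself (or a hash of it) is a stricter variant of the same approach. Text adverts would still register as changes either way.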
