-
Mar 30th, 2004, 03:30 AM
#1
Thread Starter
Frenzied Member
Web Crawler
Has anyone done anything like the following, or does anyone have ideas on how I could go about developing it?
I'd like to come up with a competitor analysis tool that checks for new content on specified URLs at set intervals (days rather than minutes or hours). If there are any changes, these should be sent as email attachments. The difficult bit, as far as I'm concerned, is ensuring that only changes in the main content trigger an action, and not just a different advert on a page!
I envisage that this will be a console or perhaps a Windows Forms application.
Any help would be appreciated.
DJ
-
Mar 30th, 2004, 01:42 PM
#2
Lively Member
Not sure about the second part... but you could download the HTML and save it to a file. On the next run, download the page again, load both versions into strings, and check whether the new page content == the old content. There may be a much easier way to do this though.
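A minimal sketch of that save-and-compare idea, written in Python for brevity (the same steps carry over to VB.NET or C#). The function name `has_changed` and the snapshot file are hypothetical; the caller is assumed to have already downloaded the page as bytes.

```python
from pathlib import Path


def has_changed(new_html: bytes, snapshot_path: Path) -> bool:
    """Compare new_html with the saved snapshot, then overwrite the snapshot.

    Returns True when the content differs from the previous run. The very
    first run (no snapshot yet) only records a baseline and returns False.
    """
    old_html = None
    if snapshot_path.exists():
        old_html = snapshot_path.read_bytes()
    # Store the latest copy so the next run compares against it.
    snapshot_path.write_bytes(new_html)
    return old_html is not None and old_html != new_html
```

Comparing raw bytes avoids false alarms caused by newline translation when the file is written and read back as text.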
Visual Studio .net 2003 EA
VB .net
C#
-
Mar 31st, 2004, 01:09 PM
#3
Fanatic Member
Not sure if you can get the file's timestamp from the server in the HTTP headers somehow (the Last-Modified header, if the server sends one); I'd look into that first. If that doesn't work, which I doubt it will, download the HTML, hash it, and compare the result against the older hash. If it's different, well, then the page has been changed...
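Both steps can be sketched roughly as follows, in Python for brevity. The function names are made up for illustration, SHA-256 is just one reasonable choice of hash, and many servers omit Last-Modified for dynamically generated pages, so the header check may simply return nothing.

```python
import hashlib
import urllib.request
from typing import Optional


def last_modified(url: str) -> Optional[str]:
    """Return the Last-Modified header from a HEAD request, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def page_hash(html: bytes) -> str:
    """Hex digest of the downloaded page; store it between runs."""
    return hashlib.sha256(html).hexdigest()


def changed_since(html: bytes, previous_hash: Optional[str]) -> bool:
    """True if a previous hash exists and the page no longer matches it."""
    return previous_hash is not None and page_hash(html) != previous_hash
```

Storing only the hash keeps the saved state small, but any byte-level difference in the page, including a rotated advert, will register as a change.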
-
Apr 1st, 2004, 03:04 AM
#4
Comparing hash codes shouldn't work, because every time you view the page there will probably be a small change in the HTML (because of the advertisements)...
Well, I would say that the updates on a site will most likely be text only. They add or remove a link or two, or add a bunch of text to the page. Images are usually advertisements, and I would say the CONTENT of a page hardly ever gets updated with images alone...
So a simple suggestion would be to read the HTML file, ignore all the image and other HTML tags (table, etc.), and measure the length (in characters) of the visible text that appears on the page. If there is an update to the page (which will again most probably be an update to the visible text), then the length of that text will change...
If you don't follow what I'm saying, ask me to explain more. It seems like a simple solution that should work. Anything more complicated than that, which would actually examine the page word by word, is out of the scope of my IQ.
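A rough sketch of the visible-text measurement, in Python using only the standard library. The class and function names are hypothetical, and skipping `script` and `style` content is an added assumption, since that text never appears on the rendered page.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring tags and script/style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's text with all tags dropped and whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def visible_length(html: str) -> int:
    return len(visible_text(html))
```

Two pages that differ only in an image advert give the same `visible_length`. An edit that happens to keep the character count the same would slip through, so comparing the `visible_text` output itself (or a hash of it) is a stricter variant of the same idea. Text-based adverts would still show up as changes either way.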
rate my posts if they help ya!