Thread: Web Crawler

  1. #1

    Thread Starter
    Frenzied Member dj4uk's Avatar
    Join Date
    Aug 2002
    Location
    Birmingham, UK Lobotomies: 3
    Posts
    1,131

    Web Crawler

    Has anyone done anything like the following, or does anyone have ideas on how I could go about developing it?

    I'd like to come up with a competitor analysis tool that checks for new content on specified URLs at set intervals (days rather than minutes or hours). If there are any changes, these should be sent as email attachments. The difficult bit as far as I'm concerned is ensuring that only changes in the main content trigger an action, and not just a different advert on a page!

    I envisage that this will be a console or perhaps Windows Forms application.

    Any help would be appreciated.

    DJ

  2. #2
    Lively Member JAtkinson's Avatar
    Join Date
    Feb 2004
    Location
    Richmond, VA
    Posts
    68
    Not sure about the second part... but you could download the HTML and save it to a file. On the next check, load the saved file into a string and see whether the new page content == the old content. There may be a much easier way to do this though.
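
    A minimal sketch of that save-and-compare idea, written in Python for brevity (the same steps map onto a .NET console app). The names fetch_page and has_changed and the snapshot file path are made up for illustration, and the raw bytes are compared so that encoding or newline handling cannot cause a false alarm:

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_page(url: str) -> bytes:
    """Download the raw body of a page (hypothetical helper, needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def has_changed(new_body: bytes, snapshot: Path) -> bool:
    """Compare new_body with the saved snapshot, then overwrite the snapshot.

    The first run has nothing to compare against, so it reports no change.
    """
    old_body = snapshot.read_bytes() if snapshot.exists() else None
    snapshot.write_bytes(new_body)
    return old_body is not None and old_body != new_body
```

    Something like has_changed(fetch_page(url), Path("competitor.html")) would then run once per day from a scheduled task. As written, any difference at all counts as a change, rotating adverts included.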
    Visual Studio .net 2003 EA
    VB .net
    C#

  3. #3
    Fanatic Member Redth's Avatar
    Join Date
    May 2001
    Location
    Ontario, Canada
    Posts
    551
    Not sure if you can get the file's timestamp from the server in the HTTP headers somehow; first I'd look into that... If that doesn't work, which I doubt it will, download the HTML, hash it, and compare the result against an older hash. If it's different, well, then it's been changed...
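
    Both steps can be sketched roughly as follows, in Python for brevity. Last-Modified is the standard HTTP response header for a timestamp, but many dynamically generated pages do not send it, hence the hash fallback. The function names and the choice of SHA-256 are assumptions for illustration only:

```python
import hashlib
from typing import Optional
from urllib.request import Request, urlopen


def last_modified(url: str) -> Optional[str]:
    """Return the Last-Modified header from a HEAD request, or None if absent."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def digest(body: bytes) -> str:
    """Hash the downloaded page body (SHA-256 chosen arbitrarily here)."""
    return hashlib.sha256(body).hexdigest()


def differs_from(previous_digest: Optional[str], body: bytes) -> bool:
    """True when a stored digest exists and the new body hashes differently."""
    return previous_digest is not None and digest(body) != previous_digest
```

    Storing only the digest between runs keeps the saved state tiny, at the cost of not being able to show what changed.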

  4. #4
    l33t! MrPolite's Avatar
    Join Date
    Sep 2001
    Posts
    4,428
    Getting the hash code and such shouldn't work, because every time you view the page there will probably be a small change in the HTML (because of advertisements)....
    Well, I would say that the updates on a site will most likely be text only. They add/remove a link or two, or add a bunch of text to the page. Images are usually advertisements, and I would say the CONTENT of a page hardly ever gets updated with images alone....
    So a simple suggestion would be to read the HTML file, ignore all the image and other HTML tags (table, etc.) and measure the length (in characters) of the visible text that appears on the page. If there is an update to the page (which will again most probably be an update to the visible text), then the length of that text would change....
    If you don't know what I'm saying, ask me to explain more. It seems like a simple solution that should work. Anything more complicated than that, which would actually examine the page word by word, is beyond the scope of my IQ.
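
    A rough sketch of the visible-text idea, in Python for brevity and using only the standard library's html.parser. The class and function names are invented for illustration, and skipping script and style blocks is an added assumption, since their contents are never shown on the page:

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page text with tags removed and whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def visible_length(html: str) -> int:
    """Length in characters of the visible text, as suggested above."""
    return len(visible_text(html))
```

    Comparing visible_length between two downloads ignores a swapped banner image, but it would miss an edit that happens to leave the character count unchanged, and a text advert would still trigger it.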
