Thread: Web Spiders

  1. #1

    Thread Starter
    Fanatic Member venerable bede's Avatar
    Join Date
    Sep 2002
    Location
    The mystic land of Geordies
    Posts
    1,018

    Web Spiders

    I have decided to have a go at writing a web spider in either VB.NET or C#, as a personal project.

    Does anyone have any tips or URLs I could look at, so I can start reading up on it?

    Kiss Kiss

    TIA

    Parksie

  2. #2
    PowerPoster
    Join Date
    Feb 2002
    Location
    Canada, Toronto
    Posts
    5,794

    Re: Web Spiders

    I don't know of any URLs, but it's really quite simple.

    You start by downloading a web page, parsing it, and extracting all the HTML links.
    Add all those links to a collection.
    Then start a few threads that do the same thing: each takes its next link to download from that same collection, and when it parses a page, it adds the links it finds back into the collection. The download-and-parse step is sketched below.
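
    Something like this, for example (a minimal C# sketch; WebClient and Regex are standard .NET classes, but the URL and the regex are just placeholders, and a real spider needs error handling):

    Code:
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    class LinkExtractor
    {
        static void Main()
        {
            // Download one page (WebClient.DownloadString is the simplest route).
            WebClient client = new WebClient();
            string html = client.DownloadString("http://example.com/");

            // Crude href extraction; a proper HTML parser handles more cases
            // (relative URLs, unquoted attributes, comments, and so on).
            foreach (Match m in Regex.Matches(html,
                     @"href\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase))
            {
                Console.WriteLine(m.Groups[1].Value);
            }
        }
    }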

    The collection has to be a good one, because within seconds you will have hundreds of links in it, and within a few minutes or hours you will have hundreds of thousands. A database might be better: it's slower than an in-memory collection at the beginning, but faster in the long run. (See the frontier sketch below.)
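
    Since several threads share that collection, something along these lines could work (an illustrative C# sketch: a locked Queue of pending links plus a Hashtable of everything already queued; the names are made up):

    Code:
    using System.Collections;

    class LinkFrontier
    {
        private readonly object sync = new object();
        private readonly Queue pending = new Queue();      // links still to crawl
        private readonly Hashtable seen = new Hashtable(); // every URL ever queued

        // Queue a link only if no thread has queued it before.
        public void Add(string url)
        {
            lock (sync)
            {
                if (!seen.ContainsKey(url))
                {
                    seen[url] = true;
                    pending.Enqueue(url);
                }
            }
        }

        // Next link to crawl, or null when the queue is momentarily empty.
        public string Next()
        {
            lock (sync)
            {
                return pending.Count > 0 ? (string)pending.Dequeue() : null;
            }
        }
    }

    Swapping the Queue and Hashtable for database reads and writes keeps the same interface once the link count outgrows memory.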

    A spider is used to collect information on the web; if you make it look only for specific data, your collection won't grow as fast.

    Also, don't expect it to ever finish, because you start with one web page and end up with millions or billions of links.

    Eventually, once you have a lot of links, you will have to stop it looking for new ones and just process the information from the links already in the database.

    When I made my web spider, I noticed that the performance bottleneck is the page parsing, so make that as fast as possible. I wrote the page parsing in C++ and called it from VB for performance; the managed side of that kind of call is sketched below.
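
    For reference, the .NET half of that split might look like this (a C# sketch; Parser.dll and ExtractLinks are hypothetical names standing in for whatever the C++ DLL actually exports):

    Code:
    using System.Runtime.InteropServices;
    using System.Text;

    class NativeParser
    {
        // The C++ DLL would export this as an extern "C" function that
        // writes the links it finds into the supplied buffer.
        [DllImport("Parser.dll", CharSet = CharSet.Ansi)]
        private static extern int ExtractLinks(string html,
                                               StringBuilder links, int capacity);

        public static string GetLinks(string html)
        {
            StringBuilder buffer = new StringBuilder(1 << 20); // 1 MB output buffer
            ExtractLinks(html, buffer, buffer.Capacity);
            return buffer.ToString(); // e.g. one URL per line
        }
    }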

  3. #3
    Frenzied Member maged's Avatar
    Join Date
    Nov 2002
    Location
    Egypt
    Posts
    1,040

    Re: Web Spiders

    you will have hundreds of thousands of links. A database might be better: it's slower than an in-memory collection at the beginning, but faster in the long run.
    I think saving to a database is a must; it makes it easy to avoid visiting the same links over and over, so no work is done twice (see the sketch below).
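
    One way to get that for free is a unique key on the URL column, e.g. (an illustrative C# sketch using SQL Server; the table name and schema are made up):

    Code:
    using System.Data.SqlClient;

    class LinkStore
    {
        // Assumes a table like: CREATE TABLE Links (Url VARCHAR(900) PRIMARY KEY)
        public static bool TryAdd(SqlConnection conn, string url)
        {
            SqlCommand cmd = new SqlCommand(
                "INSERT INTO Links (Url) VALUES (@url)", conn);
            cmd.Parameters.AddWithValue("@url", url);
            try
            {
                cmd.ExecuteNonQuery();
                return true;   // new link, worth crawling
            }
            catch (SqlException)
            {
                return false;  // key violation: this URL was stored before
            }
        }
    }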

    Regarding page-parsing performance: I think strings are the curse of the .NET platform (performance-wise), so writing that part in C++ is a good idea.

    good luck anyway

  4. #4
    Lively Member
    Join Date
    May 2000
    Location
    Iowa USA
    Posts
    118

    Re: Web Spiders

