I was just thinking that it wouldn't actually be all that difficult to map out the internet if you used a distributed computing approach to the problem.

Consider, if you will, a client application.
The client application scans the hard drive for URLs, or starts off at a given URL, then scans that page for links, follows those links, scans those pages for links, and so on and so forth.
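The "scan a page for links" step could look something like this. It's only a rough sketch using Python's standard library; the names LinkParser and extract_links are made up for illustration, and it assumes you already have the page's HTML as a string (the fetching is left out).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, in page order, no duplicates."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

Relative links get resolved against the page they were found on, and things like mailto: links are dropped since there's nothing to crawl there.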

It could try a few smart things too.
E.g. if it was given the URL www.host.com/somedir/somefile.htm, it could try to get a directory listing of /somedir/, and then it could also try to get the index page from /.
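Working out those extra URLs to try is just string handling. Here's a rough sketch; extra_candidates is a made-up name, and it assumes plain http when the URL has no scheme. Whether the server actually hands back a directory listing for /somedir/ is up to the server, so treat these as things to try, nothing more.

```python
from urllib.parse import urlsplit, urlunsplit


def extra_candidates(url):
    """Return the parent directory and site root URLs worth probing for a URL."""
    # Assumption: a URL given without a scheme is treated as http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    path = parts.path or "/"
    directory = path[: path.rfind("/") + 1]  # "/somedir/somefile.htm" -> "/somedir/"
    candidates = []
    for candidate_path in (directory, "/"):
        candidate = urlunsplit((parts.scheme, parts.netloc, candidate_path, "", ""))
        if candidate != url and candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

For www.host.com/somedir/somefile.htm this gives the /somedir/ URL first and the site root second.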

The application follows these links until it is a certain depth down from the starting point, or until it's run out of memory or something. Then every so often it connects to a central server and uploads a compressed version of its findings.
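Roughly, the depth limit and the compressed upload could be sketched like this. Everything here is an assumption for illustration: get_links is a stand-in for whatever fetches a page and pulls out its links, max_pages is a crude substitute for "ran out of memory", and JSON plus zlib is just one possible upload format. The actual connection to the central server is left out.

```python
import json
import zlib
from collections import deque


def crawl(start_url, get_links, max_depth=3, max_pages=1000):
    """Breadth-first walk from start_url, stopping at max_depth or max_pages.

    get_links(url) is a caller-supplied (hypothetical) function returning the
    links found on a page; it is where the real fetching would happen.
    Returns a dict mapping each discovered URL to its depth.
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # deep enough, do not follow links from here
        for link in get_links(url):
            if link not in seen and len(seen) < max_pages:
                seen[link] = depth + 1
                queue.append(link)
    return seen


def pack_findings(findings):
    """Serialize findings to JSON and compress them for upload."""
    return zlib.compress(json.dumps(findings, sort_keys=True).encode("utf-8"))


def unpack_findings(blob):
    """Inverse of pack_findings, as the central server might run it."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Going breadth-first means the pages closest to the starting point get recorded first, so if the client gets cut off early it has still covered the nearby stuff.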

Now, if you took maybe 10 computers, gave each one a directory site to start off on, and told the people running them to tell other people, etc., then you'd end up getting a lot of data very fast.

One thing though: there are lots of free webspace providers with thousands upon thousands of users these days.
So if the app came across *.geocities.com, it could just record that geocities.com was a valid host rather than crawling every user's pages.
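The *.geocities.com check might be as simple as this. It's a sketch only: PROVIDERS is a made-up list with one entry, a real client would need a maintained one, and host_to_record is an illustrative name.

```python
from urllib.parse import urlsplit

# Hypothetical list of free webspace providers; a real client would need a maintained one.
PROVIDERS = {"geocities.com"}


def host_to_record(url):
    """Return the provider domain if the URL's host falls under one, else the host itself."""
    host = (urlsplit(url).hostname or "").lower()
    for provider in PROVIDERS:
        # Match the domain itself or any subdomain, but not lookalikes
        # such as "notgeocities.com".
        if host == provider or host.endswith("." + provider):
            return provider
    return host
```

Anything under a known provider collapses to one entry, and every other host gets recorded as it is.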

Then we make another special app that is designed for webspace providers. The app would use the site's search engine to scan for pages contained on the site. Or if the site had a directory, even better.


So it's not all that implausible.
If ya got a lot of people running this app, which I might add would sit in your system tray and not bother you, then you'd get a lot of data very fast.

Whaddaya think?