I wish to understand how search engine crawlers traverse websites, in terms of the pathways they take.
That's not helping me understand it.
I know it scrapes a webpage for links, OK, but what does it do next, and after that?
I would also assume it starts with some sort of seed site list, but if so, wouldn't that leave some web addresses unreachable, stuck in a void with nothing linking to them?
And if I were to assume it goes through all the scraped links, wouldn't that inflation jam up the crawler?
Then you first need to collect a list of all domains.
https://www.quora.com/How-do-I-scrap...p-level-domain
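On the "next what, and after that what" question, here is a minimal sketch of the usual queue-plus-seen-set loop, not how any particular search engine does it. It assumes a breadth-first order, and everything named here (`crawl`, `extract_links`, `FAKE_WEB`, the `a.example` URLs, the `max_pages` cap) is made up for illustration. The "web" is a small in-memory dict so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links and drop #fragments so one page maps to one URL.
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl. `fetch` is any callable returning HTML or None."""
    frontier = deque(seeds)  # URLs waiting to be visited (FIFO order)
    seen = set(seeds)        # every URL ever queued, so nothing is queued twice
    order = []               # pages actually fetched, in visit order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:     # page missing or fetch failed: skip it
            continue
        order.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical four-page site; nothing links to /island.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="/c">c</a>',
    "http://a.example/b": '<a href="/">home</a> <a href="/c#top">c</a>',
    "http://a.example/c": '<a href="/b">b</a>',
    "http://a.example/island": '<a href="/">home</a>',
}

print(crawl(["http://a.example/"], FAKE_WEB.get))
# ['http://a.example/', 'http://a.example/b', 'http://a.example/c']
```

Two things to notice. The `seen` set is what keeps the link "inflation" in check: a URL is queued at most once no matter how many pages link to it, and `max_pages` caps the total work. And `/island` is never visited because no crawled page links to it, which is exactly the "void" you describe; a page like that only gets found if it is in the seed list or something reachable links to it.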