Help crawling and extracting links...

**skanxalot** · Jul 20th, 2004, 03:31 PM

Hello,

I'm new to vb.net, so please go easy on me. I'm creating an application for our team and I need to crawl an entire website and use the links to create a TreeView of the website structure, including files and directories.
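For the structure part, one possible approach is to split each link's path on "/" and fold the pieces into a nested dictionary, which maps fairly directly onto TreeView nodes. The sketch below is in Python purely for brevity; `build_tree` and the sample URLs are hypothetical, not anything from the application described here.

```python
from urllib.parse import urlparse

def build_tree(urls):
    """Fold URL paths into a nested dict: directories and files become keys."""
    tree = {}
    for url in urls:
        node = tree
        # Skip empty segments produced by leading or trailing slashes.
        for part in filter(None, urlparse(url).path.split("/")):
            node = node.setdefault(part, {})
    return tree

# Hypothetical sample links.
links = [
    "http://example.com/docs/intro.html",
    "http://example.com/docs/api/index.html",
    "http://example.com/images/logo.gif",
]
print(build_tree(links))
```

Each key would become a node, and its nested dictionary would become that node's children.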

I've managed to pull all the links from a single page into a TreeView, but I need help crawling the entire site, given just the domain name. Can anyone shed some light?
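Going from one page to the whole site is usually a queue plus a "seen" set: take a page off the queue, collect its links, and queue any link not visited yet. A minimal sketch in Python, assuming a `get_links` function that returns the links on a page (here faked with a dictionary so the example runs without a network; all names are hypothetical):

```python
from collections import deque

def crawl(start_url, get_links):
    """Breadth-first crawl; get_links(url) returns the links found on that page."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:      # never queue a page twice
                seen.add(link)
                queue.append(link)
    return order

# Stand-in for a real site: page -> links on that page (hypothetical data).
fake_site = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
print(crawl("/", lambda url: fake_site.get(url, [])))
```

The "seen" set is what keeps the crawl from looping forever when pages link back to each other.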

Thank you!

**Synth3t1c** · Jul 20th, 2004, 05:41 PM

I have no idea how to do that; however, you might want to watch out for offsite links.

**skanxalot** · Jul 20th, 2004, 10:44 PM

Yes, obviously I don't want to crawl outside links. I figured I would check the domain before crawling.
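The domain check might look something like the sketch below (Python, for illustration only; `is_internal` is a hypothetical helper). It resolves relative links against the current page first, then compares host names. Note the assumption that "www.example.com" and "example.com" count as different hosts; a real crawler may want to treat them as the same.

```python
from urllib.parse import urljoin, urlparse

def is_internal(link, base_url):
    """True if link, resolved against base_url, points at the same host."""
    absolute = urljoin(base_url, link)   # handles relative links like "about.html"
    return urlparse(absolute).hostname == urlparse(base_url).hostname

base = "http://example.com/index.html"
print(is_internal("about.html", base))
print(is_internal("http://example.com/docs/", base))
print(is_internal("http://other.org/", base))
```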

Has no one done this before?

**flycast** · Jul 26th, 2004, 10:41 AM

skanxalot:

How did you extract the links from the HTML document? Is there a method already written to do this, or did you simply write an extractor that searches the file for the "<a" that signifies a link?

Thanks,
Eric

**skanxalot** · Jul 26th, 2004, 04:24 PM

There is a method: getElementsByTagName

This returns a collection of the specified elements, in my case "A".
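For anyone not using the browser control, the same idea (ask a parser for every "A" element and read its href) can be sketched outside the DOM. This is not the getElementsByTagName API itself, just an analogous approach using Python's standard HTML parser; `LinkCollector` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":   # tag and attribute names arrive lowercased
            href = dict(attrs).get("href")
            if href:     # skip anchors without an href, e.g. <a name="top">
                self.links.append(href)

page = '<p><A HREF="a.html">A</A> <a name="top">x</a> <a href="/b/">B</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```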

**balmerch** · Jul 27th, 2004, 09:13 AM

Are you using the web browser control to browse the page and parse the links? Ideally, you could forgo any controls and do this all in code using the System.Net namespace and regular expressions. Either way, I would recommend using regular expressions to parse out the link structures.
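To give a rough idea of the regular-expression route, here is a sketch in Python (the pattern syntax is close to what .NET's Regex class accepts, but treat that as an assumption and test it there). The pattern and the `extract_links` helper are illustrative only, and they deliberately ignore unquoted attributes and malformed markup.

```python
import re

# Matches href="..." or href='...' inside an <a ...> tag; a simplification
# that ignores unquoted attributes and malformed markup.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)

def extract_links(html):
    """Return the href values of all <a> tags found in html."""
    return [m.group(2) for m in HREF_RE.finditer(html)]

page = '<a href="one.html">1</a> <A class="x" HREF=\'two.html\'>2</A> <a name="n">3</a>'
print(extract_links(page))
```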

Chris

**flycast** · Jul 27th, 2004, 09:22 AM

Thanks for your replies...
I want to download a site and build a site map of all its internal and external links. I haven't done much with regular expressions, but they seem pretty complicated.

I would suspect that the XML object and methods would be slower than the regular expressions.

Is there a good tutorial on the web that you recommend?

Thanks,
Eric

**skanxalot** · Jul 27th, 2004, 11:02 AM

Yes, I am using the Web Browser control. Thanks for the suggestions; I would also love to see a tutorial if one is available.

**balmerch** · Jul 27th, 2004, 01:50 PM

Personally, I would recommend using System.Net to download the pages. This removes the need to redistribute the wrappers for the Internet Explorer control (unless you are using .NET 2.0, which has a built-in wrapper for it). If you need some sample code for that, let me know.

If you have the MSDN help installed, you can look up these topics for regular expressions (the URLs will only work with the July edition of the MSDN Library):
1) .NET Development > .NET Framework SDK > Programming with the .NET Framework > Working with Base Types > Manipulating Strings > .NET Framework Regular Expressions > Regular Expression Samples
{ms-help://MS.MSDNQTR.2004JUL.1033/cpguide/html/cpconRegularExpressionExamples.htm}

2) .NET Development > .NET Framework SDK > Reference > Regular Expression Language Elements
{ms-help://MS.MSDNQTR.2004JUL.1033/cpgenref/html/cpconRegularExpressionsLanguageElements.htm}

Personally, I bought a book called Mastering Regular Expressions by Jeffrey Friedl, because the one thing I was looking for I couldn't find in the help. The general concepts, though, I did figure out using these help documents. If you have trouble with something specific, let me know.

Chris

**Andy** · Jul 27th, 2004, 08:29 PM

I'd like to know how to collect .jpg files from a site. Has anyone successfully done this?
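Once the links have been extracted, picking out the images is mostly a filter on the file extension. A small Python sketch, assuming the image URLs are already in a list (`jpg_links` and the sample URLs are hypothetical); downloading each match would be a separate step:

```python
from urllib.parse import urlparse

def jpg_links(urls):
    """Keep URLs whose path ends in .jpg or .jpeg, ignoring case and query strings."""
    return [u for u in urls
            if urlparse(u).path.lower().endswith((".jpg", ".jpeg"))]

# Hypothetical links gathered from a page.
found = [
    "http://example.com/photos/cat.JPG",
    "http://example.com/index.html",
    "http://example.com/img/dog.jpg?size=large",
]
print(jpg_links(found))
```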

**Andy** · Jul 27th, 2004, 10:14 PM

http://msdn.microsoft.com/msdnmag/is...T/default.aspx

Check out that link. It's pretty detailed on how to do just what you want.
