-
Jul 20th, 2004, 03:31 PM
#1
Thread Starter
New Member
Help crawling and extracting links...
Hello,
I'm new to vb.net, so please go easy on me. I'm creating an application for our team and I need to crawl an entire website and use the links to create a TreeView of the website structure, including files and directories.
I've managed to pull all the links from a single page into a TreeView, but I need help crawling the entire site, given just the domain name. Can anyone shed some light?
Thank you!
-
Jul 20th, 2004, 05:41 PM
#2
Lively Member
I have no idea how to do that. However, you might want to watch out for offsite links.
-
Jul 20th, 2004, 10:44 PM
#3
Thread Starter
New Member
Yes, obviously I don't want to crawl outside links. I figured I would check the domain before crawling.
Has no one done this before?
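
For reference, the domain check described above could look something like the sketch below. It is not the poster's code. It assumes .NET 2.0 or later (for Uri.TryCreate), and IsInternalLink is a made-up helper name. Relative links are resolved against the page they were found on, and anything that is not http/https (mailto:, javascript:, and so on) is rejected.

```vbnet
Imports System

Module LinkFilter
    ' Hypothetical helper: resolves href against the page it was found on and
    ' reports whether the result is an http(s) URL on the start page's host.
    Function IsInternalLink(ByVal startPage As Uri, ByVal currentPage As Uri, _
                            ByVal href As String, ByRef resolved As Uri) As Boolean
        resolved = Nothing
        If Not Uri.TryCreate(currentPage, href, resolved) Then Return False
        If resolved.Scheme <> Uri.UriSchemeHttp AndAlso _
           resolved.Scheme <> Uri.UriSchemeHttps Then Return False
        Return String.Equals(resolved.Host, startPage.Host, _
                             StringComparison.OrdinalIgnoreCase)
    End Function

    Sub Main()
        ' www.example.com is a placeholder domain.
        Dim start As New Uri("http://www.example.com/")
        Dim target As Uri = Nothing
        Console.WriteLine(IsInternalLink(start, start, "docs/index.html", target))           ' True
        Console.WriteLine(IsInternalLink(start, start, "http://other.example.org/", target)) ' False
    End Sub
End Module
```

Note that this compares host names exactly, so www.example.com and example.com count as different sites. Whether that is what you want depends on the site being crawled.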
-
Jul 26th, 2004, 10:41 AM
#4
Hyperactive Member
skanxalot:
How did you extract the links from the HTML document? Is there a method already written to do this, or did you simply write an extractor that searches the file for the "<a" that signifies a link?
Thanks,
Eric
-
Jul 26th, 2004, 04:24 PM
#5
Thread Starter
New Member
There is a method: getElementsByTagName.
It returns a collection of all elements with the specified tag name, in my case "A".
-
Jul 27th, 2004, 09:13 AM
#6
Junior Member
Are you using the web browser control to browse the page and parse the links? Ideally you could forgo any controls and do this all in code using the System.Net namespace and regular expressions. Either way, I would recommend using regular expressions to parse out the link structures.
Chris
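
As an illustration of the regular-expression approach suggested above, here is a minimal sketch. It is not taken from any poster's code. It assumes .NET 2.0 or later (for List(Of String)), and the module and function names are made up. The pattern only handles quoted href values inside <a> tags, so unquoted attributes and links built by script are missed.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Matches href="..." or href='...' inside an <a ...> tag.
    ' Both alternatives capture into the same named group, "url".
    Private ReadOnly HrefPattern As New Regex( _
        "<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)

    ' Returns the raw href values in document order (not resolved, not de-duplicated).
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        For Each m As Match In HrefPattern.Matches(html)
            links.Add(m.Groups("url").Value)
        Next
        Return links
    End Function

    Sub Main()
        Dim sample As String = _
            "<p><a href=""/about.html"">About</a> <A class='x' HREF='img/logo.jpg'>Logo</A></p>"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)   ' prints /about.html, then img/logo.jpg
        Next
    End Sub
End Module
```

The values come back exactly as written in the page, so relative links still need to be resolved against the page URL before they can be followed.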
-
Jul 27th, 2004, 09:22 AM
#7
Hyperactive Member
Thanks for your replies...
I want to download a site and build a site map of all its internal and external links. I haven't done much with regular expressions, but they seem pretty complicated.
I would suspect that the XML object and methods would be slower than the regular expressions.
Is there a good tutorial on the web that you recommend?
Thanks,
Eric
-
Jul 27th, 2004, 11:02 AM
#8
Thread Starter
New Member
Yes, I am using the Web Browser control. Thanks for the suggestions; I would also love to see a tutorial if one is available.
-
Jul 27th, 2004, 01:50 PM
#9
Junior Member
Personally, I would recommend using System.Net to download the pages. This removes the need to redistribute the wrappers for the Internet Explorer control (unless you are using .NET 2.0, which has a built-in wrapper for it). If you need some sample code for that, let me know.
If you have the MSDN help installed, you can look up these topics for regular expressions (the URLs will only work with the July 2004 edition of the MSDN Library):
1) .NET Development > .NET Framework SDK > Programming with the .NET Framework > Working with Base Types > Manipulating Strings > .NET Framework Regular Expressions > Regular Expression Samples
{ms-help://MS.MSDNQTR.2004JUL.1033/cpguide/html/cpconRegularExpressionExamples.htm}
2) .NET Development > .NET Framework SDK > Reference > Regular Expression Language Elements
{ms-help://MS.MSDNQTR.2004JUL.1033/cpgenref/html/cpconRegularExpressionsLanguageElements.htm}
Personally, I bought a book called Mastering Regular Expressions by Jeffrey Friedl, because the one thing I was looking for wasn't in the help. I did work out the general concepts from these help documents, though. If you have trouble with something specific, let me know.
Chris
Last edited by balmerch; Jul 27th, 2004 at 01:54 PM.
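
A rough sketch of how downloading with System.Net could be combined with a queue to walk a whole site is shown below. This is an illustration only, not code from the thread. It assumes .NET 2.0 or later (WebClient.DownloadString, generics, Using), and Crawl and maxPages are hypothetical names. It does not check content types, so images and other binary files linked from a page would be downloaded as text, and it ignores robots.txt.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

Module SiteCrawler
    ' Breadth-first crawl that stays on the start page's host.
    ' maxPages is a hypothetical safety limit on how many pages are fetched.
    Function Crawl(ByVal startPage As Uri, ByVal maxPages As Integer) As List(Of Uri)
        Dim seen As New Dictionary(Of String, Boolean)   ' URLs already queued
        Dim pending As New Queue(Of Uri)
        Dim pages As New List(Of Uri)                    ' pages downloaded successfully
        pending.Enqueue(startPage)
        seen(startPage.AbsoluteUri) = True

        Using client As New WebClient()
            While pending.Count > 0 AndAlso pages.Count < maxPages
                Dim page As Uri = pending.Dequeue()
                Dim html As String = Nothing
                Try
                    html = client.DownloadString(page)
                Catch ex As WebException
                    ' Leave html as Nothing so the page is skipped below.
                End Try
                If html Is Nothing Then Continue While
                pages.Add(page)

                ' Simplified pattern: quoted href values, with any #fragment dropped.
                For Each m As Match In Regex.Matches(html, _
                        "href\s*=\s*[""']([^""'#]+)", RegexOptions.IgnoreCase)
                    Dim target As Uri = Nothing
                    If Not Uri.TryCreate(page, m.Groups(1).Value, target) Then Continue For
                    If target.Scheme <> Uri.UriSchemeHttp AndAlso _
                       target.Scheme <> Uri.UriSchemeHttps Then Continue For
                    If target.Host <> startPage.Host Then Continue For
                    If seen.ContainsKey(target.AbsoluteUri) Then Continue For
                    seen(target.AbsoluteUri) = True
                    pending.Enqueue(target)
                Next
            End While
        End Using
        Return pages
    End Function

    Sub Main()
        ' www.example.com is a placeholder start page.
        For Each page As Uri In Crawl(New Uri("http://www.example.com/"), 50)
            Console.WriteLine(page.AbsoluteUri)
        Next
    End Sub
End Module
```

The returned list could then be split on "/" to build the TreeView nodes for directories and files.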
-
Jul 27th, 2004, 08:29 PM
#10
Frenzied Member
I'd like to know how to collect .jpg files from a site. Has anyone successfully done this?
-
Jul 27th, 2004, 10:14 PM
#11
Frenzied Member
http://msdn.microsoft.com/msdnmag/is...T/default.aspx
Check out that link. It goes into a lot of detail on how to do just what you want.