Thread: Help crawling and extracting links...

  1. #1

    Thread Starter
    New Member
    Join Date
    Oct 2002
    Location
    Dallas, TX
    Posts
    9

    Help crawling and extracting links...

    Hello,

    I'm new to VB.NET, so please go easy on me. I'm creating an application for our team that needs to crawl an entire website and use the links to build a TreeView of the site structure, including files and directories.

    I've managed to pull all the links from a single page into a TreeView, but I need help crawling the entire site, given just the domain name. Can anyone shed some light?

    Thank you!

  2. #2
    Lively Member
    Join Date
    Dec 2003
    Posts
    91
    I have no idea how to do that; however, you might want to watch out for offsite links.

  3. #3

    Thread Starter
    New Member
    Join Date
    Oct 2002
    Location
    Dallas, TX
    Posts
    9
    Yes, obviously I don't want to crawl outside links. I figured I would check the domain before crawling.
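
    A rough sketch of that domain check combined with a breadth-first crawl. It is written in Python rather than VB.NET because no code was posted in the thread. The names same_site, crawl and get_links are made up for illustration; get_links stands in for whatever link extraction you already have working for a single page.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse


def same_site(url, root):
    """True when url is an http(s) URL on the same host as root."""
    u, r = urlparse(url), urlparse(root)
    return u.scheme in ("http", "https") and u.hostname == r.hostname


def crawl(root, get_links, max_pages=500):
    """Breadth-first crawl from root; returns on-site URLs in visit order.

    get_links(url) is a hypothetical callback that returns the raw href
    values found on that page.
    """
    seen = {root}
    queue = deque([root])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for href in get_links(page):
            # Resolve relative links against the current page and drop
            # any "#fragment" so the same page is not queued twice.
            link, _ = urldefrag(urljoin(page, href))
            if same_site(link, root) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

    Note that this compares host names exactly, so "www.example.com" and "example.com" would count as different sites; loosen same_site if that matters for you.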

    Has no one done this before?

  4. #4
    Hyperactive Member
    Join Date
    Jul 2004
    Location
    Kansas, USA
    Posts
    352
    skanxalot:

    How did you extract the links from the HTML document? Is there a method that is already written to do this, or did you simply write an extractor that searches for the "<a" in the file that signifies a link?

    Thanks,
    Eric

  5. #5

    Thread Starter
    New Member
    Join Date
    Oct 2002
    Location
    Dallas, TX
    Posts
    9
    There is a method: getElementsByTagName.

    This returns a collection of the specified element, in my case "A".
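
    For comparison, here is roughly the same idea without a browser control, sketched in Python with the standard html.parser module. This is not the getElementsByTagName call itself, just an analogous way to collect every "A" element's href; AnchorCollector and anchors are names invented for this sketch.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...>
        # and <a href=...> are handled the same way.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def anchors(html):
    """Return the href values of all anchors in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

    Anchors without an href (named anchors, for example) are skipped.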

  6. #6
    Junior Member
    Join Date
    Jul 2004
    Location
    Port Huron, Michigan
    Posts
    20
    Are you using the web browser control to browse the page and parse the links? Ideally you could forgo any controls and do this all in code using the System.Net namespace and regular expressions. Either way, I would recommend using regular expressions to parse out the link structures.
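
    To give an idea of what the regular expression approach might look like, here is a small sketch in Python (the pattern syntax carries over to most regex engines, but treat it as an assumption to verify in .NET). HREF_RE and extract_hrefs are illustrative names, and the pattern is deliberately simple: it will be fooled by unusual markup such as attributes whose names end in "-href".

```python
import re

# Matches an <a ...> tag and captures its href value, whether the value is
# double-quoted, single-quoted, or unquoted.
HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_hrefs(html):
    """Return every href found in <a> tags, in document order."""
    found = []
    for match in HREF_RE.finditer(html):
        # Exactly one of the three groups participates in each match.
        found.append(next(g for g in match.groups() if g is not None))
    return found
```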

    Chris

  7. #7
    Hyperactive Member
    Join Date
    Jul 2004
    Location
    Kansas, USA
    Posts
    352
    Thanks for your replies...
    I want to download a site and build a site map of all its internal and external links. I haven't done much with regular expressions, but they seem pretty complicated.
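
    One possible shape for that site map, sketched in Python: internal links go into a nested directory/file tree keyed by path segment, and external links are kept in a separate list. build_site_map and render are hypothetical names, and the nested dictionary is only a stand-in for whatever tree control you end up filling.

```python
from urllib.parse import urlparse


def build_site_map(urls, root_host):
    """Split URLs into a path tree for root_host and a list of external links."""
    tree = {}      # nested dicts: directory or file name -> children
    external = []  # links that point at another host
    for url in urls:
        parts = urlparse(url)
        if parts.hostname != root_host:
            external.append(url)
            continue
        node = tree
        # Walk (and create as needed) one level per non-empty path segment.
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree, external


def render(tree, indent=0):
    """Return the tree as indented text lines, sorted by name at each level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines
```

    Each key in the nested dictionary would correspond to one node in a tree view, with its value holding the child nodes.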

    I would suspect that the XML object and methods would be slower than the regular expressions.

    Is there a good tutorial on the web that you recommend?

    Thanks,
    Eric

  8. #8

    Thread Starter
    New Member
    Join Date
    Oct 2002
    Location
    Dallas, TX
    Posts
    9
    Yes, I am using the Web Browser Control. Thanks for the suggestions; I would also love to see a tutorial if one is available.

  9. #9
    Junior Member
    Join Date
    Jul 2004
    Location
    Port Huron, Michigan
    Posts
    20
    Personally, I would recommend using System.Net to download the pages. This removes the need to redistribute the wrappers for the Internet Explorer control (unless you are using .NET 2.0, which has a built-in wrapper for it). If you need some sample code for that, let me know.

    If you have the MSDN help installed, you can look up these topics for regular expressions (the URLs will only work for the MSDN Library July edition):
    1) .NET Development > .NET Framework SDK > Programming with the .NET Framework > Working with Base Types > Manipulating Strings > .NET Framework Regular Expressions > Regular Expression Samples
    {ms-help://MS.MSDNQTR.2004JUL.1033/cpguide/html/cpconRegularExpressionExamples.htm}

    2) .NET Development > .NET Framework SDK > Reference > Regular Expression Language Elements
    {ms-help://MS.MSDNQTR.2004JUL.1033/cpgenref/html/cpconRegularExpressionsLanguageElements.htm}

    Personally, I bought a book called Mastering Regular Expressions by Jeffrey Friedl, because I couldn't find the one thing I was looking for in the help. I did figure out the general concepts from these help documents, though. If you have trouble with something specific, let me know.

    Chris
    Last edited by balmerch; Jul 27th, 2004 at 01:54 PM.

  10. #10
    Frenzied Member
    Join Date
    Nov 2003
    Posts
    1,489
    I'd like to know how to collect .jpg files from a site. Has anyone successfully done this?
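
    One way this might be approached, sketched in Python: scan a page for img and a tags and keep any target whose path ends in .jpg or .jpeg, resolved against the page URL. JpgCollector and find_jpgs are names invented for this sketch, and downloading the files is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class JpgCollector(HTMLParser):
    """Collects absolute URLs of JPEG files referenced by <img> or <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            target = attrs.get("src")
        elif tag == "a":
            target = attrs.get("href")
        else:
            target = None
        if not target:
            return
        url = urljoin(self.base_url, target)
        # Check only the path, so query strings like "?size=2" are ignored.
        is_jpg = urlparse(url).path.lower().endswith((".jpg", ".jpeg"))
        if is_jpg and url not in self.images:
            self.images.append(url)


def find_jpgs(page_url, html):
    """Return the de-duplicated JPEG URLs referenced by one page."""
    collector = JpgCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.images
```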

  11. #11
    Frenzied Member
    Join Date
    Nov 2003
    Posts
    1,489
    http://msdn.microsoft.com/msdnmag/is...T/default.aspx

    Check out that link. It's pretty detailed on how to do just what you want.
