Thread: From String to HTMLDocument.Links

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2006
    Posts
    675

    From String to HTMLDocument.Links

    I'm trying to write a function that can retrieve all the links from a webpage. I'd like to send only a string containing the URL. Basically, given a string of a URL, I'd like to "load" that into an HTMLDocument so I can access the Links collection. I just can't figure out that part.

    I've already written the function by using the Document in a WebBrowser. But after selecting a link, I'd like to get its links, and so on, while the user is still browsing the first page. Does anyone know how to do this?

    Thanks
    VB.Net 2008
    .Net Framework 2.0

    "Must you breathe? 'Cause I need heaven..."

  2. #2
    Karen Payne MVP kareninstructor
    Join Date
    Jun 2008
    Location
    Oregon
    Posts
    6,714

    Re: From String to HTMLDocument.Links

    Are you using code similar to this?


    I tested it with this page in a WebBrowser control.

    The first section gets all links. The second one begins to drill down into the DIV tag with the ID "content" (the page has only one tag with that ID).

    Code:
          ' List the text and target of every anchor on the loaded page
          Dim HyperLinks As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("A")
          For Each link As HtmlElement In HyperLinks
             Console.WriteLine("text [{0}] - link [{1}]", link.InnerText, link.GetAttribute("href"))
          Next
    
          ' Find the DIV whose ID is "content" (this LINQ query needs .NET 3.5 and Imports System.Linq)
          Dim ParentGroup = From p In WebBrowser1.Document.GetElementsByTagName("DIV").Cast(Of HtmlElement)() _
                      Where p.Id = "content"
    
          ' The Where clause has already filtered on the ID, so every item here is a match
          For Each item As HtmlElement In ParentGroup
             Console.WriteLine("GOT IT")
             ' continue here
          Next

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2006
    Posts
    675

    Re: From String to HTMLDocument.Links

    Currently, I'm looping through the links in a WebBrowser:

    For Each element As HtmlElement In browser.Document.Links
        'Do whatever
    Next

    What I need to do is select a link, then get the HTML for it, then go through the link's links, grab one, get the HTML, go through that link's links, and so on....

    So, if you started, say, here:
    http://en.wikipedia.org/wiki/Visual_Basic_.NET

    It would go through the page and grab a random link and go there, for instance:
    http://en.wikipedia.org/wiki/Backward_compatibility

    Then it would grab a random link from the Backward Compatibility page, for instance:
    http://en.wikipedia.org/wiki/PCI_Express

    and so on, for as many levels as the user specifies. The thing is, it should be doing this while the user is still viewing the first page (http://en.wikipedia.org/wiki/Visual_Basic_.NET). I already have it doing this, but only for one level, because I can't change the URL of the browser without disrupting the user's viewing of that page. So, I'd like to take a URL and "load" it into an HTMLDocument. At least, that's what I think I want to do; I'm open to suggestions.

  4. #4
    Karen Payne MVP kareninstructor
    Join Date
    Jun 2008
    Location
    Oregon
    Posts
    6,714

    Re: From String to HTMLDocument.Links

    This should be interesting for you.

    Try the following (which is going to take work on your end, not a simple solution)

    http://htmlagilitypack.codeplex.com/
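
    To give an idea of the work involved, here is a minimal sketch of pulling the links out of a page's HTML text with the Html Agility Pack. It assumes the library is referenced in the project and that you already have the page source in a string. The module name LinkExtractor and the function name ExtractLinks are made up for illustration. The type is written out as HtmlAgilityPack.HtmlDocument because a WinForms project also has System.Windows.Forms.HtmlDocument in scope.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module LinkExtractor
    ' Hypothetical helper: returns the raw href value of every <a href="..."> in the HTML text.
    Public Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors IsNot Nothing Then
            For Each anchor As HtmlNode In anchors
                links.Add(anchor.GetAttributeValue("href", String.Empty))
            Next
        End If

        Return links
    End Function
End Module
```

    The href values come back exactly as written in the page, so relative ones still need to be resolved against the page's URL before you can follow them. The library's HtmlWeb class can also fetch and parse a URL in one step, if you would rather not download the text yourself.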

  5. #5

    Thread Starter
    Fanatic Member
    Join Date
    Nov 2006
    Posts
    675

    Re: From String to HTMLDocument.Links

    Thanks, I'll take a look at that link. However, I solved the problem. Luckily I also stumbled upon WebClient, which probably (though I'm not sure) allows me to go straight to the HTML without having to download any images or anything. This seems to work pretty fast.

    I'm not entirely certain this works, because I do a lot of other stuff which I ripped out to give the simplest solution here, and I didn't test this exact code. But, the important stuff is there. I'm getting errors while testing it, but I just figured this out a few minutes ago.

    vb Code:
    ' Needs these at the top of the code file:
    Imports System.Collections
    Imports System.Net
    Imports System.Text.RegularExpressions

    ' Picks a random link from the page at strSource and follows it, up to Level more times
    Private Sub FindAllLinksByRegex(ByVal strSource As String, ByVal Level As Integer)
        Dim strHTML As String
        Dim arrayKeep As New ArrayList
        Dim rnd As New Random
        Dim baseUri As New Uri(strSource)
        Dim linkUri As Uri = Nothing
        Dim strSelection As String

        'Get the HTML text
        Using wc As New WebClient
            strHTML = wc.DownloadString(baseUri)
        End Using

        'Use Regex to find every <a ...>...</a> element (\b keeps tags like <abbr> out)
        Dim m1 As MatchCollection = Regex.Matches(strHTML, "(<a\b.*?>.*?</a>)", RegexOptions.Singleline Or RegexOptions.IgnoreCase)

        For Each m As Match In m1
            'Second Regex pulls out the double-quoted href value, if the tag has one
            Dim m2 As Match = Regex.Match(m.Groups(1).Value, "href=""(.*?)""", RegexOptions.Singleline Or RegexOptions.IgnoreCase)

            'Resolve relative links (e.g. /wiki/PCI_Express) against the page they came from;
            'passing them to DownloadString unresolved fails
            If m2.Success AndAlso Uri.TryCreate(baseUri, m2.Groups(1).Value, linkUri) Then
                'Don't grab a link to the page we're already on
                If linkUri.AbsoluteUri <> Me.WebBrowser1.Url.AbsoluteUri Then
                    arrayKeep.Add(linkUri.AbsoluteUri)
                End If
            End If
        Next

        If arrayKeep.Count <> 0 Then
            'Grab a link at random from the ArrayList
            strSelection = CStr(arrayKeep(rnd.Next(arrayKeep.Count)))

            'Recursively call the function while decrementing Level
            If Level > 0 Then Me.FindAllLinksByRegex(strSelection, Level - 1)
        End If
    End Sub
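
    One thing that may help with errors when following links pulled out this way: not every href is a page you can download. Some are relative (/wiki/...), some are in-page anchors (#section), and some use other schemes such as mailto: or javascript:. Below is a rough sketch of a stricter filter, assuming you only want to follow http and https pages. The module name LinkFilter and the function name TryResolveLink are hypothetical, and pageUrl is assumed to be a valid absolute URL.

```vbnet
Imports System

Public Module LinkFilter
    ' Hypothetical helper: resolves href against pageUrl and reports whether it is worth following.
    ' Assumes pageUrl is a valid absolute URL (New Uri throws otherwise).
    Public Function TryResolveLink(ByVal pageUrl As String, ByVal href As String, ByRef resolved As String) As Boolean
        resolved = Nothing
        Dim target As Uri = Nothing

        ' Turns relative hrefs into absolute ones; returns False for text that isn't a usable URI
        If Not Uri.TryCreate(New Uri(pageUrl), href, target) Then Return False

        ' Skip mailto:, ftp: and anything else that isn't a web page
        If target.Scheme <> Uri.UriSchemeHttp AndAlso target.Scheme <> Uri.UriSchemeHttps Then Return False

        ' Drop the #fragment so in-page anchors collapse to the page itself
        resolved = target.GetLeftPart(UriPartial.Query)
        Return True
    End Function
End Module
```

    For example, /wiki/PCI_Express found on http://en.wikipedia.org/wiki/Visual_Basic_.NET resolves to http://en.wikipedia.org/wiki/PCI_Express. An href of #History resolves back to the page's own URL, so the "don't grab a link to the page we're already on" check can then discard it.
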
