From String to HTMLDocument.Links
I'm trying to write a function that retrieves all the links from a webpage, given only a string containing the URL. Basically, I'd like to "load" that URL into an HTMLDocument so I can access its Links collection. I just can't figure out that part.
I've already written the function using the Document in a WebBrowser. But after selecting a link, I'd like to get its links, and so on, while the user is still browsing the first page. Does anyone know how to do this?
Thanks
Re: From String to HTMLDocument.Links
Are you using code similar to this?
I tested it with this page
in a web browser control
The first section gets all the links on the page. The second one begins to drill down into a DIV tag, on a page where only one tag has the ID "content".
Code:
Dim HyperLinks As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("A")
If HyperLinks.Count > 0 Then
    For Each link As HtmlElement In HyperLinks
        Console.WriteLine("text [{0}] - link [{1}]", link.InnerText, link.GetAttribute("href"))
    Next
End If

Dim ParentGroup = From p In WebBrowser1.Document.GetElementsByTagName("DIV").Cast(Of HtmlElement)() _
                  Where p.Id = "content"
If ParentGroup.Count > 0 Then
    For Each item As HtmlElement In ParentGroup
        If item.Id = "content" Then
            Console.WriteLine("GOT IT")
            ' continue here
        End If
    Next
End If
Re: From String to HTMLDocument.Links
Currently, I'm looping through the links in a WebBrowser:
For Each element As HtmlElement In browser.Document.Links
'Do whatever
Next
What I need to do is select a link, get the HTML for it, go through that page's links, grab one, get its HTML, go through that page's links, and so on.
So, if you started, say, here:
http://en.wikipedia.org/wiki/Visual_Basic_.NET
It would go through the page and grab a random link and go there, for instance:
http://en.wikipedia.org/wiki/Backward_compatibility
Then it would grab a random link from the Backward Compatibility page, for instance:
http://en.wikipedia.org/wiki/PCI_Express
and so on for as many levels as the user specifies. The thing is, it should be doing this while the user is still viewing the first page (http://en.wikipedia.org/wiki/Visual_Basic_.NET). I already have it doing this, but only for one level, because I can't change the URL of the browser without disrupting the user's viewing of that page. So, I'd like to take a URL and "load" it into an HTMLDocument. At least, that's what I think I want to do; I'm open to suggestions.
Re: From String to HTMLDocument.Links
This should be interesting for you.
Try the following (which is going to take work on your end, not a simple solution)
http://htmlagilitypack.codeplex.com/
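To give an idea of what that involves, here is a rough sketch of pulling the links out of an HTML string with the HTML Agility Pack, without touching the WebBrowser. It assumes the HtmlAgilityPack assembly is referenced in the project. The module name LinkSketch and the helper ExtractLinks are made up for this example, and the HTML string would normally come from wherever you download the page. Note that the type is written out as HtmlAgilityPack.HtmlDocument so it doesn't clash with the WinForms HtmlDocument.

```vb
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module LinkSketch
    ' Hypothetical helper: returns the absolute URL of every <a href="..."> in the HTML.
    ' pageUrl is the address the HTML came from; relative links are resolved against it.
    Function ExtractLinks(ByVal html As String, ByVal pageUrl As String) As List(Of String)
        Dim links As New List(Of String)
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return links

        Dim baseUri As New Uri(pageUrl)
        For Each anchor As HtmlNode In anchors
            ' Decode entities such as &amp; before building the URL
            Dim href As String = HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", ""))
            Dim absolute As Uri = Nothing
            If Uri.TryCreate(baseUri, href, absolute) Then links.Add(absolute.AbsoluteUri)
        Next
        Return links
    End Function

    Sub Main()
        Dim html As String = "<p><a href=""/wiki/PCI_Express"">PCI</a> <a href=""http://example.com/"">Ex</a></p>"
        For Each link As String In ExtractLinks(html, "http://en.wikipedia.org/wiki/Visual_Basic_.NET")
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Since nothing here navigates the browser control, the page the user is viewing stays where it is. You would still need to filter out links you don't want (mailto:, javascript:, the current page) before picking one.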
Re: From String to HTMLDocument.Links
Thanks, I'll take a look at that link. However, I solved the problem. Luckily I also stumbled upon WebClient, which probably (though I'm not sure) allows me to go straight to the HTML without having to download any images or anything. This seems to work pretty fast.
I'm not entirely certain this works, because I do a lot of other stuff which I ripped out to give the simplest solution here, and I didn't test this exact code. But, the important stuff is there. I'm getting errors while testing it, but I just figured this out a few minutes ago.
vb Code:
'These go at the top of the file:
Imports System.Collections
Imports System.Net
Imports System.Text.RegularExpressions

Private Sub FindAllLinksByRegex(ByVal strSource As String, ByVal Level As Integer)
    Dim strHTML As String
    Dim arrayKeep As New ArrayList
    Dim arrayToss As New ArrayList
    Dim rnd As New Random
    Dim strLink As String
    Dim strSelection As String
    Dim baseUri As New Uri(strSource)

    'Get the HTML text only (DownloadString fetches just this one resource)
    Using wc As New WebClient
        strHTML = wc.DownloadString(baseUri)
    End Using

    'Find every anchor element; \b stops <a from also matching <abbr>, <article>, etc.
    Dim m1 As MatchCollection = Regex.Matches(strHTML, "(<a\b.*?>.*?</a>)", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
    For Each m As Match In m1
        'Pull the href value out of the anchor (only double-quoted values are matched)
        Dim m2 As Match = Regex.Match(m.Groups(1).Value, "href=\""(.*?)\""", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Dim linkUri As Uri = Nothing
        'Resolve relative links (e.g. /wiki/PCI_Express) against the current page and skip mailto:, javascript:, etc.
        If m2.Success AndAlso Uri.TryCreate(baseUri, m2.Groups(1).Value, linkUri) AndAlso linkUri.Scheme.StartsWith("http") Then
            strLink = linkUri.AbsoluteUri
            'Don't grab a link to the page we're already on
            If strLink <> Me.WebBrowser1.Url.ToString Then
                arrayKeep.Add(strLink)
            Else
                arrayToss.Add(strLink)
            End If
        End If
    Next

    If arrayKeep.Count <> 0 Then
        'Grab a link at random from the ArrayList
        strSelection = CStr(arrayKeep(rnd.Next(arrayKeep.Count)))
        'Recursively call the function while decrementing Level
        If Level > 0 Then Me.FindAllLinksByRegex(strSelection, Level - 1)
    End If
End Sub