-
Mar 1st, 2010, 11:55 AM
#1
Thread Starter
Fanatic Member
From String to HTMLDocument.Links
I'm trying to write a function that retrieves all the links from a webpage, given only a string containing the URL. Basically, I'd like to "load" that URL into an HTMLDocument so I can access its Links collection. I just can't figure out that part.
I've already written the function using the Document in a WebBrowser. But after selecting a link, I'd like to get that page's links, and so on, while the user is still browsing the first page. Does anyone know how to do this?
Thanks
VB.Net 2008
.Net Framework 2.0
"Must you breathe? 'Cause I need heaven..."
-
Mar 1st, 2010, 01:05 PM
#2
Re: From String to HTMLDocument.Links
Are you using code similar to this? I tested it with this page in a WebBrowser control.
The first section gets all links; the second begins to drill down into a DIV tag, where there is only one tag with the ID of content.
Code:
' Section 1: list the text and target of every anchor on the page
Dim HyperLinks As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("A")
If HyperLinks.Count > 0 Then
    For Each link As HtmlElement In HyperLinks
        Console.WriteLine("text [{0}] - link [{1}]", link.InnerText, link.GetAttribute("href"))
    Next
End If

' Section 2: find the DIV whose ID is "content"
Dim ParentGroup = From p In WebBrowser1.Document.GetElementsByTagName("DIV").Cast(Of HtmlElement)() _
                  Where p.Id = "content"
If ParentGroup.Count > 0 Then
    For Each item As HtmlElement In ParentGroup
        If item.Id = "content" Then
            Console.WriteLine("GOT IT")
            ' continue here
        End If
    Next
End If
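The drill-down in the second section can also be sketched without the query syntax, which may matter when targeting .NET Framework 2.0. This is only a sketch: the module name ContentLinks and the helper PrintContentLinks are made up for illustration, and it assumes the page has an element with the ID "content".

```vb
Imports System
Imports System.Windows.Forms

Module ContentLinks
    ' Hypothetical helper: prints every link inside the element whose ID is "content".
    Public Sub PrintContentLinks(ByVal browser As WebBrowser)
        ' Document is Nothing until a page has been loaded.
        If browser.Document Is Nothing Then Return

        ' GetElementById returns Nothing when no element has that ID.
        Dim content As HtmlElement = browser.Document.GetElementById("content")
        If content Is Nothing Then Return

        ' Only anchors nested inside the content element are visited.
        For Each link As HtmlElement In content.GetElementsByTagName("A")
            Console.WriteLine("text [{0}] - link [{1}]", link.InnerText, link.GetAttribute("href"))
        Next
    End Sub
End Module
```

It would be called as PrintContentLinks(WebBrowser1) once the page has finished loading.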
-
Mar 1st, 2010, 01:14 PM
#3
Thread Starter
Fanatic Member
Re: From String to HTMLDocument.Links
Currently, I'm looping through the links in a WebBrowser:
For Each element As HtmlElement In browser.Document.Links
    'Do whatever
Next
What I need to do is select a link, then get the HTML for it, then go through the link's links, grab one, get the HTML, go through that link's links, and so on....
So, if you started, say, here:
http://en.wikipedia.org/wiki/Visual_Basic_.NET
It would go through the page, grab a random link, and go there, for instance:
http://en.wikipedia.org/wiki/Backward_compatibility
Then it would grab a random link from the Backward Compatibility page, for instance:
http://en.wikipedia.org/wiki/PCI_Express
and so on, for as many levels as the user specifies. The thing is, it should do this while the user is still viewing the first page (http://en.wikipedia.org/wiki/Visual_Basic_.NET). I already have it working, but only for one level, because I can't change the URL of the browser without disrupting the user's view of that page. So I'd like to take a URL and "load" it into an HTMLDocument. At least, that's what I think I want to do; I'm open to suggestions.
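One way this could be approached is with a second WebBrowser that is never placed on a form, so the visible browser keeps its page. The following is a rough sketch under that assumption, not a tested solution: the class name BackgroundLinkReader and the method BeginLoad are invented for illustration, and it assumes the code runs on the UI thread of a Windows Forms application.

```vb
Imports System
Imports System.Windows.Forms

Public Class BackgroundLinkReader
    ' A second browser that is never added to a form, so the user never sees it.
    Private WithEvents hiddenBrowser As New WebBrowser()

    ' Starts loading the page; Navigate returns before the page has loaded.
    Public Sub BeginLoad(ByVal url As String)
        hiddenBrowser.ScriptErrorsSuppressed = True
        hiddenBrowser.Navigate(url)
    End Sub

    Private Sub hiddenBrowser_DocumentCompleted(ByVal sender As Object, _
            ByVal e As WebBrowserDocumentCompletedEventArgs) _
            Handles hiddenBrowser.DocumentCompleted
        ' The event can also fire for frames; only act on the top-level page.
        If Not e.Url.Equals(hiddenBrowser.Url) Then Return

        For Each element As HtmlElement In hiddenBrowser.Document.Links
            Console.WriteLine(element.GetAttribute("href"))
        Next
    End Sub
End Class
```

Because the hidden browser still downloads images and runs scripts, this is likely slower than fetching only the HTML text.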
-
Mar 1st, 2010, 02:02 PM
#4
Re: From String to HTMLDocument.Links
This should be interesting for you.
Try the following (which is going to take work on your end, not a simple solution)
http://htmlagilitypack.codeplex.com/
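As a starting point, a minimal sketch of collecting links with the HTML Agility Pack might look like the following. It assumes the project references the HtmlAgilityPack assembly; the module name AgilityPackLinks and the function GetLinks are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic

Module AgilityPackLinks
    ' Hypothetical helper: returns the href value of every anchor on the page.
    Public Function GetLinks(ByVal url As String) As List(Of String)
        Dim links As New List(Of String)

        ' HtmlWeb downloads and parses the page without a browser control.
        Dim web As New HtmlAgilityPack.HtmlWeb()
        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(url)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim anchors As HtmlAgilityPack.HtmlNodeCollection = _
            doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors IsNot Nothing Then
            For Each anchor As HtmlAgilityPack.HtmlNode In anchors
                links.Add(anchor.GetAttributeValue("href", String.Empty))
            Next
        End If

        Return links
    End Function
End Module
```

The href values come back exactly as written in the page, so relative links still need to be resolved against the page URL before they can be followed.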
-
Mar 1st, 2010, 02:28 PM
#5
Thread Starter
Fanatic Member
Re: From String to HTMLDocument.Links
Thanks, I'll take a look at that link. However, I solved the problem. Luckily I also stumbled upon WebClient, which probably (though I'm not sure) allows me to go straight to the HTML without having to download any images or anything. This seems to work pretty fast.
I'm not entirely certain this works, because I do a lot of other stuff which I ripped out to give the simplest solution here, and I didn't test this exact code. But, the important stuff is there. I'm getting errors while testing it, but I just figured this out a few minutes ago.
vb Code:
'These go at the top of the file
Imports System.Collections
Imports System.Net
Imports System.Text.RegularExpressions

Private Sub FindAllLinksByRegex(ByVal strSource As String, ByVal Level As Integer)
    Dim strHTML As String
    Dim wc As New WebClient
    Dim arrayKeep As New ArrayList
    Dim arrayToss As New ArrayList
    Dim rnd As New Random
    Dim strLink As String
    Dim strSelection As String

    'Get the HTML text
    strHTML = wc.DownloadString(strSource)

    'Use Regex to find all <a> tags (I think, I don't really understand Regex)
    Dim m1 As MatchCollection = Regex.Matches(strHTML, "(<a.*?>.*?</a>)", RegexOptions.Singleline)

    For Each m As Match In m1
        'Second Regex to ensure it is a link (once again, I think)
        Dim m2 As Match = Regex.Match(m.Groups(1).Value, "href=\""(.*?)\""", RegexOptions.Singleline)
        If m2.Success Then
            strLink = m2.Groups(1).Value
            'Don't grab a link to the page we're already on
            If strLink <> Me.WebBrowser1.Url.ToString Then
                arrayKeep.Add(strLink)
            Else
                arrayToss.Add(strLink)
            End If
        End If
    Next

    If arrayKeep.Count <> 0 Then
        'Grab a link at random from the ArrayList
        strSelection = CStr(arrayKeep(rnd.Next(arrayKeep.Count)))
        'Recursively call the function while decrementing Level
        If Level > 0 Then Me.FindAllLinksByRegex(strSelection, Level - 1)
    End If
End Sub
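One likely source of errors with the function above is that many href values are relative (for example /wiki/PCI_Express), and WebClient.DownloadString cannot fetch those as they stand. A possible fix, sketched here with made-up names (LinkResolver, ExtractAbsoluteLinks), is to resolve each href against the URL of the page it came from before keeping it. The regex is a slightly tightened variant of the one above and is an assumption, not the original author's pattern.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkResolver
    ' Hypothetical helper: returns absolute http/https URLs for the anchors in html.
    ' pageUrl is the address the HTML was downloaded from.
    Public Function ExtractAbsoluteLinks(ByVal html As String, ByVal pageUrl As String) As List(Of String)
        Dim results As New List(Of String)
        Dim baseUri As New Uri(pageUrl)

        ' \b keeps tags such as <abbr> from matching; only double-quoted hrefs are handled.
        Dim pattern As String = "<a\b[^>]*?href=""(.*?)"""
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

        For Each m As Match In Regex.Matches(html, pattern, options)
            Dim absolute As Uri = Nothing
            ' TryCreate resolves relative hrefs against the page and rejects malformed ones.
            If Uri.TryCreate(baseUri, m.Groups(1).Value, absolute) Then
                ' Skip schemes WebClient is not meant to follow here (mailto:, javascript:, ...).
                If absolute.Scheme = Uri.UriSchemeHttp OrElse absolute.Scheme = Uri.UriSchemeHttps Then
                    results.Add(absolute.AbsoluteUri)
                End If
            End If
        Next

        Return results
    End Function
End Module
```

For instance, an anchor with href="/wiki/PCI_Express" found on http://en.wikipedia.org/wiki/Visual_Basic_.NET resolves to http://en.wikipedia.org/wiki/PCI_Express, which can then be passed to DownloadString on the next level.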