Scrape a href from HTML but the right one.
Hey guys, I'm trying to scrape the right URL from an HTML page using the WebBrowser control.
I want to scrape this href and navigate to it. The problem is that the reply link on every other comment looks almost the same, so if I scrape all the hrefs and check the name, I get the reply buttons of all the comments plus the new comment button. Is there a way to grab only this one link, by its class name or something?
The one I need:
Code:
<a href="forums.php?op=post&p=1409951"><img src="/images/icons/comment_add.png" class="inline_icon" align="top"> New Comment</a>
The ones I don't need:
Code:
<a href="forums.php?op=post&p=1409971">Reply To This</a>
I'm trying to create my own browser, and this should be a shortcut button for when I want to comment. Thanks a lot.
Re: Scrape a href from HTML but the right one.
vb.net Code:
For Each h As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
    If h.InnerText = "Reply To This" AndAlso System.Text.RegularExpressions.Regex.Match(h.GetAttribute("href"), "forums\.php\?op=post&p=\d*?", System.Text.RegularExpressions.RegexOptions.IgnoreCase).Success Then
        WebBrowser1.Navigate(h.GetAttribute("href"))
        Exit For
    End If
Next
Hope that helps.
Re: Scrape a href from HTML but the right one.
Quote:
Originally Posted by J-Deezy
vb.net Code:
For Each h As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
    If h.InnerText = "Reply To This" AndAlso System.Text.RegularExpressions.Regex.Match(h.GetAttribute("href"), "forums\.php\?op=post&p=\d*?", System.Text.RegularExpressions.RegexOptions.IgnoreCase).Success Then
        WebBrowser1.Navigate(h.GetAttribute("href"))
        Exit For
    End If
Next
Hope that helps.
Thanks, it looks like what I need, but it won't navigate. Nothing happens :(
I tried a MsgBox to see if there is a link, but no MsgBox appears either.
Re: Scrape a href from HTML but the right one.
The most likely problem would be that either the Regex is not matching, or the inner text is incorrect.
First try this:
vb.net Code:
For Each h As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
    If h.InnerText.ToLower.Contains("reply to this") Then
        MsgBox("found the appropriate innertext")
        If System.Text.RegularExpressions.Regex.Match(h.GetAttribute("href"), "forums\.php\?op=post&p=\d*?").Success Then
            MsgBox("match was successful")
            WebBrowser1.Navigate(h.GetAttribute("href"))
            Exit For
        Else
            MsgBox("it's the match that's failing" & vbNewLine & h.GetAttribute("href"))
        End If
    End If
Next
And report back what messageboxes, if any, appear.
Re: Scrape a href from HTML but the right one.
It could be that "forums.php?op=post&p=1409971" is not a valid URL on its own. You might have to do WebBrowser1.Navigate("www.thewebsite.com/" & h.GetAttribute("href"))
Re: Scrape a href from HTML but the right one.
He said he DOESN'T want the ones that say "Reply to this", but the IF condition says "if it DOES contain that". Wouldn't it be "If Not h.InnerText.ToLower.Contains("reply to this") Then" ?
Also, +1 on the relative URL path. He'll need to append the whole domain name to it. To be even more accurate, before navigating, take the current URL, lop off the filename and append that to the beginning. The forums could be 9 directories deep for all we know.
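For what it's worth, the "take the current URL and prepend it" step can probably be left to System.Uri rather than done with string chopping. This is only a sketch: ResolveHref is a made-up helper name and the example.com address is a placeholder for whatever the forum's real URL is. In the browser app, the page address would presumably come from WebBrowser1.Url.

```vbnet
Imports System

Module RelativeUrlSketch
    ' Hypothetical helper: resolves an href against the address of the page it came from.
    ' If href is already absolute, Uri simply returns it unchanged.
    Function ResolveHref(pageUrl As String, href As String) As String
        Dim baseUri As New Uri(pageUrl)
        Dim resolved As New Uri(baseUri, href)
        Return resolved.AbsoluteUri
    End Function

    Sub Main()
        ' Placeholder page address; the real forum could be any number of directories deep.
        Dim page As String = "http://www.example.com/a/b/forums.php?op=view&t=1"
        Console.WriteLine(ResolveHref(page, "forums.php?op=post&p=1409951"))
        ' Prints: http://www.example.com/a/b/forums.php?op=post&p=1409951
    End Sub
End Module
```

Because the Uri constructor handles both relative and absolute input, the same call should be safe whether GetAttribute("href") hands back the short form from the page source or an already expanded URL.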
Re: Scrape a href from HTML but the right one.
Sometimes the href in the page source isn't a whole URL, but when you read the href attribute it can come back as the full absolute URL. I can't physically test this, so you'll have to do some debugging.
As for the wrong button's href: I only glanced at the thread, but it's a simple mistake with a simple solution:
Code:
If h.InnerText.ToLower.Contains("reply to this") Then
change to:
Code:
If h.InnerText.ToLower.Contains("new comment") Then
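One more thing that may be worth guarding against: InnerText can be Nothing for anchors that contain no text (an image-only link, for example), and calling .ToLower on it would then throw. Below is a rough sketch of the whole check pulled into one function. IsNewCommentLink is a name invented for illustration, and it assumes the link text is exactly "New Comment" apart from surrounding whitespace, as in the HTML from the first post.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NewCommentLinkSketch
    ' Hypothetical helper: True only for the "New Comment" link, never for "Reply To This".
    Function IsNewCommentLink(innerText As String, href As String) As Boolean
        ' Anchors without text or without an href cannot be the link we want.
        If innerText Is Nothing OrElse href Is Nothing Then Return False
        ' The text after the icon has a leading space in the page source, so trim it first.
        If Not innerText.Trim().Equals("New Comment", StringComparison.OrdinalIgnoreCase) Then Return False
        ' \d+ requires at least one digit after p=
        Return Regex.IsMatch(href, "forums\.php\?op=post&p=\d+", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Console.WriteLine(IsNewCommentLink(" New Comment", "forums.php?op=post&p=1409951"))  ' True
        Console.WriteLine(IsNewCommentLink("Reply To This", "forums.php?op=post&p=1409971")) ' False
        Console.WriteLine(IsNewCommentLink(Nothing, "forums.php?op=post&p=1409951"))         ' False
    End Sub
End Module
```

Inside the For Each loop, the condition would then become something like If IsNewCommentLink(h.InnerText, h.GetAttribute("href")) Then, followed by the Navigate call and Exit For as before.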