Hi experts, again.
Is there a way to get the hyperlinks from an Internet page?
Just read the HTML into a string, then use Regex to get the links...
Reading the page into a string:
http://www.vbforums.com/showthread.php?t=372593
Regex example getting things between <p> tags:
http://www.vbforums.com/showthread.php?t=391698
Here, this will find all links on a page, took me a while to write.. :bigyello:
VB Code:
Option Strict On
Option Explicit On

Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim WebParse As New WebPageLinks("http://vbforums.com/")
        Dim URLs As Specialized.StringCollection = WebParse.Execute()
        For Each WebAddress As String In URLs
            ListBox1.Items.Add(WebAddress)
        Next
    End Sub
End Class

Public Class WebPageLinks
    Dim Web As String

    Public Function Execute() As Specialized.StringCollection
        Dim Inet As New Net.WebClient
        Dim ColLinks As New Specialized.StringCollection
        Dim WebText As New IO.StreamReader(Inet.OpenRead(Web))
        Dim Parse As String
        Dim Domain As String
        ColLinks.AddRange(Microsoft.VisualBasic.Split(WebText.ReadToEnd.ToString, "www."))
        ColLinks.RemoveAt(0)
        For t As Int32 = 0 To ColLinks.Count - 1
            Parse = ColLinks(t).Substring(0, ColLinks(t).IndexOf(".") + 4)
            Domain = Parse
            Do
                Try
                    Parse = ColLinks(t).Substring(0, Parse.Length + 1)
                    IO.Path.GetFileName(Parse)
                Catch ex As Exception
                    Parse = Domain
                    Exit Do
                End Try
            Loop Until Parse.Chars(Parse.Length - 4) = "."
            ColLinks(t) = "www." & Parse
        Next
        Return ColLinks
    End Function

    Public Sub New(ByVal Website As String)
        Web = Website
    End Sub
End Class
Um, String.Split? :) (in reply to Remix)
String.Split does allow multiple letters. BTW I messed it up, it only shows domain names ATM, so gimme a few.
Okay I fixed it :wave:
Amazing code, but :( (in reply to |2eM!x)
it only picks up www.site.com.
What if the hyperlink is www.site.com/index.php?showforum=xxx?
Thanks to you all.
I will check Regex too, it is really damn fast.
Cheers
Not to mention a hyperlink doesn't always have www in it. Some designers simply put ../Images/1.gif etc., and some sites' subdomain is not www.
With Regex I tried this, but I think something is wrong with my RegularExpressions.Regex:
VB Code:
Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
    Dim returnstring As String
    returnstring = SearchPage("http://www.yahoo.com")
    Dim Regex As New System.Text.RegularExpressions.Regex("(?<=<a href=>).*?(?=</a>)")
    Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(returnstring)
    For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
        MsgBox(FoundMatch.Value)
    Next
End Sub

'Function Code...
Private Function SearchPage(ByVal sURL As String) As String
    Dim client As System.Net.WebClient = New System.Net.WebClient
    Dim data As System.IO.Stream = client.OpenRead(sURL)
    Dim reader As System.IO.StreamReader = New System.IO.StreamReader(data)
    SearchPage = reader.ReadToEnd
End Function
(In reply to OMITT3D)
I know, but the code will work with sites that have links like
http://www.site.com/page
Look at the source of most websites, Google for example:
<a href=/intl/en/about.html>About Google</a>
<a href="/ads/">Advertising Programs</a>
<a href=/language_tools?hl=en>Language Tools</a>
<a href=/preferences?hl=en>Preferences</a>
Etc.
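Since the Google lines above mix quoted and unquoted href values, here is a rough sketch of one pattern that accepts both. The sample string, the module name HrefDemo and the group name "url" are made up for illustration; a regex like this is a quick scraping aid, not a real HTML parser, so expect it to miss odd markup.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefDemo
    Sub Main()
        ' Sample markup copied from two of the Google lines quoted above.
        Dim html As String = "<a href=/intl/en/about.html>About Google</a>" & _
                             "<a href=""/ads/"">Advertising Programs</a>"

        ' The href value may be double-quoted, single-quoted, or bare.
        ' .NET lets the same group name ("url") appear in each alternative.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ' Only the captured href value is printed, not the whole tag.
            Console.WriteLine(m.Groups("url").Value)
        Next
    End Sub
End Module
```

The values come back exactly as written in the source, so relative ones such as /ads/ still need the site address put in front before they can be opened.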
|2eM!x, appreciated work, and worth rating.
But for example, if I try to get all the threads in this forum section:
http://www.vbforums.com/forumdisplay.php?f=25
I will not get any thread. Sorry to bother.
Obviously, look at the source:
<a href="forumdisplay.php?f=8"><strong>API</strong></a>
It needs to parse <a href= instead of www.
Then as usual it was my fault.
I did not check the source. I was just moving the mouse over the link, and the full link appears in the status bar. I'm so silly, lol,
because I thought I could grab the full link from the source.
Then what if I used the WebBrowser control to get the hyperlinks? Is that possible?
|2eM!x, your code is handy, man.
Did anyone see my first post?? Did you visit the links? You just modify the regex expression in the second link: instead of the <p>...</p> tags, change it to the link tags, e.g. <a href=...</a>. You just read the page into a string (first link), then parse out the links using regex (second link)... Conan is going about it the right way in his later post... (no doubt using those examples ;) )
Here is a small sample, using a simple regex expression that displays everything within the <a href=...> block (not including the link description and closing </a> tag), so you can see that it is parsing the right stuff... I do notice that yahoo has some weird links in there, as you will see if you run this sample. This can be easily modified to not include the beginning "<a href=" and the ending ">", but the below is so you can see full text that it is matching on...
VB Code:
Dim returnstring As String
returnstring = SearchPage("http://www.yahoo.com")
Dim Regex As New System.Text.RegularExpressions.Regex("<a href=.*?>")
Dim Mymatches As System.Text.RegularExpressions.MatchCollection = Regex.Matches(returnstring)
For Each FoundMatch As System.Text.RegularExpressions.Match In Mymatches
    MsgBox(FoundMatch.Value)
Next
MessageBox.Show("done!")
Shouldn't it be "<a href=""(.*)"">" ?
Or "<a href=""(.+?)"">" if you do not want to match blank link targets
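To see why the choice between (.*) and (.+?) matters, here is a small sketch that runs both patterns over a made-up line holding two links. The sample string and the module name GreedyVsLazy are assumptions for the demo only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVsLazy
    Sub Main()
        ' Two links on one line, as often happens in real page source.
        Dim html As String = "<a href=""a.php"">A</a> <a href=""b.php"">B</a>"

        ' Greedy: (.*) runs on to the last "> on the line,
        ' so both links are swallowed by a single match.
        Console.WriteLine(Regex.Matches(html, "<a href=""(.*)"">").Count)

        ' Lazy: (.+?) stops at the first "> it can,
        ' so each link is matched separately.
        For Each m As Match In Regex.Matches(html, "<a href=""(.+?)"">")
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

The greedy version reports one match for the two links, while the lazy version lists a.php and b.php on their own. Both patterns still assume double-quoted href values with nothing else inside the tag.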
I was just giving a general example to show that it does work...
Thanks to you all.
Now I am confused about this thread. What shall I mark it, Unresolved or Closed?
you tell us hehe :) do you have it working? Are you having problems with it??
(In reply to gigemboy)
Yes sir,
because not every page source contains the full link, like:
http://www.vbforums.com/showthread.php?t=395225
In the source it looks like:
<a href="showthread.php?t=395225">
But I wonder whether there is a way to get the full link by using the WebBrowser control.
It is because that is the "link" as it is written in the source code... can't change that... it will be the same as viewed in any browser or control...
Works fine for me.
Code:
<a(.*)href="http://(.*)google.com(.*)>(.*)</a>
That will only pull up the links with "google.com" in the name and starting with "http://", which is not what he wanted...
google.com is there as an example :rolleyes: (in reply to gigemboy)
Remove it to get a regular expression that matches links.
But the whole point is that he wants to rebuild the links into something he can just click. Some links do not include the entire address in the href attribute (relative links), so you would have to "build" a clickable link by prepending "http://", the domain name, sometimes the root folder the page is in, etc... so don't "roll" your eyes at me for you misunderstanding what he is wanting :)
View this thread for more info about the same kind of question, as well as an example of a Microsoft screen scraper project that has this type of functionality, for reference...
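One way to do that "building" without gluing strings by hand is to let System.Uri combine the page address with each href. This is only a sketch: the module name ResolveLinks and the sample hrefs (taken from earlier posts in this thread) are chosen for illustration, and it assumes you already know the address of the page the links came from.

```vbnet
Imports System

Module ResolveLinks
    Sub Main()
        ' Address of the page the links were scraped from.
        Dim pageUri As New Uri("http://www.vbforums.com/forumdisplay.php?f=25")

        ' href values as they might appear in the source.
        ' The last one is already absolute and is not rebased onto pageUri.
        Dim hrefs() As String = {"showthread.php?t=395225", "/ads/", "../Images/1.gif", "http://www.site.com/page"}

        For Each href As String In hrefs
            ' Uri(baseUri, relativeString) resolves the href against the page address.
            Dim absolute As New Uri(pageUri, href)
            Console.WriteLine(absolute.AbsoluteUri)
        Next
    End Sub
End Module
```

For the first href this gives the same full address that shows up in the browser status bar when hovering over the link. Anything taken straight from HTML may also contain entities such as &amp;amp;, which this sketch does not decode.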
http://www.vbforums.com/showthread.php?t=396773
I guess he wants to make an app similar to this one: ... http://www.astanda.com