|
-
Nov 4th, 2011, 11:16 AM
#1
Thread Starter
Hyperactive Member
[RESOLVED] Please help with this code !!!!!!!
I found this code but I am developing a small sitemap builder for submit on yahoo so I need save all internal urls of a website in a listbox or richtextbox. Can you help me to change this code?
Code:
Public Function ExtractLinks(ByVal url As String) As DataTable
Dim dt As New DataTable
dt.Columns.Add("LinkText")
dt.Columns.Add("LinkUrl")
Dim wc As New WebClient
Dim html As String = wc.DownloadString(url)
Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>")
For Each match As Match In links
Dim dr As DataRow = dt.NewRow
Dim matchUrl As String = match.Groups(1).Value
'Ignore all anchor links
If matchUrl.StartsWith("#") Then
Continue For
End If
'Ignore all javascript calls
If matchUrl.ToLower.StartsWith("javascript:") Then
Continue For
End If
'Ignore all email links
If matchUrl.ToLower.StartsWith("mailto:") Then
Continue For
End If
'For internal links, build the url mapped to the base address
If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then
matchUrl = MapUrl(url, matchUrl)
End If
'Add the link data to datatable
dr("LinkUrl") = matchUrl
dr("LinkText") = match.Groups(2).Value
dt.Rows.Add(dr)
Next
Return dt
End Function
Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
Dim u As New System.Uri(baseAddress)
If relativePath = "./" Then
relativePath = "/"
End If
If relativePath.StartsWith("/") Then
Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
Else
Dim pathAndQuery As String = u.AbsolutePath
' If the baseAddress contains a file name, like ..../Something.aspx
' Trim off the file name
pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/")
If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then
pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"))
End If
baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery
'If the relativePath contains ../ then
' adjust the baseAddress accordingly
While relativePath.StartsWith("../")
relativePath = relativePath.Substring(3)
If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then
baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/")
End If
End While
Return baseAddress + "/" + relativePath
End If
End Function
Last edited by romanos8; Nov 4th, 2011 at 11:21 AM.
-
Nov 4th, 2011, 10:47 PM
#2
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
Someone to know how resolve this problem?
-
Nov 5th, 2011, 08:20 AM
#3
Re: Please help with this code !!!!!!!
Hey,
The code you posted has a lot of functionality. You say that you "found" it, so I wonder if you actually need all that functionality?
To get you started, does this come close?
(assumes you have a webbrowser control on your form with the webpage loaded into it).
vb Code:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
For Each el As HtmlElement In WebBrowser1.Document.Links
ListBox1.Items.Add(el.GetAttribute("href"))
Next
End Sub
It doesn't return image links as it is, but that should be easy to resolve.
-
Nov 5th, 2011, 08:31 AM
#4
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
Thanks for your help, yes I need all that functionality but I need save the resul in a listbox, so your code will work with this code? or is other way for collect the urls?
I want to build a sitemap builder for submit on yahoo for index webpages.
 Originally Posted by Inferrd
Hey,
The code you posted has a lot of functionality. You say that you "found" it, so I wonder if you actually need all that functionality?
To get you started, does this come close?
(assumes you have a webbrowser control on your form with the webpage loaded into it).
vb Code:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
For Each el As HtmlElement In WebBrowser1.Document.Links
ListBox1.Items.Add(el.GetAttribute("href"))
Next
End Sub
It doesn't return image links as it is, but that should be easy to resolve.
-
Nov 5th, 2011, 08:40 AM
#5
Re: Please help with this code !!!!!!!
OK, that code uses regex and datatables and unfortunately I have very little understanding of either. Sorry
-
Nov 5th, 2011, 09:00 AM
#6
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
Thanks for all your help.
 Originally Posted by Inferrd
OK, that code uses regex and datatables and unfortunately I have very little understanding of either. Sorry 
-
Nov 5th, 2011, 09:06 AM
#7
Addicted Member
Re: Please help with this code !!!!!!!
hi first add a ListView control then run this code
vb Code:
Imports System.Net Imports System.Text.RegularExpressions Public Class Form1 Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load ListView1.FullRowSelect = True ListView1.GridLines = True ListView1.Columns.Add("LinkUrl") ListView1.Columns.Add("LinkText") ListView1.View = View.Details End Sub Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click ExtractLinks("http://www.ebay.com/") For i% = 0 To ListView1.Columns.Count - 1 ListView1.Columns(i).Width = -2 Next i End Sub Public Function ExtractLinks(ByVal url As String) As Boolean Try Dim wc As New WebClient Dim html As String = wc.DownloadString(url) Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>") For Each match As Match In links Dim matchUrl As String = match.Groups(1).Value 'Ignore all anchor links If matchUrl.StartsWith("#") Then Continue For End If 'Ignore all javascript calls If matchUrl.ToLower.StartsWith("javascript:") Then Continue For End If 'Ignore all email links If matchUrl.ToLower.StartsWith("mailto:") Then Continue For End If 'For internal links, build the url mapped to the base address If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then matchUrl = MapUrl(url, matchUrl) End If Dim _lsvItm As New ListViewItem(New String() {matchUrl, match.Groups(2).Value}) ListView1.Items.Add(_lsvItm) Next Return True Catch ex As Exception Return False End Try End Function Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String Dim u As New System.Uri(baseAddress) If relativePath = "./" Then relativePath = "/" End If If relativePath.StartsWith("/") Then Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath Else Dim pathAndQuery As String = u.AbsolutePath ' If the baseAddress contains a file name, like ..../Something.aspx ' Trim off the file name pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/") If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/")) End If baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery While relativePath.StartsWith("../") relativePath = relativePath.Substring(3) If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/") End If End While Return baseAddress + "/" + relativePath End If End Function End Class
-
Nov 5th, 2011, 10:20 AM
#8
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
 Originally Posted by medsont
hi first add a ListView control then run this code
vb Code:
Imports System.Net
Imports System.Text.RegularExpressions
Public Class Form1
Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
ListView1.FullRowSelect = True
ListView1.GridLines = True
ListView1.Columns.Add("LinkUrl")
ListView1.Columns.Add("LinkText")
ListView1.View = View.Details
End Sub
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
ExtractLinks("http://www.ebay.com/")
For i% = 0 To ListView1.Columns.Count - 1
ListView1.Columns(i).Width = -2
Next i
End Sub
Public Function ExtractLinks(ByVal url As String) As Boolean
Try
Dim wc As New WebClient
Dim html As String = wc.DownloadString(url)
Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>")
For Each match As Match In links
Dim matchUrl As String = match.Groups(1).Value
'Ignore all anchor links
If matchUrl.StartsWith("#") Then
Continue For
End If
'Ignore all javascript calls
If matchUrl.ToLower.StartsWith("javascript:") Then
Continue For
End If
'Ignore all email links
If matchUrl.ToLower.StartsWith("mailto:") Then
Continue For
End If
'For internal links, build the url mapped to the base address
If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then
matchUrl = MapUrl(url, matchUrl)
End If
Dim _lsvItm As New ListViewItem(New String() {matchUrl, match.Groups(2).Value})
ListView1.Items.Add(_lsvItm)
Next
Return True
Catch ex As Exception
Return False
End Try
End Function
Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
Dim u As New System.Uri(baseAddress)
If relativePath = "./" Then
relativePath = "/"
End If
If relativePath.StartsWith("/") Then
Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
Else
Dim pathAndQuery As String = u.AbsolutePath
' If the baseAddress contains a file name, like ..../Something.aspx
' Trim off the file name
pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/")
If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then
pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"))
End If
baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery
While relativePath.StartsWith("../")
relativePath = relativePath.Substring(3)
If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then
baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/")
End If
End While
Return baseAddress + "/" + relativePath
End If
End Function
End Class
Thanks this code work like a charm but I have a question , this code harvest unlimited url for example is a site has 5000 url? Thanks.
-
Nov 5th, 2011, 10:27 AM
#9
Addicted Member
Re: Please help with this code !!!!!!!
can u give me that web page link?...
-
Nov 5th, 2011, 10:31 AM
#10
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
 Originally Posted by medsont
can u give me that web page link?...
Is just a example can I use this code with any site included a large website for extract all urls?
-
Nov 5th, 2011, 10:33 AM
#11
Addicted Member
Re: Please help with this code !!!!!!!
Is just a example can I use this code with any site included a large website for extract all urls?
sure.......
-
Nov 5th, 2011, 10:46 AM
#12
Thread Starter
Hyperactive Member
Re: Please help with this code !!!!!!!
-
Nov 5th, 2011, 10:47 AM
#13
Addicted Member
Re: [RESOLVED] Please help with this code !!!!!!!
u r welcome n don't forgot to mark as Resolved this thread......
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|