Results 1 to 13 of 13

Thread: [RESOLVED] Please help with this code !!!!!!!

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Resolved [RESOLVED] Please help with this code !!!!!!!

    I found this code but I am developing a small sitemap builder for submit on yahoo so I need save all internal urls of a website in a listbox or richtextbox. Can you help me to change this code?

    Code:
    Public Function ExtractLinks(ByVal url As String) As DataTable
            Dim dt As New DataTable
            dt.Columns.Add("LinkText")
            dt.Columns.Add("LinkUrl")
    
            Dim wc As New WebClient
            Dim html As String = wc.DownloadString(url)
    
            Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>")
    
            For Each match As Match In links
                Dim dr As DataRow = dt.NewRow
                Dim matchUrl As String = match.Groups(1).Value
                'Ignore all anchor links
                If matchUrl.StartsWith("#") Then
                    Continue For
                End If
                'Ignore all javascript calls
                If matchUrl.ToLower.StartsWith("javascript:") Then
                    Continue For
                End If
                'Ignore all email links
                If matchUrl.ToLower.StartsWith("mailto:") Then
                    Continue For
                End If
                'For internal links, build the url mapped to the base address
                If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then
                    matchUrl = MapUrl(url, matchUrl)
                End If
                'Add the link data to datatable
                dr("LinkUrl") = matchUrl
                dr("LinkText") = match.Groups(2).Value
                dt.Rows.Add(dr)
            Next
    
            Return dt
        End Function
    
        Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
    
            Dim u As New System.Uri(baseAddress)
    
            If relativePath = "./" Then
                relativePath = "/"
            End If
    
            If relativePath.StartsWith("/") Then
                Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
            Else
                Dim pathAndQuery As String = u.AbsolutePath
                ' If the baseAddress contains a file name, like ..../Something.aspx
                ' Trim off the file name
                pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/")
                If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then
                    pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"))
                End If
                baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery
    
                'If the relativePath contains ../ then
                ' adjust the baseAddress accordingly
    
                While relativePath.StartsWith("../")
                    relativePath = relativePath.Substring(3)
                    If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then
                        baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/")
                    End If
                End While
    
                Return baseAddress + "/" + relativePath
            End If
    
        End Function
    Last edited by romanos8; Nov 4th, 2011 at 11:21 AM.

  2. #2

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Re: Please help with this code !!!!!!!

    Someone to know how resolve this problem?

  3. #3
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,335

    Re: Please help with this code !!!!!!!

    Hey,
    The code you posted has a lot of functionality. You say that you "found" it, so I wonder if you actually need all that functionality?

    To get you started, does this come close?
    (assumes you have a webbrowser control on your form with the webpage loaded into it).
    vb Code:
    1. Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    2.     For Each el As HtmlElement In WebBrowser1.Document.Links
    3.         ListBox1.Items.Add(el.GetAttribute("href"))
    4.     Next
    5. End Sub

    It doesn't return image links as it is, but that should be easy to resolve.

  4. #4

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Exclamation Re: Please help with this code !!!!!!!

    Thanks for your help, yes I need all that functionality but I need save the resul in a listbox, so your code will work with this code? or is other way for collect the urls?
    I want to build a sitemap builder for submit on yahoo for index webpages.
    Quote Originally Posted by Inferrd View Post
    Hey,
    The code you posted has a lot of functionality. You say that you "found" it, so I wonder if you actually need all that functionality?

    To get you started, does this come close?
    (assumes you have a webbrowser control on your form with the webpage loaded into it).
    vb Code:
    1. Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    2.     For Each el As HtmlElement In WebBrowser1.Document.Links
    3.         ListBox1.Items.Add(el.GetAttribute("href"))
    4.     Next
    5. End Sub

    It doesn't return image links as it is, but that should be easy to resolve.

  5. #5
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,335

    Re: Please help with this code !!!!!!!

    OK, that code uses regex and datatables and unfortunately I have very little understanding of either. Sorry

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Re: Please help with this code !!!!!!!

    Thanks for all your help.
    Quote Originally Posted by Inferrd View Post
    OK, that code uses regex and datatables and unfortunately I have very little understanding of either. Sorry

  7. #7
    Addicted Member
    Join Date
    Nov 2010
    Location
    TamilNadu, India
    Posts
    249

    Re: Please help with this code !!!!!!!

    hi first add a ListView control then run this code

    vb Code:
    1. Imports System.Net
    2. Imports System.Text.RegularExpressions
    3. Public Class Form1
    4.     Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    5.         ListView1.FullRowSelect = True
    6.         ListView1.GridLines = True
    7.         ListView1.Columns.Add("LinkUrl")
    8.         ListView1.Columns.Add("LinkText")
    9.         ListView1.View = View.Details
    10.     End Sub
    11.     Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    12.         ExtractLinks("http://www.ebay.com/")
    13.         For i% = 0 To ListView1.Columns.Count - 1
    14.             ListView1.Columns(i).Width = -2
    15.         Next i
    16.     End Sub
    17.     Public Function ExtractLinks(ByVal url As String) As Boolean
    18.         Try
    19.             Dim wc As New WebClient
    20.             Dim html As String = wc.DownloadString(url)
    21.             Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>")
    22.  
    23.             For Each match As Match In links
    24.                 Dim matchUrl As String = match.Groups(1).Value
    25.                 'Ignore all anchor links
    26.                 If matchUrl.StartsWith("#") Then
    27.                     Continue For
    28.                 End If
    29.                 'Ignore all javascript calls
    30.                 If matchUrl.ToLower.StartsWith("javascript:") Then
    31.                     Continue For
    32.                 End If
    33.                 'Ignore all email links
    34.                 If matchUrl.ToLower.StartsWith("mailto:") Then
    35.                     Continue For
    36.                 End If
    37.                 'For internal links, build the url mapped to the base address
    38.                 If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then
    39.                     matchUrl = MapUrl(url, matchUrl)
    40.                 End If
    41.                 Dim _lsvItm As New ListViewItem(New String() {matchUrl, match.Groups(2).Value})
    42.                 ListView1.Items.Add(_lsvItm)
    43.             Next
    44.             Return True
    45.         Catch ex As Exception
    46.             Return False
    47.         End Try
    48.     End Function
    49.     Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
    50.         Dim u As New System.Uri(baseAddress)
    51.         If relativePath = "./" Then
    52.             relativePath = "/"
    53.         End If
    54.  
    55.         If relativePath.StartsWith("/") Then
    56.             Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
    57.         Else
    58.             Dim pathAndQuery As String = u.AbsolutePath
    59.             ' If the baseAddress contains a file name, like ..../Something.aspx
    60.             ' Trim off the file name
    61.             pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/")
    62.             If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then
    63.                 pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"))
    64.             End If
    65.             baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery
    66.             While relativePath.StartsWith("../")
    67.                 relativePath = relativePath.Substring(3)
    68.                 If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then
    69.                     baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/")
    70.                 End If
    71.             End While
    72.             Return baseAddress + "/" + relativePath
    73.         End If
    74.     End Function
    75. End Class

  8. #8

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Question Re: Please help with this code !!!!!!!

    Quote Originally Posted by medsont View Post
    hi first add a ListView control then run this code

    vb Code:
    1. Imports System.Net
    2. Imports System.Text.RegularExpressions
    3. Public Class Form1
    4.     Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    5.         ListView1.FullRowSelect = True
    6.         ListView1.GridLines = True
    7.         ListView1.Columns.Add("LinkUrl")
    8.         ListView1.Columns.Add("LinkText")
    9.         ListView1.View = View.Details
    10.     End Sub
    11.     Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    12.         ExtractLinks("http://www.ebay.com/")
    13.         For i% = 0 To ListView1.Columns.Count - 1
    14.             ListView1.Columns(i).Width = -2
    15.         Next i
    16.     End Sub
    17.     Public Function ExtractLinks(ByVal url As String) As Boolean
    18.         Try
    19.             Dim wc As New WebClient
    20.             Dim html As String = wc.DownloadString(url)
    21.             Dim links As MatchCollection = Regex.Matches(html, "<a.*?href=""(.*?)"".*?>(.*?)</a>")
    22.  
    23.             For Each match As Match In links
    24.                 Dim matchUrl As String = match.Groups(1).Value
    25.                 'Ignore all anchor links
    26.                 If matchUrl.StartsWith("#") Then
    27.                     Continue For
    28.                 End If
    29.                 'Ignore all javascript calls
    30.                 If matchUrl.ToLower.StartsWith("javascript:") Then
    31.                     Continue For
    32.                 End If
    33.                 'Ignore all email links
    34.                 If matchUrl.ToLower.StartsWith("mailto:") Then
    35.                     Continue For
    36.                 End If
    37.                 'For internal links, build the url mapped to the base address
    38.                 If Not matchUrl.StartsWith("http://") And Not matchUrl.StartsWith("https://") Then
    39.                     matchUrl = MapUrl(url, matchUrl)
    40.                 End If
    41.                 Dim _lsvItm As New ListViewItem(New String() {matchUrl, match.Groups(2).Value})
    42.                 ListView1.Items.Add(_lsvItm)
    43.             Next
    44.             Return True
    45.         Catch ex As Exception
    46.             Return False
    47.         End Try
    48.     End Function
    49.     Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
    50.         Dim u As New System.Uri(baseAddress)
    51.         If relativePath = "./" Then
    52.             relativePath = "/"
    53.         End If
    54.  
    55.         If relativePath.StartsWith("/") Then
    56.             Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
    57.         Else
    58.             Dim pathAndQuery As String = u.AbsolutePath
    59.             ' If the baseAddress contains a file name, like ..../Something.aspx
    60.             ' Trim off the file name
    61.             pathAndQuery = pathAndQuery.Split("?")(0).TrimEnd("/")
    62.             If pathAndQuery.Split("/")(pathAndQuery.Split("/").Count - 1).Contains(".") Then
    63.                 pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"))
    64.             End If
    65.             baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery
    66.             While relativePath.StartsWith("../")
    67.                 relativePath = relativePath.Substring(3)
    68.                 If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//" + 2) Then
    69.                     baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/")
    70.                 End If
    71.             End While
    72.             Return baseAddress + "/" + relativePath
    73.         End If
    74.     End Function
    75. End Class
    Thanks this code work like a charm but I have a question , this code harvest unlimited url for example is a site has 5000 url? Thanks.

  9. #9
    Addicted Member
    Join Date
    Nov 2010
    Location
    TamilNadu, India
    Posts
    249

    Re: Please help with this code !!!!!!!

    a site has 5000 url?
    can u give me that web page link?...

  10. #10

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Re: Please help with this code !!!!!!!

    Quote Originally Posted by medsont View Post
    can u give me that web page link?...
    Is just a example can I use this code with any site included a large website for extract all urls?

  11. #11
    Addicted Member
    Join Date
    Nov 2010
    Location
    TamilNadu, India
    Posts
    249

    Re: Please help with this code !!!!!!!

    Is just a example can I use this code with any site included a large website for extract all urls?
    sure.......

  12. #12

    Thread Starter
    Hyperactive Member
    Join Date
    Jul 2011
    Posts
    294

    Re: Please help with this code !!!!!!!

    Thanks you ver much

  13. #13
    Addicted Member
    Join Date
    Nov 2010
    Location
    TamilNadu, India
    Posts
    249

    Re: [RESOLVED] Please help with this code !!!!!!!

    u r welcome n don't forgot to mark as Resolved this thread......

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width