Thread: [02/03] Extracting URLs from web page

  1. #1

    Thread Starter
    Lively Member
    Join Date
    Jun 2006
    Posts
    116

    [02/03] Extracting URLs from web page

    Dear friends,

    I am developing a Windows application. The application extracts URLs from a specified web page (the page address is supplied at run time). I want to get all the URLs, phone numbers, and fax numbers present on that page. Friends, please give me your valuable suggestions.

    regards
    kishore

  2. #2
    Member
    Join Date
    Oct 2005
    Location
    C:\Downloads at 87.105.109.239
    Posts
    39

    Re: [02/03] Extracting URLs from web page

    This won't be too easy. Let's say this code would (try to) do it. It assumes every URL begins with http:// (ftp:// links would need a second pass of the same kind) and that you have the whole HTML file dumped into a string s:
    VB Code:
    Dim s As String = "" ' put your code to get the site's HTML source here ;p
    Dim i As Integer = s.IndexOf("http://")
    Do Until i = -1
        ' a URL starts at position i; read it from there
        i = s.IndexOf("http://", i + 7) ' search again from just past the last URL :P
    Loop
    And I'm not sure this works, as I don't currently have VB installed.
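    To show what reading the URL at each position could look like, here is a minimal sketch of the same IndexOf idea that also collects what it finds. The module name, the function name FindHttpUrls, and the choice of stop characters (quotes, whitespace, "<" and ">") are assumptions for illustration, not something from the posts above.

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module UrlScanSketch
    ' Hypothetical helper: collects every substring that starts with "http://"
    ' and runs up to the next quote, whitespace, "<" or ">" (assumed delimiters).
    Public Function FindHttpUrls(ByVal s As String) As ArrayList
        Dim found As New ArrayList
        Dim stopChars() As Char = {""""c, "'"c, " "c, "<"c, ">"c, _
                                   ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim i As Integer = s.IndexOf("http://")
        Do Until i = -1
            ' The URL ends at the first stop character, or at the end of the text.
            Dim finish As Integer = s.IndexOfAny(stopChars, i)
            If finish = -1 Then finish = s.Length
            found.Add(s.Substring(i, finish - i))
            ' Resume after this URL so the loop always moves forward.
            i = s.IndexOf("http://", finish)
        Loop
        Return found
    End Function
End Module
```

    This only sees absolute http:// links; relative links such as "page2.html" are skipped entirely.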
    If my post is helpful please RATE IT! Thanks

  3. #3
    Fanatic Member TokersBall_CDXX's Avatar
    Join Date
    Mar 2003
    Location
    America
    Posts
    571

    Re: [02/03] Extracting URLs from web page

    hmm
    this one looks pretty juicy
    Build your own personalized flash based chat room for your webpage for FREE! http://www.4computerheaven.com

  4. #4
    Hyperactive Member sheikh78's Avatar
    Join Date
    Apr 2006
    Location
    C:/
    Posts
    423

    Re: [02/03] Extracting URLs from web page

    So you are trying to find all the URLs on a specified web page. kartam's code didn't work when I tried it on Google, one of the simplest websites: the loop never stopped, because each search has to start past the previous match or IndexOf just finds the same one again. If you want all the links on the page, open the page in a WebBrowser control and work from its Document property.
    "Imagination is more important than knowledge" - Albert Einstein, born on March 14th 1879.
    Can't find it here on VBForums? Go to the CodeProject. MSDN is your friend . I have such a bad website, my friend decided it would be funny to change the template and he moderates the site for me: visit my site!

    "Thinking of you, wherever you are
    We pray for our sorrows to end, and hope that our hearts will blend.
    Now I will step forward to realize this wish.
    And who knows, starting a new journey may not be so hard…
    Or maybe it has already begun.
    There are many worlds, but they share the same sky
    one sky, one destiny..."

  5. #5

    Thread Starter
    Lively Member
    Join Date
    Jun 2006
    Posts
    116

    Re: [02/03] Extracting URLs from web page

    Thank you, Sheikh.

    Could you please explain in detail?


    regards
    kishore

  6. #6
    Hyperactive Member sheikh78's Avatar
    Join Date
    Apr 2006
    Location
    C:/
    Posts
    423

    Re: [02/03] Extracting URLs from web page

    Oh yeah, I totally forgot about parsing! Thanks, TokersBall_CDXX! You could parse the HTML document: read its source and find the parts that contain "http://". The WebBrowser control exposes the source of the loaded page through its DocumentText property. Create a new project and add two buttons, a RichTextBox control, a WebBrowser control, and a TextBox. Here is the code:

    VB Code:
    Public Class Form1

        ' Button1: load the address typed into TextBox1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            WebBrowser1.Navigate(TextBox1.Text)
        End Sub

        ' Button2: show the HTML source of the loaded page
        Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
            RichTextBox1.Text = WebBrowser1.DocumentText
        End Sub
    End Class

    Now when you click Button1, the browser navigates to the URL you typed in the text box. Once the page has finished loading, click Button2 and the RichTextBox will show the HTML of that page. Now all you have to do is find the "http://" in the HTML and you are set! I will see what I can do.
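    One way to do that last step is a Regex pass over the HTML string. This is only a rough sketch: the module and function names are made up for illustration, and the pattern is an assumption that simply takes everything from http:// or https:// up to the next quote, whitespace, or angle bracket.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module DocumentTextSketch
    ' Hypothetical helper: returns each distinct absolute URL found in the HTML text.
    Public Function ExtractAbsoluteUrls(ByVal html As String) As ArrayList
        Dim urls As New ArrayList
        ' Assumed pattern: scheme, then any run of characters that cannot
        ' appear unescaped in an attribute value or between tags.
        Dim m As Match = Regex.Match(html, "https?://[^\s""'<>]+", RegexOptions.IgnoreCase)
        While m.Success
            ' Skip URLs that were already collected.
            If Not urls.Contains(m.Value) Then urls.Add(m.Value)
            m = m.NextMatch()
        End While
        Return urls
    End Function
End Module
```

    In the form above you could call it from Button2_Click, for example ListBox1.DataSource = ExtractAbsoluteUrls(WebBrowser1.DocumentText), where ListBox1 is an extra list box you would have to add yourself.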
    Last edited by sheikh78; Jul 10th, 2006 at 03:43 PM.

  7. #7
    Fanatic Member TokersBall_CDXX's Avatar
    Join Date
    Mar 2003
    Location
    America
    Posts
    571

    Re: [02/03] Extracting URLs from web page

    As referenced above, a separate class:
    VB Code:
    Imports System
    Imports System.Collections
    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions
    Imports Microsoft.VisualBasic
    Public Class HTMLContentParser
        ' Downloads the page at sURL and returns its HTML source.
        ' On failure the exception message is returned in place of the HTML.
        Public Function Return_HTMLContent(ByVal sURL As String) As String
            Dim URLReq As HttpWebRequest
            Dim URLRes As HttpWebResponse = Nothing
            Try
                URLReq = DirectCast(WebRequest.Create(sURL), HttpWebRequest)
                URLRes = DirectCast(URLReq.GetResponse(), HttpWebResponse)
                Dim sStream As Stream = URLRes.GetResponseStream()
                Return New StreamReader(sStream).ReadToEnd()
            Catch ex As Exception
                Return ex.Message
            Finally
                ' Release the connection even when the download fails
                If Not URLRes Is Nothing Then URLRes.Close()
            End Try
        End Function
        ' Returns the href target of every <a> tag, made absolute against sURL.
        Public Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList
            ' [^>]*? keeps the match inside one <a ...> tag, so several links on the same line are all found
            rRegEx = New Regex("<a\s[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function
        ' Returns the src target of every <img> tag, made absolute against sURL.
        Public Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList
            rRegEx = New Regex("<img\s[^>]*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function
        ' Cleans up a captured href/src value and joins it to sURL when it is relative.
        Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
            'Find out if the sURL has a "/" after the Domain Name
            'If not, give a "/" at the end
            'First, check out for any slash after the
            'Double Dashes of the http://
            'If there is NO slash, then end the sURL string with a SLASH
            If InStr(8, sURL, "/") = 0 Then
                sURL += "/"
            End If
            'FILTERING
            'Filter down to the Domain Name Directory from the Right
            Dim iCount As Integer
            For iCount = sURL.Length To 1 Step -1
                If Mid(sURL, iCount, 1) = "/" Then
                    sURL = Left(sURL, iCount)
                    Exit For
                End If
            Next
            'Filter out the ">" from the Left
            '(compare one character at a time; a 4-character slice can never equal ">")
            For iCount = 1 To sInput.Length
                If Mid(sInput, iCount, 1) = ">" Then
                    sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                    Exit For
                End If
            Next
            'Filter out unnecessary Characters
            sInput = sInput.Replace("<", Chr(39))
            sInput = sInput.Replace(">", Chr(39))
            sInput = sInput.Replace("""", "")
            sInput = sInput.Replace("'", "")
            'Anything without "http://" is treated as relative to sURL
            If (sInput.IndexOf("http://") < 0) Then
                If (Not (sInput.StartsWith("/")) AndAlso Not (sURL.EndsWith("/"))) Then
                    Return sURL & "/" & sInput
                Else
                    If (sInput.StartsWith("/")) AndAlso (sURL.EndsWith("/")) Then
                        Return sURL.Substring(0, sURL.Length - 1) + sInput
                    Else
                        Return sURL + sInput
                    End If
                End If
            Else
                Return sInput
            End If
        End Function
    End Class


    Now on your form:

    VB Code:
    Private Sub Button1_Click_1(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim GetUrls As New HTMLContentParser
        Dim sPage As String = "http://www.search.com"
        ' Download the page, then collect the target of every link on it
        Dim ArrayOfUrls As ArrayList = GetUrls.ParseHTMLLinks(GetUrls.Return_HTMLContent(sPage), sPage)
        Me.ListBox1.DataSource = ArrayOfUrls
    End Sub

    Works rather nicely; with a little effort you can surely craft a spider from this example.
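    One possible refinement, offered as a sketch rather than as part of the class above: ProcessURL joins the page address and the link by hand, while the framework's System.Uri class can resolve a relative link against a base address. The module and the function name ResolveLink are made up for illustration.

```vbnet
Imports System

Module UriResolveSketch
    ' Hypothetical helper: resolves a link found on a page against that page's address.
    ' Absolute links come back unchanged; relative ones are combined with the base.
    ' Malformed input makes the Uri constructor throw UriFormatException.
    Public Function ResolveLink(ByVal pageUrl As String, ByVal link As String) As String
        Dim baseUri As New Uri(pageUrl)
        Dim resolved As New Uri(baseUri, link)
        Return resolved.AbsoluteUri
    End Function
End Module
```

    For example, ResolveLink("http://www.search.com/dir/page.html", "/top.html") gives "http://www.search.com/top.html", and an https:// link passes through untouched, which the "http://" test in ProcessURL does not handle.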

  8. #8

    Thread Starter
    Lively Member
    Join Date
    Jun 2006
    Posts
    116

    Re: [02/03] Extracting URLs from web page

    Dear TokersBall_CDXX

    Thank you very much. Your work is amazing.

  9. #9

    Thread Starter
    Lively Member
    Join Date
    Jun 2006
    Posts
    116

    Re: [02/03] Extracting URLs from web page

    Dear Sheikh

    Thank you very much.

    Regards
    kishore

  10. #10
    Fanatic Member TokersBall_CDXX's Avatar
    Join Date
    Mar 2003
    Location
    America
    Posts
    571

    Re: [02/03] Extracting URLs from web page

    Quote Originally Posted by vkkishore_s
    Dear TokersBall_CDXX

    Thank you very much. Your work is amazing.
    The only part that is my work is the form code; the class came from an author at the link referenced above.

  11. #11
    PowerPoster
    Join Date
    Aug 2005
    Location
    College Station, TX
    Posts
    4,521

    Re: [02/03] Extracting URLs from web page

    Several threads on this forum have already addressed this topic. In a few of them I posted links to a screen scraper project that uses Regex. Some of those threads:

    http://www.vbforums.com/showthread.php?t=396773
    http://www.vbforums.com/showthread.php?t=395197

    Might be worth a look....
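    The original question also asked about phone and fax numbers, and the same Regex approach might stretch to those. A rough sketch follows, with loud assumptions: the numbers are taken to be in the ten-digit North American style such as (979) 555-0100 or 979-555-0100, and the module and function names are invented for illustration. Other countries' formats would need their own patterns.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module PhoneNumberSketch
    ' Hypothetical helper: finds ten-digit numbers written as (979) 555-0100,
    ' 979-555-0100, 979.555.0100 or 979 555 0100 (an assumed set of formats).
    Public Function FindPhoneNumbers(ByVal text As String) As ArrayList
        Dim numbers As New ArrayList
        ' (?<!\d) and (?!\d) stop the pattern from matching inside a longer digit run.
        Dim pattern As String = "(?<!\d)\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}(?!\d)"
        Dim m As Match = Regex.Match(text, pattern)
        While m.Success
            numbers.Add(m.Value)
            m = m.NextMatch()
        End While
        Return numbers
    End Function
End Module
```

    Telling a fax number from a phone number would have to rely on nearby text such as "Fax:", which this sketch does not attempt.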
