-
Jul 10th, 2006, 02:04 PM
#1
Thread Starter
Lively Member
[02/03] Extracting urls from web page
Dear friends,
I am developing a Windows application that extracts URLs from a specified web page (the page address is supplied at run time). I want to get all the URLs, phone numbers and fax numbers present on that page. Please give me your valuable suggestions.
Regards,
kishore
-
Jul 10th, 2006, 02:51 PM
#2
Member
Re: [02/03] Extracting urls from web page
This won't be too easy... Let's say this code would (try to) do it. (All URLs must begin with http:// or ftp://. Let's also say that you have the whole HTML file dumped into a string s.):
VB Code:
Dim s As String = "" ' put your code to get the page's HTML source here
Dim i As Integer = s.IndexOf("http://")
Do Until i = -1
    ' i is now the position of a URL in s; read it from there
    i = s.IndexOf("http://", i + 1) ' second parameter: continue after the last URL
Loop
I'm not sure this works, as I don't currently have VB installed.
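A minimal sketch of how the loop idea above could be carried through to actually collect the URLs. It has not been run against real pages. The module name UrlScanSketch, the function FindHttpUrls and the list of terminator characters are illustrative assumptions, not something from this thread, and it only looks for http:// (ftp:// would need a second pass).

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module UrlScanSketch
    ' Characters assumed to end a URL inside HTML (an assumption, not a rule).
    Private ReadOnly Terminators As Char() = _
        {""""c, "'"c, "<"c, ">"c, " "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}

    ' Collects every substring that starts with "http://" and runs up to
    ' the next terminator character (or the end of the string).
    Function FindHttpUrls(ByVal s As String) As ArrayList
        Dim found As New ArrayList
        Dim i As Integer = s.IndexOf("http://")
        Do Until i = -1
            Dim last As Integer = s.IndexOfAny(Terminators, i)
            If last = -1 Then last = s.Length
            found.Add(s.Substring(i, last - i))
            ' Resume the search after this URL so the loop always advances.
            If last >= s.Length Then Exit Do
            i = s.IndexOf("http://", last)
        Loop
        Return found
    End Function

    Sub Main()
        Dim html As String = "<a href=""http://example.com/a"">A</a> <img src='http://example.com/b.png'>"
        Dim url As String
        For Each url In FindHttpUrls(html)
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

With the sample string in Main this prints http://example.com/a and http://example.com/b.png. Relative links (href="/page.htm") are not found this way, since they do not contain "http://".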
-
Jul 10th, 2006, 03:08 PM
#3
Fanatic Member
Re: [02/03] Extracting urls from web page
hmm
this one looks pretty juicy
-
Jul 10th, 2006, 03:14 PM
#4
-
Jul 10th, 2006, 03:22 PM
#5
Thread Starter
Lively Member
Re: [02/03] Extracting urls from web page
Thank you, sheikh.
Could you please explain in detail?
Regards,
kishore
-
Jul 10th, 2006, 03:37 PM
#6
Hyperactive Member
Re: [02/03] Extracting urls from web page
Oh yeah, I totally forgot about parsing! Thanks Tokersball_CDXX! You could parse the HTML document: read its source and find the parts that contain "http://". The WebBrowser control exposes the source of the loaded page through its DocumentText property. Create a new project. Add two buttons, a RichTextBox control, a WebBrowser control and a TextBox. Here is the code:
VB Code:
Public Class Form1
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        WebBrowser1.Navigate(TextBox1.Text)
    End Sub

    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        RichTextBox1.Text = WebBrowser1.DocumentText
    End Sub
End Class
Now when you click Button1, it navigates to the URL you typed in the textbox. Once the page has finished loading, click Button2 and the RichTextBox shows the HTML of the page you just went to. Now all you have to do is find the "http://" in the HTML and you are set! I will see what I can do.
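One possible way to do the "find the http://" step is a regular expression over the page source. This is only a rough sketch: the module name FindAbsoluteUrls, the function MatchUrls and the pattern itself are illustrative assumptions, and a console Main with a hard-coded string stands in for the form so that it runs on its own.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FindAbsoluteUrls
    ' Returns the regex matches for absolute http/https/ftp URLs in html.
    ' The pattern is deliberately loose: a URL is taken to run until
    ' whitespace, a quote, or an angle bracket.
    Function MatchUrls(ByVal html As String) As MatchCollection
        Dim pattern As String = "(?:https?|ftp)://[^\s""'<>]+"
        Return Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' In the form above, this string would be WebBrowser1.DocumentText.
        Dim html As String = "<a href=""http://example.com/"">x</a> ftp://files.example.com/readme.txt"
        Dim m As Match
        For Each m In MatchUrls(html)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

In the form, the same loop could append each m.Value to the RichTextBox instead of writing to the console. Like any plain text search, this only sees absolute URLs; relative links would need to be resolved against the page address.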
Last edited by sheikh78; Jul 10th, 2006 at 03:43 PM.
-
Jul 11th, 2006, 08:42 AM
#7
Fanatic Member
Re: [02/03] Extracting urls from web page
As referenced above, a separate class:
VB Code:
Imports System
Imports System.Collections
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Public Class HTMLContentParser

    'Download the page at sURL and return its HTML as a string.
    'On failure the exception message is returned instead.
    Public Function Return_HTMLContent(ByVal sURL As String) As String
        Dim sStream As Stream
        Dim URLReq As HttpWebRequest
        Dim URLRes As HttpWebResponse = Nothing
        Try
            URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
            URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
            sStream = URLRes.GetResponseStream()
            Return New StreamReader(sStream).ReadToEnd()
        Catch ex As Exception
            Return ex.Message
        Finally
            'Release the connection whether or not the read succeeded
            If Not (URLRes Is Nothing) Then URLRes.Close()
        End Try
    End Function

    'Return the href value of every <a> tag, resolved against sURL.
    Public Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList
        'The lazy [^>]*? keeps each match inside one tag, so several links on one line are all found
        rRegEx = New Regex("<a\s[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function

    'Return the src value of every <img> tag, resolved against sURL.
    Public Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList
        rRegEx = New Regex("<img\s[^>]*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function

    'Clean up one captured attribute value and make it absolute.
    Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
        'Find out if the sURL has a "/" after the Domain Name
        'If not, give a "/" at the end
        'First, check out for any slash after the
        'Double Dashes of the http://
        'If there is NO slash, then end the sURL string with a SLASH
        If InStr(8, sURL, "/") = 0 Then
            sURL += "/"
        End If

        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
            If Mid(sURL, iCount, 1) = "/" Then
                sURL = Left(sURL, iCount)
                Exit For
            End If
        Next

        'Filter out the ">" from the Left
        For iCount = 1 To sInput.Length
            If Mid(sInput, iCount, 1) = ">" Then
                sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                Exit For
            End If
        Next

        'Filter out unnecessary Characters
        sInput = sInput.Replace("<", Chr(39))
        sInput = sInput.Replace(">", Chr(39))
        sInput = sInput.Replace("""", "")
        sInput = sInput.Replace("'", "")

        'Relative links get the page's directory put in front of them
        If (sInput.IndexOf("http://") < 0) Then
            If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                Return sURL & "/" & sInput
            Else
                If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) + sInput
                Else
                    Return sURL + sInput
                End If
            End If
        Else
            Return sInput
        End If
    End Function
End Class
Now on your form...
VB Code:
Private Sub Button1_Click_1(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim GetUrls As New HTMLContentParser
    Dim ArrayOfUrls As ArrayList
    ArrayOfUrls = GetUrls.ParseHTMLLinks(GetUrls.Return_HTMLContent("http://www.search.com"), "http://www.search.com")
    Me.ListBox1.DataSource = ArrayOfUrls
End Sub
Works rather nicely. Surely with a little effort you can craft a spider from this example.
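The original question also asked about phone and fax numbers, which the class above does not cover. A hedged sketch of one way to approach it with the same Regex technique follows. The module name PhoneSketch, the function FindPhoneNumbers and above all the pattern are assumptions: it only recognises 3-3-4 digit groups such as (555) 123-4567 or 555.123.4568, so numbers written in other national formats would need a different pattern.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module PhoneSketch
    ' Assumed format: three digits (optionally in parentheses), three digits,
    ' four digits, separated by a space, dot or hyphen.
    Private Const PhonePattern As String = "\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"

    ' Returns every substring of text that looks like a phone or fax number.
    Function FindPhoneNumbers(ByVal text As String) As ArrayList
        Dim found As New ArrayList
        Dim m As Match = Regex.Match(text, PhonePattern)
        While m.Success
            found.Add(m.Value)
            m = m.NextMatch()
        End While
        Return found
    End Function

    Sub Main()
        Dim sample As String = "<p>Tel: (555) 123-4567, Fax: 555.123.4568</p>"
        Dim number As String
        For Each number In FindPhoneNumbers(sample)
            Console.WriteLine(number)
        Next
    End Sub
End Module
```

The string returned by Return_HTMLContent could be passed to FindPhoneNumbers in the same way it is passed to ParseHTMLLinks. Telling a fax number from a phone number would mean looking at the text just before each match (for example the word "Fax"), which this sketch does not attempt.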
-
Jul 11th, 2006, 12:18 PM
#8
Thread Starter
Lively Member
Re: [02/03] Extracting urls from web page
Dear TokersBall_CDXX
Thank you very much. Your work is amazing.
-
Jul 11th, 2006, 12:20 PM
#9
Thread Starter
Lively Member
Re: [02/03] Extracting urls from web page
Dear Sheikh
Thank you very much.
Regards,
kishore
-
Jul 11th, 2006, 12:26 PM
#10
Fanatic Member
Re: [02/03] Extracting urls from web page
Originally Posted by vkkishore_s
Dear TokersBall_CDXX
Thank you very much. Your work is amazing.
The only part that is my work is the form code; the class came from an author at the link referenced above.
-
Jul 11th, 2006, 01:54 PM
#11
Re: [02/03] Extracting urls from web page
Several threads on this forum have addressed this topic. In a few of them I responded with links to a screen scraper project that uses Regex. Some threads:
http://www.vbforums.com/showthread.php?t=396773
http://www.vbforums.com/showthread.php?t=395197
Might be worth a look....