I am looking for a regular expression that will let me get all lines between "User-agent: *" and a new line. I tried a bunch of things but am not good at regular expressions.
I think a regular expression is too heavy a solution for a task like this. Simply get the line and remove the "User-agent: *" part from it.
Code:
Dim Pattern As String = "User-agent: *"
Dim NewLine As String = LineFromRobotsFile.Replace(Pattern, String.Empty)
Code:
Dim lines() As String = Split(LineFromRobotsFile, vbCrLf)
For Each line As String In lines
    ' In a Like pattern * is a wildcard, so [*] is needed to match a literal asterisk
    If line Like "User-agent: [*]" Then
        MsgBox(line)
        Exit For
    End If
Next
I think this will do the trick. I haven't tested it, though.
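For the original question (all lines between "User-agent: *" and the next blank line), a regular expression can be sketched as follows. This is only a minimal sketch: the module and function names are hypothetical, and it assumes the whole robots.txt file is already in one string and that the block ends at the first empty line or at the end of the file.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module UserAgentBlockSketch
    ' Hypothetical helper: returns the lines that follow "User-agent: *"
    ' up to (not including) the first blank line or the end of the text.
    Function GetWildcardBlock(ByVal robotsText As String) As String()
        ' Group 1 captures consecutive non-empty lines after the header line.
        ' [^\r\n]+ is used instead of .+ because . also matches a carriage return in .NET.
        Dim pattern As String = "^User-agent:[ \t]*\*[ \t]*(?:\r?\n|\z)((?:[^\r\n]+(?:\r?\n|\z))*)"
        Dim m As Match = Regex.Match(robotsText, pattern, RegexOptions.Multiline Or RegexOptions.IgnoreCase)
        If Not m.Success Then Return New String() {}
        ' Break the captured block back into individual lines.
        Return m.Groups(1).Value.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
    End Function
End Module
```

Calling GetWildcardBlock on the downloaded text gives an array of the rule lines in that block, or an empty array if there is no "User-agent: *" line.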
Code:
Private Function IsAllowed(ByVal url As String) As Boolean
    Dim wb As New WebClient
    wb.Headers.Add("user-agent", useragent)
    Dim nocrawl As New List(Of String)
    Dim u As New Uri(url)
    Dim robotsurl As String = "http://" & u.Authority & "/robots.txt"
    Dim data As String = wb.DownloadString(robotsurl)
    Dim lines() As String = data.Split(ControlChars.NewLine)
    Dim applytobot As Boolean = False
    For Each line As String In lines
        If line.Contains("user-agent") Then
            If line.Contains("User-agent: *") OrElse line.Contains("User-agent: " & useragent) Then
                'we want to pay attention to these lines
                applytobot = True
            Else
                applytobot = False
            End If
        ElseIf line.Contains("Disallow") Then
            If applytobot = True Then
                'split on the space
                Dim parts() As String = line.Split(" ")
                Dim abs As String = parts(1)
                If parts(1) = "/" Then
                    'we are disallowed from crawling any url that contains the base url
                    Return False
                Else
                    nocrawl.Add("http://" & u.Authority & parts(1))
                End If
            End If
        End If
    Next
    If nocrawl.Contains(url) = True Then
        Return False
    Else
        Return True
    End If
End Function
This doesn't quite work. I just want to know whether a URL is allowed or not.
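Two things in the function above probably explain the behaviour. First, line.Contains("user-agent") is case-sensitive, so a line such as "User-agent: *" never matches and applytobot stays False. Second, nocrawl.Contains(url) only finds an exact match, while a Disallow rule is meant to block every path that starts with it. The following is a minimal sketch of the matching step only, not a full robots.txt parser: IsPathAllowed and its parameter names are hypothetical, it assumes the file has already been downloaded, and it ignores Allow lines, wildcards, and groups that list several User-agent lines in a row.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RobotsSketch
    ' Hypothetical helper: decides from already-downloaded robots.txt text
    ' whether the given URL path may be crawled by the bot named botName.
    Function IsPathAllowed(ByVal robotsText As String, ByVal path As String, ByVal botName As String) As Boolean
        Dim rules As New List(Of String)
        Dim applies As Boolean = False
        For Each rawLine As String In robotsText.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.None)
            ' Split "Field: value" at the first colon only; skip lines without one.
            Dim parts() As String = rawLine.Split(New Char() {":"c}, 2)
            If parts.Length < 2 Then Continue For
            Dim field As String = parts(0).Trim().ToLowerInvariant()
            Dim value As String = parts(1).Trim()
            If field = "user-agent" Then
                ' The group applies to every bot (*) or to this bot by name.
                applies = (value = "*" OrElse String.Equals(value, botName, StringComparison.OrdinalIgnoreCase))
            ElseIf field = "disallow" AndAlso applies AndAlso value <> "" Then
                ' An empty Disallow value blocks nothing, so it is not stored.
                rules.Add(value)
            End If
        Next
        ' A rule blocks every path that begins with it, so "/" blocks the whole site.
        For Each rule As String In rules
            If path.StartsWith(rule, StringComparison.Ordinal) Then Return False
        Next
        Return True
    End Function
End Module
```

Inside IsAllowed this could be called as IsPathAllowed(data, u.AbsolutePath, useragent), which compares only the path part of the URL against the rules.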