
Thread: robots.txt regular expression

  1. #1

    Thread Starter
    Hyperactive Member Grunt's Avatar
    Join Date
    Oct 2004
    Location
    Las Vegas
    Posts
    499

    robots.txt regular expression

    I am looking for a regular expression that will give me all the lines between "User-agent: *" and the next blank line. I tried a bunch of things, but I am not good at regular expressions.

  2. #2
    PowerPoster cicatrix's Avatar
    Join Date
    Dec 2009
    Location
    Moscow, Russia
    Posts
    3,654

    Re: robots.txt regular expression

    I think a regular expression is too heavy a solution for a task like this. Simply get the line and remove the "User-agent: *" part from it.

    Code:
    Dim Pattern As String = "User-agent: *"
    Dim NewLine As String = LineFromRobotsFile.Replace(Pattern, String.Empty)
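
    If a regular expression is still wanted, the following is a minimal sketch of one way to capture the lines that follow "User-agent: *" up to the next blank line. The module and function names (RobotsRegexSketch, GetWildcardBlock) are made up for illustration, and the sketch assumes the separating line is truly empty rather than whitespace-only.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RobotsRegexSketch
    ' Hypothetical helper: returns the lines that follow "User-agent: *"
    ' up to the next blank line (or the end of the text).
    Function GetWildcardBlock(ByVal robotsText As String) As String()
        ' Normalize line endings so the pattern only has to deal with LF.
        Dim text As String = robotsText.Replace(vbCrLf, vbLf).Replace(vbCr, vbLf)

        ' \* matches a literal asterisk. The capture group collects every
        ' following non-empty line and stops at the first empty one.
        Dim pattern As String = "^User-agent:[ \t]*\*[ \t]*\n((?:[^\n]+(?:\n|\z))*)"
        Dim m As Match = Regex.Match(text, pattern, RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        If Not m.Success Then Return New String() {}
        Return m.Groups(1).Value.Split(New Char() {ControlChars.Lf}, StringSplitOptions.RemoveEmptyEntries)
    End Function
End Module
```

    Calling GetWildcardBlock on the downloaded robots.txt text returns the directive lines of the wildcard group as an array, or an empty array when no such group is found.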

  3. #3
    VB Addict Pradeep1210's Avatar
    Join Date
    Apr 2004
    Location
    Inside the CPU...
    Posts
    6,614

    Re: robots.txt regular expression

    Code:
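    ' Split the robots.txt text into lines (CRLF line endings assumed).
    Dim lines() As String = Split(LineFromRobotsFile, vbCrLf)
    For Each line As String In lines
        ' In a Like pattern, * is a wildcard; [*] matches a literal asterisk.
        If line Like "User-agent: [*]" Then
            MsgBox(line)
            Exit For
        End If
    Next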
    Pradeep, Microsoft MVP (Visual Basic)

  4. #4

    Thread Starter
    Hyperactive Member Grunt's Avatar
    Join Date
    Oct 2004
    Location
    Las Vegas
    Posts
    499

    Re: robots.txt regular expression

    I think this will do the trick. I haven't tested it, though.

    Code:
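    ' Needs Imports System.Net and Imports System.Collections.Generic.
    ' useragent is assumed to be a class-level String field.
    Private Function IsAllowed(ByVal url As String) As Boolean
        Dim wb As New WebClient
        wb.Headers.Add("user-agent", useragent)

        ' Path prefixes we are not allowed to crawl
        Dim nocrawl As New List(Of String)

        Dim u As New Uri(url)
        Dim robotsurl As String = "http://" & u.Authority & "/robots.txt"

        Dim data As String = wb.DownloadString(robotsurl)

        ' Split on any line-ending style; blank lines are dropped
        Dim lines() As String = data.Split(New String() {vbCrLf, vbLf, vbCr}, StringSplitOptions.RemoveEmptyEntries)

        Dim applytobot As Boolean = False

        For Each rawline As String In lines
            Dim line As String = rawline.Trim()
            If line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase) Then
                ' We want to pay attention only to groups addressed to every bot or to us
                Dim agent As String = line.Substring("User-agent:".Length).Trim()
                applytobot = (agent = "*" OrElse String.Equals(agent, useragent, StringComparison.OrdinalIgnoreCase))
            ElseIf applytobot AndAlso line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase) Then
                Dim rule As String = line.Substring("Disallow:".Length).Trim()
                ' An empty Disallow value means nothing is disallowed
                If rule <> "" Then nocrawl.Add(rule)
            End If
        Next

        ' A URL is disallowed when its path starts with any disallowed prefix;
        ' "Disallow: /" therefore blocks every URL on the site
        For Each prefix As String In nocrawl
            If u.AbsolutePath.StartsWith(prefix, StringComparison.Ordinal) Then
                Return False
            End If
        Next

        Return True
    End Function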

  5. #5

    Thread Starter
    Hyperactive Member Grunt's Avatar
    Join Date
    Oct 2004
    Location
    Las Vegas
    Posts
    499

    Re: robots.txt regular expression

    This doesn't quite work. I just want to know whether a URL is allowed or not.
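
    For reference, a minimal sketch of that yes/no check, kept separate from the download so it can be tried on a literal string. The names RobotsCheckSketch, IsPathAllowed, botName, and urlPath are made up for illustration. It assumes Disallow values are plain path prefixes and ignores Allow lines, wildcards, and groups that list several User-agent lines in a row.

```vb
Imports System
Imports Microsoft.VisualBasic

Module RobotsCheckSketch
    ' Hypothetical helper: True when urlPath is not covered by a Disallow rule
    ' in a group addressed to "*" or to botName.
    Function IsPathAllowed(ByVal robotsText As String, ByVal botName As String, ByVal urlPath As String) As Boolean
        Dim separators() As String = {vbCrLf, vbLf, vbCr}
        Dim applies As Boolean = False

        For Each rawLine As String In robotsText.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            ' Drop trailing comments and surrounding whitespace.
            Dim line As String = rawLine
            Dim hashAt As Integer = line.IndexOf("#"c)
            If hashAt >= 0 Then line = line.Substring(0, hashAt)
            line = line.Trim()

            If line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase) Then
                ' Each User-agent line decides whether the rules after it apply to us.
                Dim agent As String = line.Substring("User-agent:".Length).Trim()
                applies = (agent = "*" OrElse String.Equals(agent, botName, StringComparison.OrdinalIgnoreCase))
            ElseIf applies AndAlso line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase) Then
                Dim rule As String = line.Substring("Disallow:".Length).Trim()
                ' An empty rule disallows nothing; otherwise it is a path prefix.
                If rule <> "" AndAlso urlPath.StartsWith(rule, StringComparison.Ordinal) Then
                    Return False
                End If
            End If
        Next

        Return True
    End Function
End Module
```

    The path to pass in can be taken from the URL with New Uri(url).AbsolutePath, and the robots.txt text from WebClient.DownloadString as in the earlier post.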
