    The Basic Principles of Lexical Analysis

    Prerequisites
    You must have a solid understanding of the following:
    • The principles of your desired programming language.
    • Common programming terms, e.g. iterate, conditional statement, etc.
    • If you are not using Visual Basic .NET, how the code I present translates into your language.


    Terminology
    I will be using some terms throughout this tutorial that are not used in everyday coding so I would like to define how I will be using them:
    • Lexeme – The most basic unit in a given source code e.g. in the source code:
      Code:
      Dim foo As Integer = 1
      Each word, symbol, and number represents a lexeme.
    • Token – A key/value pair where the key is typically a rule name and the value is typically a lexeme e.g. in the source code:
      Code:
      Dim foo As Integer = 1
      The equal sign would likely be a token where the key is “assignment” and the value is “=” (a full token stream for this statement is sketched just after this list).
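    To illustrate (the category names here are hypothetical and only meant as an example), the statement above might be broken into the following key/value pairs:
    Code:
    ' Hypothetical token stream for: Dim foo As Integer = 1
    ' keyword    -> "Dim"
    ' identifier -> "foo"
    ' keyword    -> "As"
    ' type       -> "Integer"
    ' assignment -> "="
    ' integer    -> "1"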


    Prefix
    Programming Language: Visual Basic .NET using the .NET framework 4.5
    Target Language: JSON
    Why JSON:
    • JSON’s syntax makes it one of the easiest languages to scan/parse.
    • The syntax will likely stay constant for a very long time.


    Lexical Analysis - Basics
    Lexical analysis, also known as scanning, is the process of reading the source code and breaking it up into lexemes, which are then grouped into meaningful tokens.

    The first step in the lexical analysis phase is to create a Token class. This class can be as simple as two String properties: Category and Value.

    The Category property will represent the group that a lexeme belongs to.

    The Value property will represent the lexeme value found.

    Token
    Code:
    Public Class Token
    
        Private _category As String
        Public Property Category As String
            Get
                Return _category
            End Get
            Set(ByVal value As String)
                If Not String.Equals(_category, value) Then
                    _category = value
                    OnCategoryChanged()
                End If
            End Set
    
        Private _value As String
        Public Property Value As String
            Get
                Return _value
            End Get
            Set(ByVal value As String)
                If Not String.Equals(_value, value) Then
                    _value = value
                    OnValueChanged()
                End If
            End Set
    
        Protected Overridable Sub OnCategoryChanged()
            RaiseEvent CategoryChanged(Me, EventArgs.Empty)
        End Sub
    
        Protected Overridable Sub OnValueChanged()
            RaiseEvent ValueChanged(Me, EventArgs.Empty)
        End Sub
    
        Public Event CategoryChanged(ByVal sender As Object, ByVal e As EventArgs)
        Public Event ValueChanged(ByVal sender As Object, ByVal e As EventArgs)
    
        Public Sub New()
            _category = String.Empty
            _value = String.Empty
        End Sub
    
        Public Sub New(ByVal category As String, ByVal value As String)
            _category = category
            _value = value
        End Sub
    
    End Class
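    As a quick aside, the Token class can be exercised on its own; the values here are arbitrary and only demonstrate the constructor and properties:
    Code:
    ' Arbitrary values, just to show the Token class in use
    Dim t As New Token("string", "Hello World")
    Console.WriteLine("{0}: {1}", t.Category, t.Value) ' prints string: Hello World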
    Lexical Analysis - Categories
    Now that there is a structure in place to hold the tokens, we need to create functions that will match lexemes based on their category.

    In JSON there are 5 different categories:
    1. String
    2. Number
    3. Boolean
    4. Symbols
    5. Whitespace


    The following image is a railroad diagram of a String as defined by the JSON language:
    [Attachment: string.gif]

    This means that we must match any Unicode character except for the quotation mark and the reverse solidus, or any escape sequence. The following are all characters that can appear escaped, which means that they are preceded by a reverse solidus (\):
    1. "
    2. \
    3. /
    4. b
    5. f
    6. n
    7. r
    8. t
    9. u


    The simplest way to match a JSON String is to use regular expressions. The regular expression pattern that I use is:
    Code:
    "([^"\\]|\\[\\\/bfnrt])*
    The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

    Inside the function we will check whether the RegEx found a match and whether the match was at the beginning of the String. If the match meets both conditions, then return a Token with the category string and the value equal to everything between the opening and closing quotes.

    Finally we will shorten the source parameter by the length of the matched String.
    Code:
    Private Function Scan_String(ByRef source As String) As Token
        ' Requires Imports System.Text.RegularExpressions at the top of the file
        Dim m As Match = Regex.Match(source, """([^""\\]|\\[""\\\/bfnrt])*""")
        Dim t As Token = Nothing
    
        If m.Success AndAlso m.Index = 0 Then
            Dim literal As String = m.Value.Remove(0, 1)
            literal = literal.Remove(literal.Length - 1)
            t = New Token("string", literal)
            source = source.Substring(m.Value.Length)
        End If
    
        Return t
    End Function
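    For example (the sample input here is arbitrary), calling the function trims the matched text off the front of the source and returns the token:
    Code:
    ' Arbitrary sample input, just to show the ByRef behaviour
    Dim json As String = """name"": ""foo"""
    Dim t As Token = Scan_String(json)
    ' t.Category is "string" and t.Value is "name"
    ' json now starts at the colon because the matched text was removed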
    The following image is a railroad diagram of a Number as defined by the JSON language:
    [Attachment: number.gif]

    This means that we can have an optional negative sign, followed by one or more digits. We can also optionally have a decimal point followed by one or more digits. Finally, we can optionally have the letter e or E, followed by an optional + or - sign, followed by one or more digits.

    The simplest way to match a JSON Number is to use regular expressions. The regular expression pattern that I use is:
    Code:
    -?\d+((\.\d+)?([eE][+-]?\d+)?)

    The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

    Inside the function we will check whether the RegEx found a match and whether the match was at the beginning of the String. If the match meets both conditions, then return a Token with the category number and the value equal to the match.

    Finally we will shorten the source parameter by the length of the matched String.
    Code:
    Private Function Scan_Number(ByRef source As String) As Token
        Dim m As Match = Regex.Match(source, "-?\d+((\.\d+)?([eE][+-]?\d+)?)")
        Dim t As Token = Nothing
    
        If m.Success AndAlso m.Index = 0 Then
        t = New Token("number", m.Value)
            source = source.Substring(m.Value.Length)
        End If
    
        Return t
    End Function
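    For example (again, an arbitrary snippet), a number with an exponent is consumed in a single call:
    Code:
    ' Arbitrary sample input
    Dim json As String = "-1.5e3, true"
    Dim t As Token = Scan_Number(json)
    ' t.Category is "number" and t.Value is "-1.5e3"
    ' json is now ", true"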
    The following image is a railroad diagram of a Boolean as defined by the JSON language:
    [Attachment: boolean.gif]

    This means that we need to match the literal values:
    • true
    • false
    • null


    While you could match JSON Boolean values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.

    The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

    Inside the function we will check whether the beginning of the source code matches any of the literals. If it does, then return a Token with the category boolean and the value equal to the match.

    Finally we will shorten the source parameter by the length of the matched String.
    Code:
    Private Function Scan_Boolean(ByRef source As String) As Token
        Dim t As Token = Nothing
    
        If source.Length >= 4 AndAlso (source.Substring(0, 4).Equals("true", StringComparison.OrdinalIgnoreCase) OrElse source.Substring(0, 4).Equals("null", StringComparison.OrdinalIgnoreCase)) Then
            t = New Token("boolean", source.Substring(0, 4))
            source = source.Substring(4)
        ElseIf source.Length >= 5 AndAlso source.Substring(0, 5).Equals("false", StringComparison.OrdinalIgnoreCase) Then
            t = New Token("boolean", source.Substring(0, 5))
            source = source.Substring(5)
        End If
    
        Return t
    End Function
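    For example (an arbitrary snippet), note that "false" consumes five characters while "true" and "null" consume four:
    Code:
    ' Arbitrary sample input
    Dim json As String = "false]"
    Dim t As Token = Scan_Boolean(json)
    ' t.Category is "boolean" and t.Value is "false"
    ' json is now "]"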
    The following image is a railroad diagram of a Symbol as defined by the JSON language:
    [Attachment: symbol.gif]

    This means that we need to match the literal values:
    • [
    • ]
    • {
    • }
    • :
    • ,


    While you could match JSON symbol values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.

    The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

    Inside the function we will check whether the first character of the source code appears in an array containing the symbol literals (i.e. Array.IndexOf does not return -1). If it does, then return a Token with the category symbol and the value equal to the match.

    Finally we will shorten the source parameter by the length of the matched String.
    Code:
    Private Function Scan_Symbol(ByRef source As String) As Token
        Dim t As Token = Nothing
        Dim symbols() As Char = "[]{}:,".ToCharArray()
    
        If source.Length > 0 AndAlso Array.IndexOf(symbols, source(0)) <> -1 Then
        t = New Token("symbol", source(0))
            source = source.Remove(0, 1)
        End If
    
        Return t
    End Function
    Whitespace is ignored in JSON. So what we will need to do is check whether the first character in the source code is whitespace and, if so, remove it from the source code without returning anything. However, you will still want the return type of the function to be a Token; I will expand on why a little later.
    Code:
    Private Function Scan_Whitespace(ByRef source As String) As Token
        If source.Length > 0 AndAlso String.IsNullOrWhiteSpace(source(0)) Then
            source = source.Remove(0, 1)
        End If
    
        Return Nothing
    End Function
    Lexical Analysis - Scanning
    Now that all of the category definitions have been implemented as functions, the next step is to loop through the source code until its length is 0.

    This will be done by creating a function that returns a fixed size collection of Tokens that has a single ByVal String parameter that represents the source code.

    Inside the function, use a Do/Until loop to loop until the length of the source code is 0. The following code is the shell for our function:
    Code:
    Private Function Scan(ByVal source As String) As Token()
        Do
    
        Loop Until source.Length = 0
    End Function
    Inside the Do/Until loop we will want to check whether the remaining source code matches any of the definitions we just created. This is also why I suggested making the return type of the whitespace definition a Token: in Visual Basic .NET we can create a Delegate Function that groups all of the definitions together. The following code should be placed outside of the function:
    Code:
    Private Delegate Function Scan_Definition(ByRef source As String) As Token
    Now outside of the Do/Until loop in the Scan function, declare:
    1. A dynamically sized collection variable to hold the scanned tokens
    2. A fixed-size collection variable to hold all of the definitions
    3. A placeholder token that will be reused within the Do/Until loop


    The following code is still not the final product of the Scan function but should be reflective of the code used so far:
    Code:
    Private Function Scan(ByVal source As String) As Token()
        Dim lexedTokens As New List(Of Token)
        Dim definitions() As Scan_Definition = {
            New Scan_Definition(AddressOf Scan_String),
            New Scan_Definition(AddressOf Scan_Number),
            New Scan_Definition(AddressOf Scan_Boolean),
            New Scan_Definition(AddressOf Scan_Symbol),
            New Scan_Definition(AddressOf Scan_Whitespace)
        }
        Dim t As Token = Nothing
    
        Do
    
        Loop Until source.Length = 0
    End Function
    Now inside of the Do/Until loop, loop through each definition and assign the result of the function to the token placeholder. If the token is not Nothing, then add it to the collection to be returned.

    Finally return the collection as a fixed sized collection. The following code is the final draft of the Scan function:
    Code:
    Private Function Scan(ByVal source As String) As Token()
        Dim lexedTokens As New List(Of Token)
        Dim definitions() As Scan_Definition = {
            New Scan_Definition(AddressOf Scan_String),
            New Scan_Definition(AddressOf Scan_Number),
            New Scan_Definition(AddressOf Scan_Boolean),
            New Scan_Definition(AddressOf Scan_Symbol),
            New Scan_Definition(AddressOf Scan_Whitespace)
        }
        Dim t As Token = Nothing
    
        Do
            For Each definition As Scan_Definition In definitions
                t = definition.Invoke(source)
                If t IsNot Nothing Then
                    lexedTokens.Add(t)
                End If
            Next
        Loop Until source.Length = 0
    
        Return lexedTokens.ToArray()
    End Function
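    To tie it all together, here is a quick sketch of scanning a small document and printing the resulting tokens. The JSON snippet and the Main method are just for demonstration and assume the Token class and the functions above live in the same module:
    Code:
    ' Demonstration only - the snippet and Main method are not part of the scanner itself
    Sub Main()
        Dim json As String = "{ ""id"": 1, ""active"": true }"
    
        For Each t As Token In Scan(json)
            Console.WriteLine("{0}: {1}", t.Category, t.Value)
        Next
        ' Output:
        '   symbol: {
        '   string: id
        '   symbol: :
        '   number: 1
        '   symbol: ,
        '   string: active
        '   symbol: :
        '   boolean: true
        '   symbol: }
    End Sub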
    Notes
    As you can tell, the process of lexical analysis is fairly simple, and the approach that I have shown is fairly universal. You could do the same thing for just about any other language that you need to scan.

    One thing to point out is that there is no prioritization in JSON. In some languages there is an identifier category that can match several different categories if all else fails. In scenarios such as these, you would stop iterating through the definitions as soon as one of them matches and then let the Do/Until loop repeat to start the process over, as sketched below.
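    A minimal sketch of that change, using the same Scan function as above, is to exit the For Each loop as soon as a definition produces a token:
    Code:
    ' Sketch only: prioritized scanning, restarting from the first definition after every match
    Do
        For Each definition As Scan_Definition In definitions
            t = definition.Invoke(source)
            If t IsNot Nothing Then
                lexedTokens.Add(t)
                Exit For ' restart from the highest-priority definition
            End If
        Next
    Loop Until source.Length = 0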