Prerequisites
You must have a solid understanding of the following:
- The fundamentals of your programming language of choice.
- Common programming terms, e.g. iterate, conditional statement, etc.
- If you are not using Visual Basic .NET, how the code I present to you translates into your language.
Terminology
I will be using some terms throughout this tutorial that are not used in everyday coding, so I would like to define how I will be using them:
- Lexeme – The most basic unit in a given source code, e.g. in the source code:
Code:
Dim foo As Integer = 1
each word, symbol, and number is a lexeme.
- Token – A key/value pair where the key is typically a rule name and the value is typically a lexeme, e.g. in the source code:
Code:
Dim foo As Integer = 1
the equal sign would likely be a token where the key is “assignment” and the value is “=”.
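To make the distinction concrete, a scanner for that line might emit the following category/value pairs (the category names here are hypothetical, just for illustration):
Code:
keyword    -> Dim
identifier -> foo
keyword    -> As
type       -> Integer
assignment -> =
number     -> 1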
Preface
Programming Language: Visual Basic .NET using the .NET framework 4.5
Target Language: JSON
Why JSON:
- JSON’s syntax makes it one of the easiest languages to scan/parse.
- The syntax will likely stay constant for a very long time.
Lexical Analysis - Basics
The process of lexical analysis, also known as scanning, is to read the source code and group its lexemes into meaningful tokens.
The first step in the lexical analysis phase is to create a Token class. This class can be as simple as two String properties: Category and Value.
The Category property will represent the group that a lexeme belongs to.
The Value property will represent the lexeme value found.
Token
Code:
Public Class Token

    Private _category As String
    Public Property Category As String
        Get
            Return _category
        End Get
        Set(ByVal value As String)
            If Not String.Equals(_category, value) Then
                _category = value
                OnCategoryChanged()
            End If
        End Set
    End Property

    Private _value As String
    Public Property Value As String
        Get
            Return _value
        End Get
        Set(ByVal value As String)
            If Not String.Equals(_value, value) Then
                _value = value
                OnValueChanged()
            End If
        End Set
    End Property

    Public Event CategoryChanged(ByVal sender As Object, ByVal e As EventArgs)
    Public Event ValueChanged(ByVal sender As Object, ByVal e As EventArgs)

    Protected Overridable Sub OnCategoryChanged()
        RaiseEvent CategoryChanged(Me, EventArgs.Empty)
    End Sub

    Protected Overridable Sub OnValueChanged()
        RaiseEvent ValueChanged(Me, EventArgs.Empty)
    End Sub

    Public Sub New()
        _category = String.Empty
        _value = String.Empty
    End Sub

    Public Sub New(ByVal category As String, ByVal value As String)
        _category = category
        _value = value
    End Sub

End Class
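As a quick sanity check, a Token can be created and read back like so (a minimal sketch, assuming the class above is in scope):
Code:
Dim t As New Token("string", "hello")
Console.WriteLine(t.Category & ": " & t.Value) ' prints string: hello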
Lexical Analysis - Categories
Now that there is a structure in place to hold the tokens, we need to create functions that will match lexemes by category.
In JSON there are 5 different categories:
- String
- Number
- Boolean (this tutorial also covers the null literal under this category)
- Symbols
- Whitespace
The following image is a railroad diagram of a String as defined by the JSON language:

This means that a String is a sequence of characters wrapped in double quotes, where each character is either any Unicode character except the double quote and the reverse solidus, or an escape sequence. An escape sequence is a reverse solidus (\) followed by one of:
- "
- \
- /
- b
- f
- n
- r
- t
- u followed by four hexadecimal digits
The simplest way to match a JSON String is to use regular expressions. The regular expression pattern that I use is:
Code:
"([^"\\]|\\["\\\/bfnrt]|\\u[0-9a-fA-F]{4})*"
The function used will have a single ByRef String parameter to represent the source code that we have left to scan.
Inside the function we will check whether the RegEx matched and whether the match was at the beginning of the String. If both conditions are met then we return a Token with the category string and the value equal to everything between the opening and closing quotes.
Finally we will shorten the source parameter by the length of the matched String.
Code:
Private Function Scan_String(ByRef source As String) As Token
    Dim m As Match = Regex.Match(source, """([^""\\]|\\[""\\\/bfnrt]|\\u[0-9a-fA-F]{4})*""")
    Dim t As Token = Nothing
    If m.Success AndAlso m.Index = 0 Then
        ' Strip the opening and closing quotes from the matched literal.
        Dim literal As String = m.Value.Remove(0, 1)
        literal = literal.Remove(literal.Length - 1)
        t = New Token("string", literal)
        source = source.Substring(m.Value.Length)
    End If
    Return t
End Function
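To see the function in action, here is a small sketch (the input value is just an example):
Code:
Dim src As String = """hello"": 1"
Dim tk As Token = Scan_String(src)
' tk.Category is "string", tk.Value is "hello",
' and src has been shortened to ": 1"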
The following image is a railroad diagram of a Number as defined by the JSON language:

This means that a Number can have an optional negative sign, followed by one or more digits. It can optionally be followed by a decimal point and one or more digits, and optionally by the letter e or E, an optional + or – sign, and one or more digits.
The simplest way to match a JSON Number is to use regular expressions. The regular expression pattern that I use is:
Code:
-?\d+((\.\d+)?([eE][+-]?\d+)?)
The function used will have a single ByRef String parameter to represent the source code that we have left to scan.
Inside the function we will check whether the RegEx matched and whether the match was at the beginning of the String. If both conditions are met then we return a Token with the category number and the value equal to the match.
Finally we will shorten the source parameter by the length of the matched String.
Code:
Private Function Scan_Number(ByRef source As String) As Token
    Dim m As Match = Regex.Match(source, "-?\d+((\.\d+)?([eE][+-]?\d+)?)")
    Dim t As Token = Nothing
    If m.Success AndAlso m.Index = 0 Then
        t = New Token("number", m.Value)
        source = source.Substring(m.Value.Length)
    End If
    Return t
End Function
The following image is a railroad diagram of a Boolean as defined by the JSON language:

This means that we need to match the literal values true, false, and null.
While you could match JSON Boolean values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.
The function used will have a single ByRef String parameter to represent the source code that we have left to scan.
Inside the function we will check whether the source starts with any of the Boolean literals. If it does then we return a Token with the category boolean and the value equal to the match.
Finally we will shorten the source parameter by the length of the matched String.
Code:
Private Function Scan_Boolean(ByRef source As String) As Token
    Dim t As Token = Nothing
    If source.Length >= 4 AndAlso (source.Substring(0, 4).Equals("true", StringComparison.OrdinalIgnoreCase) OrElse source.Substring(0, 4).Equals("null", StringComparison.OrdinalIgnoreCase)) Then
        t = New Token("boolean", source.Substring(0, 4))
        source = source.Substring(4)
    ElseIf source.Length >= 5 AndAlso source.Substring(0, 5).Equals("false", StringComparison.OrdinalIgnoreCase) Then
        t = New Token("boolean", source.Substring(0, 5))
        source = source.Substring(5)
    End If
    Return t
End Function
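A quick sketch of the function in use (note that this function also reports null under the boolean category):
Code:
Dim src As String = "null, 1"
Dim tk As Token = Scan_Boolean(src)
' tk.Category is "boolean", tk.Value is "null",
' and src has been shortened to ", 1"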
The following image is a railroad diagram of a Symbol as defined by the JSON language:

This means that we need to match the literal values { } [ ] : ,
While you could match JSON symbol values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.
The function used will have a single ByRef String parameter to represent the source code that we have left to scan.
Inside the function we will check whether the first character of the source appears in an array containing the symbol literals (Array.IndexOf returns -1 when it does not). If it does then we return a Token with the category symbol and the value equal to the match.
Finally we will shorten the source parameter by the length of the matched String.
Code:
Private Function Scan_Symbol(ByRef source As String) As Token
    Dim t As Token = Nothing
    Dim symbols() As Char = "[]{}:,".ToCharArray()
    If source.Length > 0 AndAlso Array.IndexOf(symbols, source(0)) <> -1 Then
        t = New Token("symbol", source(0).ToString())
        source = source.Remove(0, 1)
    End If
    Return t
End Function
Whitespace is ignored in JSON, so we simply check whether the first character in the source code is whitespace and, if so, remove it without returning anything. However, you will still want the return type of the function to be a Token; I will expand on why a little later.
Code:
Private Function Scan_Whitespace(ByRef source As String) As Token
    If source.Length > 0 AndAlso Char.IsWhiteSpace(source(0)) Then
        source = source.Remove(0, 1)
    End If
    Return Nothing
End Function
Lexical Analysis - Scanning
Now that all the definitions are implemented as functions, the next step is to loop through the source code until its length is 0.
This will be done by creating a function that returns a fixed-size collection of Tokens and has a single ByVal String parameter that represents the source code.
Inside the function, use a Do/Until loop that runs until the length of the source code is 0. The following code is the shell of our function:
Code:
Private Function Scan(ByVal source As String) As Token()
    Do
    Loop Until source.Length = 0
End Function
Inside the Do/Until loop we will want to check whether the remaining source code matches any of the definitions we just created. This is why I suggested making the return type of the whitespace definition a Token: in Visual Basic .NET we can create a Delegate Function that groups all the definitions together. The following code should be placed outside of the function:
Code:
Private Delegate Function Scan_Definition(ByRef source As String) As Token
Now outside of the Do/Until loop in the Scan function, declare:
- A dynamically-sized collection variable to hold the scanned tokens
- A fixed-size collection variable to hold all the definitions
- A placeholder token that will be reused within the Do/Until loop
The following code is still not the final product of the Scan function but should be reflective of the code used so far:
Code:
Private Function Scan(ByVal source As String) As Token()
    Dim lexedTokens As New List(Of Token)
    Dim definitions() As Scan_Definition = {New Scan_Definition(AddressOf Scan_String),
                                            New Scan_Definition(AddressOf Scan_Number),
                                            New Scan_Definition(AddressOf Scan_Boolean),
                                            New Scan_Definition(AddressOf Scan_Symbol),
                                            New Scan_Definition(AddressOf Scan_Whitespace)}
    Dim t As Token = Nothing
    Do
    Loop Until source.Length = 0
End Function
Now inside of the Do/Until loop, loop through each definition and assign the token placeholder the result of the function. If the token is not a null reference then add it to the collection to be returned.
Finally, return the collection as a fixed-size collection. The following code is the final draft of the Scan function:
Code:
Private Function Scan(ByVal source As String) As Token()
    Dim lexedTokens As New List(Of Token)
    Dim definitions() As Scan_Definition = {New Scan_Definition(AddressOf Scan_String),
                                            New Scan_Definition(AddressOf Scan_Number),
                                            New Scan_Definition(AddressOf Scan_Boolean),
                                            New Scan_Definition(AddressOf Scan_Symbol),
                                            New Scan_Definition(AddressOf Scan_Whitespace)}
    Dim t As Token = Nothing
    Do
        For Each definition As Scan_Definition In definitions
            t = definition.Invoke(source)
            If t IsNot Nothing Then
                lexedTokens.Add(t)
            End If
        Next
        ' Note: if no definition consumes any input (i.e. the source is not
        ' valid JSON), this loop will never terminate.
    Loop Until source.Length = 0
    Return lexedTokens.ToArray()
End Function
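Putting it all together, here is a sketch of the scanner running over a small JSON document (assuming all of the functions above live in the same class or module):
Code:
Dim tokens() As Token = Scan("{ ""a"": [true, 1.5] }")
For Each tk As Token In tokens
    Console.WriteLine(tk.Category & " -> " & tk.Value)
Next
' Output:
' symbol -> {
' string -> a
' symbol -> :
' symbol -> [
' boolean -> true
' symbol -> ,
' number -> 1.5
' symbol -> ]
' symbol -> }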
Notes
As you can tell, the process of lexical analysis is fairly simple, and the process I have shown is universal: you could do the same thing for just about any other language that you need to scan.
One thing to point out is that there is no prioritization in JSON. In some languages there is an identifier category which can match several other categories if all else fails. In scenarios such as these, you would stop iterating through the definitions as soon as one matches and let the Do/Until loop repeat to start the process over.