
Thread: [VB6] RegEx - Regular expression engine

  1. #1

    Thread Starter
    New Member nextyamb's Avatar
    Join Date
    Apr 2021
    Posts
    13

    [VB6] RegEx - Regular expression engine

    - The RegEx engine compiles a regular expression string into VM bytecode.
    - It was inspired by the article by Russ Cox.
    - Some features have been added and others have been changed.
    - Non-recursive implementation, except for lookahead.
    - This RegEx supports:


    1. Special characters, metacharacters:

    \t, \r, \v, \n, \f, \_ (Space)
    \+, \-, \*, \(, \) ...
    \xHH, HH is a two digit hex number

    2. Character classes:

    \d, \D, \w, \W, \s, \S

    3. Custom character classes (shorthand classes such as \d are not allowed inside the brackets):

    [xyz], [^xyz], [+*/-], [_a-fA-F0-9], . (Dot)

    4. Anchors:

    ^, $

    5. Quantifiers:

    Greedy, Lazy (?) and Possessive (+)

    6. Alternation:

    A|B|AB

    7. Grouping (non-captured and captured):

    (?:...), (...)

    8. Positive and negative lookahead:

    (?=...), (?!...)

    9. Atomic grouping:

    (?>...)

    10. Backreference

    \1..\9

    11. NEW: Regex search & replace:

    (?$C regex), \i (one char back), \I (Back to start)

    Example:
    ((?$S 0S?1)\i\i | . )+
    recognizes the language: {0^n 1^n}, n>0

    12. NEW: Keep the matched text, \K, \O:

    "\K [^"]* \O"

    - Caution! This software is intended for educational and experimental purposes and has not been sufficiently tested. There are no warranties of any kind!




    - Usage: Enter a regex and a text, then press the buttons Translate, Compile and Run, in that order.
    Last edited by nextyamb; May 6th, 2026 at 06:19 AM. Reason: Software improvement

  2. #2


    About the program:

    * The set of functions in the 'R2C.bas' module translates a regular expression into complete textual VM code with all the required labels. This is the most demanding part of the project. This part of the code is executed by pressing the 'Translate' button.
    * The virtual machine has about a dozen opcodes and operates with labels: character and range matching, Jump, Match, Subm and others, plus the Split function, which for better readability is divided into two sub-opcodes, Try and Defer. Execution is non-recursive, except for the Lookahead opcode, which is (for now) recursive.
    * The textual VM code must then be translated into an array of commands (the 'Compile' button), and finally the execution of the compiled VM code is started with the 'Run' button. If the regular expression is not changed, but only the searched text is modified, recompilation is not necessary and execution can proceed directly with the 'Run' command.
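
    The 'Compile' step described above can be pictured roughly as follows. This is only a sketch in Python, not the VB6 code from the attachment: the function name compile_vm, the tuple layout of a command and the LABEL_OPS set are assumptions made for illustration. It reads textual VM code, records the instruction index of each label, and replaces label arguments with those indices.

```python
# Assumption: these are the opcodes whose last argument is a label
# (names taken from the opcode list later in this thread).
LABEL_OPS = {"Jump", "Try", "Defer", "If", "IfR", "Look+", "Look-"}


def compile_vm(text):
    """Turn textual VM code into a list of (opcode, arg, ...) tuples.

    A label such as 'S1:' is mapped to the index of the instruction that
    follows it; a label on the last line points one past the program end.
    """
    labels = {}
    pending = []                      # (opcode, raw argument list)
    for line in text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].endswith(":"):
            labels[tokens[0][:-1]] = len(pending)
            tokens = tokens[1:]
        if tokens:
            pending.append((tokens[0], tokens[1:]))

    program = []
    for op, args in pending:
        if op in LABEL_OPS:
            # Second pass: forward references are known by now.
            args = args[:-1] + [labels[args[-1]]]
        program.append((op, *args))
    return program


source = """
S1:  Defer S2
     Char a
     Jump S1
S2:  Match
"""
print(compile_vm(source))
```

    Resolving labels in a second pass is what allows forward jumps such as 'Defer S2'. It also matches the remark above that only 'Run' is needed when just the searched text changes, because the compiled array does not depend on the input.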



    2204:
    - Initial release
    2904:
    - Two bugs fixed: the non-capturing group syntax is now (?:...) instead of (: ...), and the range 0-9 now works correctly (the previous version matched only 0-8).
    - All possessive quantifiers changed.
    - Added atomic groups (?>...).

  3. #3


    Virtual Machine opcodes:

    0. Char arg, where arg is a character or a number of two or more digits.
    If the current input character is arg, the command succeeds. Otherwise, the VM enters the Fail state.

    Char x , recognizes char x
    Char 9 , recognizes digit 9
    Char 09 , recognizes Tab character, Chr(9)
    Char 32 , recognizes space

    1. Range n1 n2, where n1, n2 are ASCII values.

    If the ASCII code of the current character is in the interval [n1, n2), the command succeeds, otherwise it fails.

    2. IfR n1 n2 L1, where n1 < n2 (n1, n2 - ASCII values), L1 is a label.

    If the ASCII code of the current character is within the interval [n1, n2), execution jumps to the line starting with label L1 and the input advances to the next character. Otherwise, the VM continues execution with the next line.

    3. If c L1, where c is a character or a number of two or more digits. L1 is a label.

    If the input character is c, execution jumps to the line starting with label L1 and the input advances to the next character. Otherwise, the VM continues execution with the next line (the VM does not enter the Fail state as in the Char command).

    4. Jump L1

    An unconditional jump that transfers script execution to the line starting with label L1.

    5. Match

    A command that successfully terminates the script and outputs the current position of the input character.

    6. Try L1

    A branching command that attempts to reach the Match command by jumping to the line labeled L1.
    If it succeeds, execution terminates successfully. If it fails, execution continues with the next line.

    7. Defer L1

    A branching command that attempts to reach the Match command by continuing execution from the next line.
    If it succeeds, execution terminates successfully. If it fails, execution continues by jumping to the line starting with label L1.

    8. Subm n

    If n>0, stores the current position in the searched string in the array Tsubm (temporary). Tsubm is used to obtain intermediate results in case of successful execution. If n<0, stores the current position in the array subm (commit).

    9. Backref n

    Used for \1..\9. Tries to match the same text that was previously captured by a 'Subm n' opcode.

    10. Look(+-) L1

    Lookahead opcode. Look+ L1 or Look- L1 provide lookahead capability.

    11. Open

    Opens a new no-backtrack session. A fail event will close it.

    12. Clear

    Clears the stack to the position where the last 'Open' pointed.
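
    To make the opcode descriptions above concrete, here is a minimal non-recursive interpreter for eight of them (Char, Range, IfR, If, Jump, Match, Try, Defer). It is a Python sketch, not the VB6 implementation: the tuple layout of the program, the function names, and the choice to read end of input as code 0 (my reading of the '$ -> Char 00' rule) are assumptions. Subm, Backref, Look, Open and Clear are left out.

```python
def char_code(arg):
    """One character stands for itself; two or more digits are a character code."""
    return int(arg) if len(arg) >= 2 and arg.isdigit() else ord(arg)


def run(program, text, start=0):
    """Run `program` on `text` from `start`; return the match end or None."""
    stack = []                 # saved (pc, pos) alternatives instead of recursion
    pc, pos = 0, start
    while True:
        ok = pc < len(program)             # running off the end counts as Fail
        if ok:
            op, *args = program[pc]
            # Assumption: reading past the end yields code 0.
            cur = ord(text[pos]) if pos < len(text) else 0
            step = 1 if pos < len(text) else 0   # never advance past the end
            if op == "Char":
                ok = cur == char_code(args[0])
                pc, pos = pc + 1, pos + step
            elif op == "Range":                  # half-open interval [n1, n2)
                ok = args[0] <= cur < args[1]
                pc, pos = pc + 1, pos + step
            elif op == "IfR":                    # jump and consume, else fall through
                if args[0] <= cur < args[1]:
                    pc, pos = args[2], pos + step
                else:
                    pc += 1
            elif op == "If":
                if cur == char_code(args[0]):
                    pc, pos = args[1], pos + step
                else:
                    pc += 1
            elif op == "Jump":
                pc = args[0]
            elif op == "Try":                    # jump first, next line on failure
                stack.append((pc + 1, pos))
                pc = args[0]
            elif op == "Defer":                  # next line first, jump on failure
                stack.append((args[0], pos))
                pc += 1
            elif op == "Match":
                return pos
            else:
                raise ValueError(f"unsupported opcode: {op}")
        if not ok:                               # Fail state: backtrack or give up
            if not stack:
                return None
            pc, pos = stack.pop()


# a*b, assembled by hand with label targets already resolved to indices
a_star_b = [("Defer", 3), ("Char", "a"), ("Jump", 0), ("Char", "b"), ("Match",)]
print(run(a_star_b, "aaab"))
```

    Try and Defer differ only in which branch is saved on the stack, which is what later distinguishes greedy from lazy quantifiers in the translation table.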


    Translating regex into VM code:

    Code:
    Regex		VM code
    ===============================
    c	    	Char c
    $		Char 00
    \x20		Char &H20
    .		Range 1 256
    -------------------------------
    e1e2	    	e1 code
    	    	e2 code
    -------------------------------
    e1|e2	    	Defer S1
    	    	e1 code
    	    	Jump S2
    	   S1:  e2 code
    	   S2:
    ===============================
    
    Character classes:
    
    [0-9a-z_]	IfR 48 58 S1
    		IfR 97 123 S1
    		If _ S1
    		Char 256 (fail) 
    	   S1:
    
    [^0-9+-]	IfR 48 58 S1
    		If + S1
    		If - S1
    		Jump S2
    	   S1:  Char 256 (fail) 
    	   S2:  Range 1 256
    
    ===============================
    
    Quantifiers:
    
    Range Quantifiers not implemented yet
    
    a) Greedy
    
    e?	    	Defer S1
    	    	e code
    	   S1:
    -------------------------------
    e*	   S1:  Defer S2
    	    	e code
    	    	Jump S1
    	   S2:
    -------------------------------
    e+	   S1:  e code
    	    	Try S1
    
    b) Lazy 
    
    e??	    	Try S1
    	    	e code
    	   S1:
    -------------------------------
    e*?	   S1:  Try S2
    	    	e code
    	    	Jump S1
    	   S2:
    -------------------------------
    e+?	   S1:  e code
    	    	Defer S1
    
    c) Possessive
    
    e?+		Open
    	    	Defer S1
    	    	e code
    		Clear
    	   S1:
    -------------------------------
    e*+		Open
    	   S1:  Defer S2
    	    	e code
    		Clear
    	    	Jump S1
    	   S2:
    -------------------------------
    e++	   	Open  
    	   S1:	e code
    		Clear
    	    	Try S1
    ===============================
    
    Lookahead:
    
    (?=e)		Look+ S1
    		e code
    		Match
    	   S1: 
    
    (?!e)		Look- S1
    		e code
    		Match
    	   S1: 
    
    ===============================
    
    Atomic groups:
    
    (?>e)		Open
    		e code
    		Clear
    
    Ineffective.
    Use possessive quantifiers instead.
    
    ===============================
    
    Grouping:
    
    (e)		Subm 2
    		e code
    		Subm -3
    
    Subm 0..1, reserved for matched text
    
    \K, \O example:
    
    ("\K.+\O")	Subm 2
    		Char "
    		Subm 2
    	   A0:  Range 1 256
    		Try A0
    		Subm 3
    		Char "
    		Subm -3
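
    The concatenation, alternation and greedy/lazy quantifier templates from the table above can be generated mechanically. The following Python sketch shows one way to do it; it is not the R2C.bas code. The nested-tuple tree, the function names emit and to_vm_text, and the closing Match line are assumptions for illustration. Labels are written on their own lines rather than in front of the next instruction.

```python
import itertools


def emit(node, out, labels):
    """Append the VM lines for `node` to `out`, following the table's templates."""
    kind = node[0]
    if kind == "char":                        # c      -> Char c
        out.append(f"Char {node[1]}")
    elif kind == "cat":                       # e1e2   -> e1 code, e2 code
        emit(node[1], out, labels)
        emit(node[2], out, labels)
    elif kind == "alt":                       # e1|e2
        s1, s2 = next(labels), next(labels)
        out.append(f"Defer {s1}")
        emit(node[1], out, labels)
        out.append(f"Jump {s2}")
        out.append(f"{s1}:")
        emit(node[2], out, labels)
        out.append(f"{s2}:")
    elif kind in ("opt", "star", "plus"):     # e?, e*, e+ (node[2] is True for lazy)
        lazy = node[2]
        skip = "Try" if lazy else "Defer"     # branch placed before e (e?, e*)
        again = "Defer" if lazy else "Try"    # branch placed after e (e+)
        if kind == "opt":
            s1 = next(labels)
            out.append(f"{skip} {s1}")
            emit(node[1], out, labels)
            out.append(f"{s1}:")
        elif kind == "star":
            s1, s2 = next(labels), next(labels)
            out.append(f"{s1}:")
            out.append(f"{skip} {s2}")
            emit(node[1], out, labels)
            out.append(f"Jump {s1}")
            out.append(f"{s2}:")
        else:
            s1 = next(labels)
            out.append(f"{s1}:")
            emit(node[1], out, labels)
            out.append(f"{again} {s1}")
    else:
        raise ValueError(f"unknown node kind: {kind}")


def to_vm_text(node):
    """Return the VM lines for a whole expression."""
    labels = (f"S{i}" for i in itertools.count(1))
    out = []
    emit(node, out, labels)
    out.append("Match")       # assumption: a complete program ends in Match
    return out


# (a|b)* as a tree, then as VM text
print("\n".join(to_vm_text(("star", ("alt", ("char", "a"), ("char", "b")), False))))
```

    Drawing labels from one shared counter keeps them unique when templates nest, as in the (a|b)* example.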

    A grammar for non-empty regular expressions:

    a -> '|'
    q -> [+*?]
    c -> char

    R -> T(aT)*
    T -> F+
    F -> Aq?
    A -> c | '(' R ')'

    Transcribed, concise grammar, used for regex recognition and translation:

    R -> (cq? | Rq?)+ | '(' R ')' | R(aR)+
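
    The first grammar (R, T, F, A) maps directly onto a recursive-descent recognizer, one function per nonterminal. The Python sketch below only answers whether a string is a well-formed regex under that grammar. It assumes that 'char' means any character other than | ( ) + * ?, and the names is_regex and META are made up for the example. It is not the recognizer used in the project, which works from the transcribed form.

```python
META = set("|()+*?")   # assumption: everything else counts as a literal 'char'


def is_regex(s):
    """Return True if `s` is derivable from R in the grammar above."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def parse_R():                     # R -> T (a T)*
        nonlocal pos
        if not parse_T():
            return False
        while peek() == "|":
            pos += 1
            if not parse_T():
                return False
        return True

    def parse_T():                     # T -> F+
        if not parse_F():
            return False
        # Keep going while the next symbol can start an F: '(' or a literal.
        while peek() and (peek() == "(" or peek() not in META):
            if not parse_F():
                return False
        return True

    def parse_F():                     # F -> A q?
        nonlocal pos
        if not parse_A():
            return False
        if peek() and peek() in "+*?":
            pos += 1
        return True

    def parse_A():                     # A -> c | '(' R ')'
        nonlocal pos
        ch = peek()
        if ch and ch not in META:
            pos += 1
            return True
        if ch == "(":
            pos += 1
            if not parse_R() or peek() != ")":
                return False
            pos += 1
            return True
        return False

    return parse_R() and pos == len(s)


print(is_regex("a(b|c)*d"), is_regex("a|"))
```

    Because F allows at most one q, stacked quantifiers such as a** are rejected by this recognizer, so lazy and possessive forms would need an extended q rule.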
    Last edited by nextyamb; May 3rd, 2026 at 05:50 AM.

  4. #4


    0605:
    - \xhh matches a character with a two digit hex code. Works on the entire regex.
    - Possessive quantifiers changed again.
    - The [abc\] bug is fixed.
    - Added backreference: \1 .. \9
    - The error handler has been improved.
    - Improved GUI.
