- This RegEx engine compiles a regular expression string into VM bytecode.
- It was inspired by the article by Russ Cox.
- Some features have been added and others have been changed.
- Non-recursive implementation, except for lookahead.
- This RegEx engine supports:
1. Special characters, metacharacters:
\t, \r, \v, \n, \f, \_ (Space)
\+, \-, \*, \(, \) ...
\xHH, HH is a two digit hex number
2. Character classes:
\d, \D, \w, \W, \s, \S
3. Custom character classes (the shorthand classes from item 2 are not allowed inside the brackets):
[xyz], [^xyz], [+*/-], [_a-fA-F0-9], . (Dot)
4. Anchors:
^, $
5. Quantifiers:
Greedy, Lazy (?) and Possessive (+)
6. Alternation:
A|B|AB
7. Grouping (non-captured and captured):
(?:...), (...)
8. Positive and negative lookahead:
(?=...), (?!...)
9. Atomic grouping:
(?>...)
10. Backreference:
\1..\9
11. NEW: Regex search & replace:
(?$C regex), \i (one char back), \I (Back to start)
- Caution! This software is intended for educational and experimental purposes and has not been sufficiently tested. There are no warranties of any kind!
- Usage: Enter a regex and a text, then press the buttons Translate, Compile and Run, in that order.
Last edited by nextyamb; May 6th, 2026 at 06:19 AM.
Reason: Software improvement
* The set of functions in the 'R2C.bas' module translates a regular expression into complete textual VM code with all the required labels. This is the most demanding part of the project. This part of the code is executed by pressing the 'Translate' button.
* The virtual machine has about a dozen opcodes and operates with labels. The opcodes cover character and range matching, Jump, Match, Subm and a few others. The Split function is, for better readability, divided into two opcodes: Try and Defer. Execution is non-recursive, except for the Lookahead opcode, which is (for now) recursive.
* The textual VM code must then be translated into an array of commands (the 'Compile' button), and finally the execution of the compiled VM code is started with the 'Run' button. If the regular expression is not changed, but only the searched text is modified, recompilation is not necessary and execution can proceed directly with the 'Run' command.
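To illustrate the 'Compile' step, here is a minimal Python sketch (not the code from the project). It turns textual VM code into an array of commands and resolves each label to the index of the command it points at. The names compile_vm and LABEL_OPS are made up for this example, and the list of opcodes that take a label is an assumption based on the opcode list in this thread.

```python
# Opcodes whose last argument is a label (assumption based on the opcode list).
LABEL_OPS = {"Jump", "Try", "Defer", "If", "IfR", "Look+", "Look-"}

def compile_vm(text):
    """Turn textual VM code into a list of (opcode, args) commands."""
    labels = {}     # label name -> index of the command it points at
    pending = []    # commands in program order, labels still unresolved
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        # A leading token such as 'S1:' labels the next command.
        if tokens[0].endswith(":") and len(tokens[0]) > 1:
            labels[tokens[0][:-1]] = len(pending)
            tokens = tokens[1:]
        if tokens:
            pending.append((tokens[0], tokens[1:]))
    program = []
    for op, args in pending:
        if op in LABEL_OPS:
            # Replace the label name with the command index.
            args = args[:-1] + [labels[args[-1]]]
        program.append((op, args))
    return program

vm_text = """
S1: Defer S2
    Char a
    Jump S1
S2: Match
"""
program = compile_vm(vm_text)
print(program)
# [('Defer', [3]), ('Char', ['a']), ('Jump', [0]), ('Match', [])]
```

Because the labels are already resolved, the same compiled array can be run again on a different text without recompiling, which is the point of separating 'Compile' from 'Run'.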
2204:
- Initial release
2904:
- Two bugs fixed: the non-capturing group (: ...) is now written (?:...), and the range 0-9 now works correctly (it matched 0-8 in the previous version).
- All possessive quantifiers changed.
- Added atomic groups (?>...).
0. Char arg, where arg is a character or a number of two or more digits (a number is read as a character code).
If the current input character is arg, the command succeeds. Otherwise, the VM enters the Fail state.
Char x , recognizes char x
Char 9 , recognizes digit 9
Char 09 , recognizes Tab character, Chr(9)
Char 32 , recognizes space
1. Range n1 n2, where n1, n2 are ASCII values.
If the ASCII code of the current character is in the interval [n1, n2), the command succeeds, otherwise it fails.
2. IfR n1 n2 L1, where n1 < n2 (n1, n2 - ASCII values), L1 is a label.
If the ASCII code of the current character is within the interval [n1, n2), execution jumps to the line starting with label L1 and the input advances to the next character. Otherwise, the VM continues execution with the next line.
3. If c L1, where c is a character or a number of two or more digits. L1 is a label.
If the input character is c, execution jumps to the line starting with label L1 and the input advances to the next character. Otherwise, the VM continues execution with the next line (the VM does not enter the Fail state as in the Char command).
4. Jump L1
An unconditional jump that transfers script execution to the line starting with label L1.
5. Match
A command that successfully terminates the script and outputs the current position of the input character.
6. Try L1
A branching command that attempts to reach the Match command by jumping to the line labeled L1.
If it succeeds, execution terminates successfully. If it fails, execution continues with the next line.
7. Defer L1
A branching command that attempts to reach the Match command by continuing execution from the next line.
If it succeeds, execution terminates successfully. If it fails, execution continues by jumping to the line starting with label L1.
8. Subm n
If n>0, stores the current position in the searched string in the array Tsubm (temporary). It is used to obtain intermediate results in case of successful execution. If n<0, stores the current position in the array subm (commit).
9. Backref n
For \1..\9. Tries to match the same text that was previously captured by a 'Subm n' opcode.
10. Look(+-) L1
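Lookahead opcode. The lookahead body runs from the next line up to its own Match command, without consuming input. Look+ L1 continues at label L1 if the body matches and fails otherwise; Look- L1 continues at L1 if the body does not match and fails otherwise.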
11. Open
Opens a new no-backtrack session. A fail event will close it.
12. Clear
Clears the stack back to the position that the last 'Open' pointed to.
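The opcodes 0 to 7 above can be tried out with a small non-recursive backtracking loop. This is only a Python sketch under a few assumptions, not the VM from the project: Char and Range advance the input on success, a NUL character is appended to mark the end of the input (so that 'Char 00' can match it), the match is attempted at position 0 only, and jump targets are already command indexes. The names run and char_code are made up for the example. Subm, Backref, Look, Open and Clear are left out.

```python
def char_code(arg):
    """Decode a Char/If argument: 'x' -> 120, '9' -> 57, '09' -> 9, '32' -> 32."""
    if arg.startswith("&H") and len(arg) > 2:
        return int(arg[2:], 16)          # VB-style hex, e.g. &H20
    if len(arg) >= 2 and arg.isdigit():
        return int(arg)                  # two or more digits: character code
    return ord(arg)                      # a single literal character

def run(program, text):
    """Return the input position at Match, or None if every path fails."""
    data = text + "\0"                   # assumption: NUL marks end of input
    stack = []                           # saved alternatives: (pc, pos)
    pc = pos = 0
    while True:
        op, args = program[pc]
        cur = ord(data[pos]) if pos < len(data) else -1
        failed = False
        if op == "Match":
            return pos
        if op == "Char":
            failed = cur != char_code(args[0])
            pc, pos = pc + 1, pos + 1
        elif op == "Range":              # half-open interval [n1, n2)
            failed = not (int(args[0]) <= cur < int(args[1]))
            pc, pos = pc + 1, pos + 1
        elif op == "If":                 # never fails, only branches
            if cur == char_code(args[0]):
                pc, pos = args[1], pos + 1
            else:
                pc += 1
        elif op == "IfR":
            if int(args[0]) <= cur < int(args[1]):
                pc, pos = args[2], pos + 1
            else:
                pc += 1
        elif op == "Jump":
            pc = args[0]
        elif op == "Try":                # jump first, next line is the fallback
            stack.append((pc + 1, pos))
            pc = args[0]
        elif op == "Defer":              # next line first, label is the fallback
            stack.append((args[0], pos))
            pc += 1
        else:
            raise ValueError(f"opcode not covered by this sketch: {op}")
        if failed:                       # Fail state: resume the last alternative
            if not stack:
                return None
            pc, pos = stack.pop()

# a*a : greedy star followed by a literal 'a'
prog_star = [("Defer", [3]), ("Char", ["a"]), ("Jump", [0]),
             ("Char", ["a"]), ("Match", [])]
print(run(prog_star, "aaa"))    # 3
print(run(prog_star, "b"))      # None

# [0-9]+ : IfR with the half-open range 48..58, Char 256 as the fail command
prog_digits = [("IfR", ["48", "58", 2]), ("Char", ["256"]),
               ("Try", [0]), ("Match", [])]
print(run(prog_digits, "42x"))  # 2
```

Try and Defer push the branch that was not taken onto an explicit stack, which is what keeps the loop non-recursive.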
Translating regex into VM code:
Code:
Regex           VM code
===============================
c               Char c
$               Char 00
\x20            Char &H20
.               Range 1 256
-------------------------------
e1e2            e1 code
                e2 code
-------------------------------
e1|e2           Defer S1
                e1 code
                Jump S2
            S1: e2 code
            S2:
===============================
Character classes:
[0-9a-z_]       IfR 48 58 S1
                IfR 97 123 S1
                If _ S1
                Char 256 (fail)
            S1:
[^0-9+-]        IfR 48 58 S1
                If + S1
                If - S1
                Jump S2
            S1: Char 256 (fail)
            S2: Range 1 256
===============================
Quantifiers:
Range quantifiers are not implemented yet.
a) Greedy
e?              Defer S1
                e code
            S1:
-------------------------------
e*          S1: Defer S2
                e code
                Jump S1
            S2:
-------------------------------
e+          S1: e code
                Try S1
b) Lazy
e??             Try S1
                e code
            S1:
-------------------------------
e*?         S1: Try S2
                e code
                Jump S1
            S2:
-------------------------------
e+?         S1: e code
                Defer S1
c) Possessive
e?+             Open
                Defer S1
                e code
                Clear
            S1:
-------------------------------
e*+             Open
            S1: Defer S2
                e code
                Clear
                Jump S1
            S2:
-------------------------------
e++             Open
            S1: e code
                Clear
                Try S1
===============================
Lookahead:
(?=e)           Look+ S1
                e code
                Match
            S1:
(?!e)           Look- S1
                e code
                Match
            S1:
===============================
Atomic groups:
(?>e)           Open
                e code
                Clear
Ineffective.
Use possessive quantifiers instead.
===============================
Grouping:
(e)             Subm 2
                e code
                Subm -3
Subm 0..1 are reserved for the matched text.
\K, \O example:
("\K.+\O")      Subm 2
                Char "
                Subm 2
            A0: Range 1 256
                Try A0
                Subm 3
                Char "
                Subm -3
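The greedy, lazy and alternation templates from the table can be generated mechanically. The Python sketch below is one way to do it and is not taken from 'R2C.bas'. The class Translator and its methods are hypothetical names. It always puts a label on its own line, which differs slightly from the table layout, and it leaves out possessive quantifiers.

```python
class Translator:
    """Emits textual VM code for a few templates of the translation table."""

    def __init__(self):
        self.count = 0

    def label(self):
        # Fresh label names S1, S2, ... so nested templates never clash.
        self.count += 1
        return f"S{self.count}"

    def alternate(self, e1, e2):
        """e1|e2, where e1 and e2 are lists of VM code lines."""
        s1, s2 = self.label(), self.label()
        return [f"Defer {s1}", *e1, f"Jump {s2}", f"{s1}:", *e2, f"{s2}:"]

    def quantify(self, e, q):
        """e followed by q, one of: ? * + ?? *? +?"""
        if q not in ("?", "*", "+", "??", "*?", "+?"):
            raise ValueError(f"quantifier not covered by this sketch: {q}")
        lazy = len(q) == 2
        skip = "Try" if lazy else "Defer"   # branch that leaves e out
        loop = "Defer" if lazy else "Try"   # branch that repeats e
        if q[0] == "?":
            s1 = self.label()
            return [f"{skip} {s1}", *e, f"{s1}:"]
        if q[0] == "*":
            s1, s2 = self.label(), self.label()
            return [f"{s1}:", f"{skip} {s2}", *e, f"Jump {s1}", f"{s2}:"]
        s1 = self.label()                   # q[0] == "+"
        return [f"{s1}:", *e, f"{loop} {s1}"]

t = Translator()
star = t.quantify(["Char a"], "*")
lazy_plus = t.quantify(["Char b"], "+?")
alt = t.alternate(["Char a"], ["Char b"])
print("\n".join(star))
```

The only difference between the greedy and the lazy form is which of Try and Defer is used, so the branch that is attempted first is swapped while the labels stay the same.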
A grammar for non-empty regular expressions:
a -> '|'
q -> [+*?]
c -> char
R -> T(aT)*
T -> F+
F -> Aq?
A -> c | '(' R ')'
Transcribed, concise grammar, used for regex recognition and translation:
R -> (cq? | Rq?)+ | '(' R ')' | R(aR)+
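The first grammar (R, T, F, A) maps directly onto a recursive-descent recognizer. The Python sketch below only checks whether a string is derivable from R. It uses recursion for brevity and says nothing about how the translator in the project is written. The function name recognizes is made up, and 'char' is taken to mean any character other than | ( ) + * ?, which is an assumption.

```python
def recognizes(regex):
    """Return True if regex is derivable from R in the grammar above."""
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else ""

    def advance():
        nonlocal pos
        pos += 1

    def parse_R():                      # R -> T(aT)*
        parse_T()
        while peek() == "|":
            advance()
            parse_T()

    def parse_T():                      # T -> F+
        parse_F()
        while peek() not in ("", "|", ")"):
            parse_F()

    def parse_F():                      # F -> Aq?
        parse_A()
        if peek() in ("+", "*", "?"):
            advance()

    def parse_A():                      # A -> c | '(' R ')'
        ch = peek()
        if ch == "(":
            advance()
            parse_R()
            if peek() != ")":
                raise SyntaxError("missing ')'")
            advance()
        elif ch in ("", "|", ")", "+", "*", "?"):
            raise SyntaxError(f"unexpected {ch!r} at position {pos}")
        else:
            advance()                   # c -> any ordinary character

    try:
        parse_R()
    except SyntaxError:
        return False
    return pos == len(regex)            # leftover input, e.g. a stray ')'

print(recognizes("a(b|c)*d"))   # True
print(recognizes("a|"))         # False
print(recognizes("(a"))         # False
```

Since T needs at least one F, the empty string and an empty alternative such as "a|" are rejected, which matches the "non-empty regular expressions" in the heading of the grammar.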
Last edited by nextyamb; May 3rd, 2026 at 05:50 AM.
0605:
- \xhh matches a character with a two digit hex code. Works on the entire regex.
- Possessive quantifiers changed again.
- The [abc\] bug is fixed.
- Added backreference: \1 .. \9
- The error handler has been improved.
- Improved GUI.