-
Jun 5th, 2019, 07:37 AM
#1
Thread Starter
PowerPoster
MyInstr: Skip the contents of the quotes to find a substring
I need to search some substrings frequently in many strings, but I need to skip the contents of quotes (single quotes, double quotes, and back-quotes). For example:
String1 = "The title of the book is 'Harry Potter'. "
String2 = "Harry"
Then, the return value of MyInstr(1, String1, String2, vbTextCompare) should be 0.
Currently, I plan to compare and judge character by character, and I need to consider vbBinaryCompare and vbTextCompare. I wonder if there are some clever and efficient ways to achieve this? Thanks.
VB Code:
Public Function MyInstr(Start, S1, Optional S2, Optional ByVal Cmp As VbCompareMethod, _
    Optional ByVal SkipQuotationContent As Boolean = True) As Long
End Function
Last edited by dreammanor; Jun 11th, 2019 at 05:52 AM.
-
Jun 5th, 2019, 07:57 AM
#2
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Huh?
Why not just:
1) Search your String1 for the (opening) quote. Save the result in S (= Start).
2) Search your String1 for the (closing) quote, starting at S+1. Save the result in E (= End).
3) Replace the string between S and E with BLANK (this includes the quotes).
3a) If you have more "quoted" strings, repeat 1), 2) and 3) until the search for quotes returns 0 (or whatever value signals "not found").
4) Run your If InStr = 0 Then ...
EDIT: To search for the quotes I'd use the C API functions StrCSpnW/StrCSpnIW.
Last edited by Zvoni; Jun 5th, 2019 at 08:04 AM.
Last edited by Zvoni; Tomorrow at 31:69 PM.
----------------------------------------------------------------------------------------
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------------------
People call me crazy because i'm jumping out of perfectly fine airplanes.
---------------------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad
-
Jun 5th, 2019, 08:32 AM
#3
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Hi Zvoni, Your method is similar to my original idea: replacing the contents of the quotes with Chr(0). But I'd like to know if there is a more efficient way.
I'll look up StrCSpnW/StrCSpnIW now. Thank you very much, Zvoni.
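For comparison, the replace-then-search idea could be sketched roughly like this. It is only a minimal sketch: MaskQuoted and InStrMasked are hypothetical names, the filler is Chr(0) of the same length as the quoted run (so positions returned by InStr still refer to the original string), and an unclosed quote is simply left untouched, which is an assumption rather than anything decided in this thread.

```vb
Option Explicit

' Hypothetical helper: returns a copy of S1 in which every quoted run
' (including its quote characters) is overwritten with Chr$(0), so that
' character positions in the copy still line up with the original.
Public Function MaskQuoted(ByVal S1 As String) As String
    Dim pos As Long      ' current scan position
    Dim e As Long        ' position of the matching closing quote
    Dim ch As String

    pos = 1
    Do While pos <= Len(S1)
        ch = Mid$(S1, pos, 1)
        If ch = """" Or ch = "'" Or ch = "`" Then
            ' Look for the same quote character again
            e = InStr(pos + 1, S1, ch, vbBinaryCompare)
            If e = 0 Then Exit Do    ' unclosed quote: leave the rest as is (assumption)
            Mid$(S1, pos, e - pos + 1) = String$(e - pos + 1, vbNullChar)
            pos = e + 1
        Else
            pos = pos + 1
        End If
    Loop
    MaskQuoted = S1
End Function

' Hypothetical wrapper: search the masked copy with the native InStr.
Public Function InStrMasked(ByVal Start As Long, ByVal S1 As String, ByVal S2 As String, _
                            Optional ByVal Cmp As VbCompareMethod = vbBinaryCompare) As Long
    InStrMasked = InStr(Start, MaskQuoted(S1), S2, Cmp)
End Function
```

The masking pass touches every string even when it contains no quotes at all, which is one reason to benchmark it against a scan-as-you-go approach.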
-
Jun 5th, 2019, 11:06 AM
#4
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Here's my kick at it...there are almost definitely bugs since I haven't tested it too thoroughly, but it might get you started (or give you a comparison approach for benchmarking against other approaches):
Code:
Option Explicit

Public Function MyInstr(ByVal Start As Long, _
                        ByVal S1 As String, _
                        ByVal S2 As String, _
                        Optional ByVal Cmp As VBA.VbCompareMethod = vbBinaryCompare, _
                        Optional ByVal SearchQuotedContent As Boolean = False) As Long
    Dim ii As Long
    Dim l1 As Long
    Dim l2 As Long
    Dim l_Char As Integer
    Dim l_InQuote As Integer
    Dim l_QuoteEnd As Long

    l1 = Len(S1)
    If l1 = 0 Then Exit Function ' Can't match empty string
    l2 = Len(S2)
    If l2 = 0 Then Exit Function ' Can't match empty string
    If l1 < l2 Then Exit Function ' Can't find a longer string in a smaller string
    If Start > l1 - l2 + 1 Then Exit Function ' Can't find if start is after end of string1 less the length of string2

    l_QuoteEnd = Start ' Assume everything before Start is in quotes so we don't check it

    If Not SearchQuotedContent Then
        For ii = Start To l1
            l_Char = AscW(Mid$(S1, ii, 1))
            Select Case l_Char
            Case 34, 39, 96 ' ", ', `
                ' Found a quote character
                If l_InQuote Then
                    ' We are already within a quoted block of text
                    If l_InQuote = l_Char Then
                        ' and in a matching quote character
                        ' So close off the quoted content run and remember the starting position of the unquoted run to come
                        l_InQuote = 0
                        l_QuoteEnd = ii + 1
                    End If
                Else
                    ' Entering quote - check previous non-quoted chunk to see if we have a match
                    l_InQuote = l_Char
                    If ii - l_QuoteEnd >= l2 Then
                        ' The previous unquoted run is long enough for a possible match
                        MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, ii - l_QuoteEnd), S2, Cmp)
                        If MyInstr > 0 Then
                            ' We found a match so short-circuit
                            Exit For
                        End If
                    End If
                End If
            End Select
        Next ii
    End If

    If MyInstr = 0 Then
        ' No match so far
        ' Compare against 0: "Not" on an Integer is bitwise, so "Not 39" would still be non-zero (True)
        If l_InQuote = 0 Then
            ' We're not currently in a quoted run at the end of the string, so check the remaining characters
            If l1 - l_QuoteEnd + 1 >= l2 Then
                ' There are enough remaining characters for a possible match
                MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, l1 - l_QuoteEnd + 1), S2, Cmp)
            End If
        End If
    End If

    If MyInstr > 0 Then
        If l_QuoteEnd > 0 Then
            ' Add position of closing quote to the matches starting character position
            MyInstr = MyInstr + l_QuoteEnd - 1
        End If
    End If
End Function

Sub TestSpeed()
    Dim ii As Long
    Dim ll As Long
    Dim d As Double
    ' New_c.HPTimer is assumed to come from a referenced vbRichClient library
    d = New_c.HPTimer
    Do
        ' Start at 1: quote state is only tracked from Start onwards, so a Start
        ' inside the quoted run (e.g. 11) would treat "cool" as unquoted text
        ll = MyInstr(1, "Harry is 'cool'", "cool")
        Debug.Assert ll = 0
        ii = ii + 1
    Loop While New_c.HPTimer - d < 1
    MsgBox ii & " ops/s"
End Sub
I've tried to put reasonable short-circuits in to prevent unnecessary comparisons/processing, but that adds some complexity and I may have missed some edge cases where bugs may lurk.
When the SearchQuotedContent param is True you'd be better off just calling the VB Instr() method directly I think.
Last edited by jpbro; Jun 5th, 2019 at 03:04 PM.
Reason: Fixed a bug
-
Jun 5th, 2019, 11:35 AM
#5
Fanatic Member
Re: (String search algorithm) Skip the contents of the quotes to find a substring
According to the following test case, using vbBinaryCompare is 9 times faster than vbTextCompare:
VB Code:
Option Explicit

Private Sub Form_Load()
    Dim t As Single
    Dim s As String
    Dim i As Long
    Dim pos As Long

    s = String(1000, 65) & "BCD"

    t = Timer
    For i = 1 To 1000000
        pos = InStr(1, s, "BCD", vbTextCompare)
    Next
    Debug.Print "vbTextCompare: " & Timer - t

    t = Timer
    For i = 1 To 1000000
        pos = InStr(1, s, "BCD", vbBinaryCompare)
    Next
    Debug.Print "vbBinaryCompare: " & Timer - t
End Sub
Output:
vbTextCompare: 4.574219
vbBinaryCompare: 0.5273438
-
Jun 5th, 2019, 11:40 AM
#6
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Yikes! Instr with TextCompare is really slow. Check out VBSpeed for an Instr replacement that beats the pants off the native implementation: http://www.xbeat.net/vbspeed/c_InStr.htm
-
Jun 5th, 2019, 12:02 PM
#7
Re: (String search algorithm) Skip the contents of the quotes to find a substring
I would use the InStr method.
If a match is found then check position - 1 and position + length of search string for a '
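A rough sketch of that neighbour check, in case it helps anyone compare approaches. InStrNotQuoted is a hypothetical name, and the sketch only looks at the single character directly before and directly after each hit, so a word in the middle of a longer quoted phrase would still be reported.

```vb
Option Explicit

' Hypothetical helper illustrating the neighbour check: find S2 with InStr,
' then reject the hit if an apostrophe sits directly before or after it.
Public Function InStrNotQuoted(ByVal S1 As String, ByVal S2 As String, _
                               Optional ByVal Cmp As VbCompareMethod = vbBinaryCompare) As Long
    Dim pos As Long
    Dim before As String
    Dim after As String

    If Len(S2) = 0 Then Exit Function

    pos = InStr(1, S1, S2, Cmp)
    Do While pos > 0
        before = vbNullString
        after = vbNullString
        If pos > 1 Then before = Mid$(S1, pos - 1, 1)
        If pos + Len(S2) <= Len(S1) Then after = Mid$(S1, pos + Len(S2), 1)

        If before <> "'" And after <> "'" Then
            ' No apostrophe touching this hit: accept it
            InStrNotQuoted = pos
            Exit Function
        End If

        ' Hit touches an apostrophe: try the next occurrence
        pos = InStr(pos + 1, S1, S2, Cmp)
    Loop
End Function
```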
-
Jun 5th, 2019, 05:36 PM
#8
Re: (String search algorithm) Skip the contents of the quotes to find a substring
TextCompare does a lot of extra work. It is not as simple as a case-insensitive compare; for example, it respects ligatures.
For an English locale:
Code:
MsgBox InStr(1, "Abcœefg", "oe", vbTextCompare)
Displays 4, not 0. The 4th character is a ligature.
-
Jun 7th, 2019, 04:17 AM
#9
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Hi Zvoni, Your method is similar to my original idea: replacing the contents of the quotes with Chr(0). But I'd like to know if there is a more efficient way.
I'm going to find the information of StrCSpnW/StrCSpnIW now, thank you very much, Zvoni.
Don't!
Rather use vbNullString or the classic ""
-
Jun 7th, 2019, 10:17 PM
#10
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Hi jpbro, sorry for the late reply. I tested your code, and it's 3 times faster than my method (replacing the contents of the quotes with Chr(0)). Thank you very much.
-
Jun 7th, 2019, 10:29 PM
#11
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by qvb6
According to the following test case, using vbBinaryCompare is 9 times faster than vbTextCompare:
Thank you, qvb6.
Originally Posted by Arnoutdv
I would use the InStr method.
If a match is found then check position - 1 and position + length of search string for a '
Yes, InStr is the easiest and most effective way. Thank you, Arnoutdv.
Originally Posted by dilettante
TextCompare does a lot of extra work. It is not as simple as a case-insensitive compare, for example it respects ligatures.
For an English locale:
Code:
MsgBox InStr(1, "Abcœefg", "oe", vbTextCompare)
Displays 4, not 0. The 4th character is a ligature.
Thank you, dilettante. For TextCompare, my solution is to convert both S1 and S2 to lowercase, and then compare them with BinaryCompare.
Originally Posted by Zvoni
Don't!
Rather use vbNullString or the classic ""
Thank you, Zvoni. I decided to use jpbro's method, which is three times faster than my method (replacing the contents of the quotes with Chr(0)).
-
Jun 7th, 2019, 10:49 PM
#12
Re: (String search algorithm) Skip the contents of the quotes to find a substring
@DreamManor. One tweak you may want to consider... Test the string for a quote/apostrophe (InStr binary compare) before looping through the string. If most of your strings do not have quotes/apostrophes then that tweak should improve overall speed. If no special characters, then perform the InStr(String1,String2) immediately without looping.
Just a thought and an easy enough test...
Code:
Const VBquote = """"
Const VBapos = "'"
If InStr(1, String1, VBquote, vbBinaryCompare) = 0 Then
    If InStr(1, String1, VBapos, vbBinaryCompare) = 0 Then
        If InStr(1, String1, String2, vbTextCompare) Then
            ' ... match
        Else
            ' ... no match
        End If
        Exit Sub
    End If
End If
' ... do the loop
Swap the two tests around if you expect more strings with apostrophes than those with quotes
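If back-quotes matter as well, the same pre-check could be wrapped in a small helper, something like the sketch below. HasQuoteChar is a hypothetical name, and the order of the three tests is just a guess at which character is most common in the data.

```vb
Option Explicit

' Hypothetical helper: True when S contains a double quote, an apostrophe
' or a back-quote, i.e. when the slower quote-aware scan is actually needed.
Public Function HasQuoteChar(ByRef S As String) As Boolean
    If InStr(1, S, """", vbBinaryCompare) > 0 Then
        HasQuoteChar = True
    ElseIf InStr(1, S, "'", vbBinaryCompare) > 0 Then
        HasQuoteChar = True
    ElseIf InStr(1, S, "`", vbBinaryCompare) > 0 Then
        HasQuoteChar = True
    End If
End Function

' Possible use (MyInstr being whichever quote-aware search you settle on):
'   If HasQuoteChar(String1) Then
'       pos = MyInstr(1, String1, String2, vbTextCompare)
'   Else
'       pos = InStr(1, String1, String2, vbTextCompare)
'   End If
```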
-
Jun 7th, 2019, 10:58 PM
#13
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Thank you, dilettante. For TextCompare, my solution is to convert both S1 and S2 to lowercase, and then compare them with BinaryCompare.
Converting to upper case (UCase$) is faster than converting to lower case.
Code:
Private Type SearchTxtType
    SearchFor As String
    Found As Long
End Type

Private Sub Command1_Click()
    Dim SearchTxt() As SearchTxtType
    Dim i As Long
    ReDim SearchTxt(3)
    SearchTxt(0).SearchFor = "Hi"
    SearchTxt(1).SearchFor = "with"
    SearchTxt(2).SearchFor = "this"
    SearchTxt(3).SearchFor = "'"
    For i = 0 To UBound(SearchTxt)
        ' Third argument is a Boolean: True = case-insensitive count
        SearchTxt(i).Found = CountStringInString(Text1.Text, SearchTxt(i).SearchFor, True)
        Debug.Print SearchTxt(i).SearchFor, SearchTxt(i).Found
    Next
End Sub

Private Sub Form_Load()
    Dim FileNo As Integer
    Dim TempData As String
    FileNo = FreeFile
    Open "E:\Testword.txt" For Input As FileNo
    TempData = Input(LOF(FileNo), FileNo)
    Close FileNo
    Text1.Text = TempData
End Sub

Public Function CountStringInString(Text As String, SearchFor As String, _
                                    Optional CompareAsText As Boolean = False) As Long
    Dim i As Long, j As Long, z As Long
    Dim s As String, s1 As String

    ' An empty search string would make the loop below spin forever
    If Len(SearchFor) = 0 Then Exit Function

    If CompareAsText Then
        ' Upper-case both strings once, then compare binary
        s = UCase$(Text)
        s1 = UCase$(SearchFor)
    Else
        s = Text
        s1 = SearchFor
    End If

    i = 1
    Do
        j = InStr(i, s, s1, vbBinaryCompare)
        If j = 0 Then
            Exit Do
        End If
        i = j + Len(s1)   ' continue after this occurrence
        z = z + 1
    Loop
    CountStringInString = z
End Function
Another option would be to use RegEx to separate the parts in double quotes and leave only the words.
hth
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.
-
Jun 7th, 2019, 11:37 PM
#14
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Very helpful advice, thank you, LaVolpe. I'm currently working on HTML, CSS and JavaScript strings. I not only need to handle quotes (Chr(34), Chr(39), Chr(96)), I also need to handle the comment symbols ("//", "/* ... */"), so I need to use jpbro's method to scan every character.
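To make that requirement concrete, here is a minimal sketch of a search that skips quoted runs, // line comments and /* ... */ block comments. It is not jpbro's code: InStrOutsideQuotesAndComments is a hypothetical name, the skipped spans are simply overwritten with Chr(0) before calling InStr, and several things are assumptions (an unclosed quote or comment runs to the end of the string, escaped quotes such as \" are not recognised, and any "//" counts as a comment, including the one in "http://").

```vb
Option Explicit

' Hypothetical sketch: blank out quoted runs, // line comments and
' /* ... */ block comments with Chr$(0), then let InStr do the search.
Public Function InStrOutsideQuotesAndComments(ByVal Start As Long, ByVal S1 As String, _
        ByVal S2 As String, Optional ByVal Cmp As VbCompareMethod = vbBinaryCompare) As Long
    Dim i As Long       ' current scan position
    Dim n As Long       ' length of S1
    Dim e As Long       ' last position of the span to blank out (0 = nothing to skip)
    Dim ch As String

    n = Len(S1)
    i = 1
    Do While i <= n
        ch = Mid$(S1, i, 1)
        e = 0
        Select Case ch
        Case """", "'", "`"
            e = InStr(i + 1, S1, ch, vbBinaryCompare)      ' matching closing quote
            If e = 0 Then e = n                            ' unclosed: skip to the end (assumption)
        Case "/"
            Select Case Mid$(S1, i + 1, 1)
            Case "/"
                ' Line comment: up to, but not including, the next line feed
                e = InStr(i + 2, S1, vbLf, vbBinaryCompare)
                If e = 0 Then e = n Else e = e - 1
            Case "*"
                ' Block comment: up to and including the closing "*/"
                e = InStr(i + 2, S1, "*/", vbBinaryCompare)
                If e = 0 Then e = n Else e = e + 1
            End Select
        End Select

        If e > 0 Then
            Mid$(S1, i, e - i + 1) = String$(e - i + 1, vbNullChar)
            i = e + 1
        Else
            i = i + 1
        End If
    Loop

    ' Positions are unchanged because the filler has the same length
    InStrOutsideQuotesAndComments = InStr(Start, S1, S2, Cmp)
End Function
```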
-
Jun 7th, 2019, 11:48 PM
#15
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Very helpful advice, thank you, LaVolpe. I'm currently working on HTML, CSS, JavaScript strings. I not only need to judge quotes (Chr(34), Chr(39), Chr(96)), I also need to judge the comment symbols ("//", "/* ... */") , so I need to use jpbro's method to scan every character.
I see. Ignore the following if it doesn't apply...
Not sure how many strings you are talking about, i.e., parsing entire documents? If so, it may be much faster to use an overlay array and loop thru the array elements. The advantages can be significant:
- The array is an overlay; you don't do myArray() = theString. It requires CopyMemory and a SafeArray structure, and the result is that no data is copied, which would otherwise be a speed hit.
- The string characters and the array data share the same binary information. When looping, you would be comparing numbers instead of string characters. Ultimately, you would still use InStr() for the comparison, but loop via the array elements. The speed hit when looping over string characters comes from the temporary strings being created, i.e., Mid$(...), AscW(Mid$(...)), etc.
Code:
For x = 1 To Len(String1)
    If arrInts(x) = 34 Then
    End If
Next
will be faster than
Code:
For x = 1 To Len(String1)
    If AscW(Mid$(String1, x, 1)) = 34 Then
    End If
Next
The usage of arrays requires more work, but can really improve speed when parsing KBs or MBs of text.
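For anyone who hasn't set up such an overlay before, a minimal sketch could look like the following. It assumes 32-bit VB6 (the VarPtrArray declare points at msvbvm60.dll), and SAFEARRAY1D and CountQuoteChars are hypothetical names used only for this illustration. The array must be detached again before it goes out of scope, otherwise VB will try to free memory that belongs to the string.

```vb
Option Explicit

Private Type SAFEARRAY1D
    cDims As Integer
    fFeatures As Integer
    cbElements As Long
    cLocks As Long
    pvData As Long
    cElements As Long
    lLbound As Long
End Type

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)
Private Declare Function VarPtrArray Lib "msvbvm60.dll" Alias "VarPtr" _
    (Ptr() As Any) As Long

' Hypothetical example: count quote characters without creating temporary strings.
Public Function CountQuoteChars(ByRef S As String) As Long
    Dim arrInts() As Integer
    Dim sa As SAFEARRAY1D
    Dim x As Long

    If Len(S) = 0 Then Exit Function      ' nothing to overlay

    With sa
        .cDims = 1
        .cbElements = 2                   ' one 2-byte character per element
        .pvData = StrPtr(S)               ' point at the string's own buffer
        .cElements = Len(S)
        .lLbound = 1                      ' so arrInts(x) lines up with Mid$(S, x, 1)
    End With
    ' Attach: make arrInts use our hand-built descriptor
    CopyMemory ByVal VarPtrArray(arrInts), VarPtr(sa), 4

    For x = 1 To Len(S)
        Select Case arrInts(x)
        Case 34, 39, 96                   ' ", ', `
            CountQuoteChars = CountQuoteChars + 1
        End Select
    Next

    ' Detach before arrInts goes out of scope
    CopyMemory ByVal VarPtrArray(arrInts), 0&, 4
End Function
```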
-
Jun 8th, 2019, 12:25 AM
#16
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by ChrisE
setting to UCase is faster
Thank you, ChrisE.
Originally Posted by ChrisE
another option would be to use Regex to seperate the Parts in double quotes
and leave only the words
hth
I know that RegEx offers better flexibility and extensibility, but RegEx always gives me a headache. If I don't use RegEx for 3 months, I forget all of its rules, and the next time I use it I need to relearn it.
The following is the language syntax definition of Monaco Editor (most of which are RegEx expressions), which is shocking to me:
Code:
// Difficulty: "Nightmare!"
/*
Ruby language definition
Quite a complex language due to elaborate escape sequences
and quoting of literate strings/regular expressions, and
an 'end' keyword that does not always apply to modifiers like until and while,
and a 'do' keyword that sometimes starts a block, but sometimes is part of
another statement (like 'while').
(1) end blocks:
'end' may end declarations like if or until, but sometimes 'if' or 'until'
are modifiers where there is no 'end'. Also, 'do' sometimes starts a block
that is ended by 'end', but sometimes it is part of a 'while', 'for', or 'until'
To do proper brace matching we do some elaborate state manipulation.
some examples:
until bla do
work until tired
list.each do
foo if test
end
end
or
if test
foo (if test then x end)
bar if bla
end
or, how about using class as a property..
class Foo
def endpoint
self.class.endpoint || routes
end
end
(2) quoting:
there are many kinds of strings and escape sequences. But also, one can
start many string-like things as '%qx' where q specifies the kind of string
(like a command, escape expanded, regular expression, symbol etc.), and x is
some character and only another 'x' ends the sequence. Except for brackets
where the closing bracket ends the sequence.. and except for a nested bracket
inside the string like entity. Also, such strings can contain interpolated
ruby expressions again (and span multiple lines). Moreover, expanded
regular expression can also contain comments.
*/
return {
tokenPostfix: '.ruby',
keywords: [
'__LINE__', '__ENCODING__', '__FILE__', 'BEGIN', 'END', 'alias', 'and', 'begin',
'break', 'case', 'class', 'def', 'defined?', 'do', 'else', 'elsif', 'end',
'ensure', 'for', 'false', 'if', 'in', 'module', 'next', 'nil', 'not', 'or', 'redo',
'rescue', 'retry', 'return', 'self', 'super', 'then', 'true', 'undef', 'unless',
'until', 'when', 'while', 'yield',
],
keywordops: [
'::', '..', '...', '?', ':', '=>'
],
builtins: [
'require', 'public', 'private', 'include', 'extend', 'attr_reader',
'protected', 'private_class_method', 'protected_class_method', 'new'
],
// these are closed by 'end' (if, while and until are handled separately)
declarations: [
'module', 'class', 'def', 'case', 'do', 'begin', 'for', 'if', 'while', 'until', 'unless'
],
linedecls: [
'def', 'case', 'do', 'begin', 'for', 'if', 'while', 'until', 'unless'
],
operators: [
'^', '&', '|', '<=>', '==', '===', '!~', '=~', '>', '>=', '<', '<=', '<<', '>>', '+',
'-', '*', '/', '%', '**', '~', '+@', '-@', '[]', '[]=', '`',
'+=', '-=', '*=', '**=', '/=', '^=', '%=', '<<=', '>>=', '&=', '&&=', '||=', '|='
],
brackets: [
{ open: '(', close: ')', token: 'delimiter.parenthesis' },
{ open: '{', close: '}', token: 'delimiter.curly' },
{ open: '[', close: ']', token: 'delimiter.square' }
],
// we include these common regular expressions
symbols: /[=><!~?:&|+\-*\/\^%\.]+/,
// escape sequences
escape: /(?:[abefnrstv\\"'\n\r]|[0-7]{1,3}|x[0-9A-Fa-f]{1,2}|u[0-9A-Fa-f]{4})/,
escapes: /\\(?:C\-(@escape|.)|c(@escape|.)|@escape)/,
decpart: /\d(_?\d)*/,
decimal: /0|@decpart/,
delim: /[^a-zA-Z0-9\s\n\r]/,
heredelim: /(?:\w+|'[^']*'|"[^"]*"|`[^`]*`)/,
regexpctl: /[(){}\[\]\$\^|\-*+?\.]/,
regexpesc: /\\(?:[AzZbBdDfnrstvwWn0\\\/]|@regexpctl|c[A-Z]|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})?/,
// The main tokenizer for our languages
tokenizer: {
// Main entry.
// root.<decl> where decl is the current opening declaration (like 'class')
root: [
// identifiers and keywords
// most complexity here is due to matching 'end' correctly with declarations.
// We distinguish a declaration that comes first on a line, versus declarations further on a line (which are most likey modifiers)
[/^(\s*)([a-z_]\w*[!?=]?)/, ['white',
{
cases: {
'for|until|while': { token: 'keyword.$2', next: '@dodecl.$2' },
'@declarations': { token: 'keyword.$2', next: '@root.$2' },
'end': { token: 'keyword.$S2', next: '@pop' },
'@keywords': 'keyword',
'@builtins': 'predefined',
'@default': 'identifier'
}
}]],
[/[a-z_]\w*[!?=]?/,
{
cases: {
'if|unless|while|until': { token: 'keyword.$0x', next: '@modifier.$0x' },
'for': { token: 'keyword.$2', next: '@dodecl.$2' },
'@linedecls': { token: 'keyword.$0', next: '@root.$0' },
'end': { token: 'keyword.$S2', next: '@pop' },
'@keywords': 'keyword',
'@builtins': 'predefined',
'@default': 'identifier'
}
}],
[/[A-Z][\w]*[!?=]?/, 'constructor.identifier'], // constant
[/\$[\w]*/, 'global.constant'], // global
[/@[\w]*/, 'namespace.instance.identifier'], // instance
[/@@[\w]*/, 'namespace.class.identifier'], // class
// here document
[/<<[-~](@heredelim).*/, { token: 'string.heredoc.delimiter', next: '@heredoc.$1' }],
[/[ \t\r\n]+<<(@heredelim).*/, { token: 'string.heredoc.delimiter', next: '@heredoc.$1' }],
[/^<<(@heredelim).*/, { token: 'string.heredoc.delimiter', next: '@heredoc.$1' }],
// whitespace
{ include: '@whitespace' },
// strings
[/"/, { token: 'string.d.delim', next: '@dstring.d."' }],
[/'/, { token: 'string.sq.delim', next: '@sstring.sq' }],
// % literals. For efficiency, rematch in the 'pstring' state
[/%([rsqxwW]|Q?)/, { token: '@rematch', next: 'pstring' }],
// commands and symbols
[/`/, { token: 'string.x.delim', next: '@dstring.x.`' }],
[/:(\w|[$@])\w*[!?=]?/, 'string.s'],
[/:"/, { token: 'string.s.delim', next: '@dstring.s."' }],
[/:'/, { token: 'string.s.delim', next: '@sstring.s' }],
// regular expressions. Lookahead for a (not escaped) closing forwardslash on the same line
[/\/(?=(\\\/|[^\/\n])+\/)/, { token: 'regexp.delim', next: '@regexp' }],
// delimiters and operators
[/[{}()\[\]]/, '@brackets'],
[/@symbols/, {
cases: {
'@keywordops': 'keyword',
'@operators': 'operator',
'@default': ''
}
}],
[/[;,]/, 'delimiter'],
// numbers
[/0[xX][0-9a-fA-F](_?[0-9a-fA-F])*/, 'number.hex'],
[/0[_oO][0-7](_?[0-7])*/, 'number.octal'],
[/0[bB][01](_?[01])*/, 'number.binary'],
[/0[dD]@decpart/, 'number'],
[/@decimal((\.@decpart)?([eE][\-+]?@decpart)?)/, {
cases: {
'$1': 'number.float',
'@default': 'number'
}
}],
],
// used to not treat a 'do' as a block opener if it occurs on the same
// line as a 'do' statement: 'while|until|for'
// dodecl.<decl> where decl is the declarations started, like 'while'
dodecl: [
[/^/, { token: '', switchTo: '@root.$S2' }], // get out of do-skipping mode on a new line
[/[a-z_]\w*[!?=]?/, {
cases: {
'end': { token: 'keyword.$S2', next: '@pop' }, // end on same line
'do': { token: 'keyword', switchTo: '@root.$S2' }, // do on same line: not an open bracket here
'@linedecls': { token: '@rematch', switchTo: '@root.$S2' }, // other declaration on same line: rematch
'@keywords': 'keyword',
'@builtins': 'predefined',
'@default': 'identifier'
}
}],
{ include: '@root' }
],
// used to prevent potential modifiers ('if|until|while|unless') to match
// with 'end' keywords.
// modifier.<decl>x where decl is the declaration starter, like 'if'
modifier: [
[/^/, '', '@pop'], // it was a modifier: get out of modifier mode on a new line
[/[a-z_]\w*[!?=]?/, {
cases: {
'end': { token: 'keyword.$S2', next: '@pop' }, // end on same line
'then|else|elsif|do': { token: 'keyword', switchTo: '@root.$S2' }, // real declaration and not a modifier
'@linedecls': { token: '@rematch', switchTo: '@root.$S2' }, // other declaration => not a modifier
'@keywords': 'keyword',
'@builtins': 'predefined',
'@default': 'identifier'
}
}],
{ include: '@root' }
],
// single quote strings (also used for symbols)
// sstring.<kind> where kind is 'sq' (single quote) or 's' (symbol)
sstring: [
[/[^\\']+/, 'string.$S2'],
[/\\\\|\\'|\\$/, 'string.$S2.escape'],
[/\\./, 'string.$S2.invalid'],
[/'/, { token: 'string.$S2.delim', next: '@pop' }]
],
// double quoted "string".
// dstring.<kind>.<delim> where kind is 'd' (double quoted), 'x' (command), or 's' (symbol)
// and delim is the ending delimiter (" or `)
dstring: [
[/[^\\`"#]+/, 'string.$S2'],
[/#/, 'string.$S2.escape', '@interpolated'],
[/\\$/, 'string.$S2.escape'],
[/@escapes/, 'string.$S2.escape'],
[/\\./, 'string.$S2.escape.invalid'],
[/[`"]/, {
cases: {
'$#==$S3': { token: 'string.$S2.delim', next: '@pop' },
'@default': 'string.$S2'
}
}]
],
// literal documents
// heredoc.<close> where close is the closing delimiter
heredoc: [
[/^(\s*)(@heredelim)$/, {
cases: {
'$2==$S2': ['string.heredoc', { token: 'string.heredoc.delimiter', next: '@pop' }],
'@default': ['string.heredoc', 'string.heredoc']
}
}],
[/.*/, 'string.heredoc'],
],
// interpolated sequence
interpolated: [
[/\$\w*/, 'global.constant', '@pop'],
[/@\w*/, 'namespace.class.identifier', '@pop'],
[/@@\w*/, 'namespace.instance.identifier', '@pop'],
[/[{]/, { token: 'string.escape.curly', switchTo: '@interpolated_compound' }],
['', '', '@pop'], // just a # is interpreted as a #
],
// any code
interpolated_compound: [
[/[}]/, { token: 'string.escape.curly', next: '@pop' }],
{ include: '@root' },
],
// %r quoted regexp
// pregexp.<open>.<close> where open/close are the open/close delimiter
pregexp: [
{ include: '@whitespace' },
// turns out that you can quote using regex control characters, aargh!
// for example; %r|kgjgaj| is ok (even though | is used for alternation)
// so, we need to match those first
[/[^\(\{\[\\]/, {
cases: {
'$#==$S3': { token: 'regexp.delim', next: '@pop' },
'$#==$S2': { token: 'regexp.delim', next: '@push' }, // nested delimiters are allowed..
'~[)}\\]]': '@brackets.regexp.escape.control',
'~@regexpctl': 'regexp.escape.control',
'@default': 'regexp'
}
}],
{ include: '@regexcontrol' },
],
// We match regular expression quite precisely
regexp: [
{ include: '@regexcontrol' },
[/[^\\\/]/, 'regexp'],
['/[ixmp]*', { token: 'regexp.delim' }, '@pop'],
],
regexcontrol: [
[/(\{)(\d+(?:,\d*)?)(\})/, ['@brackets.regexp.escape.control', 'regexp.escape.control', '@brackets.regexp.escape.control']],
[/(\[)(\^?)/, ['@brackets.regexp.escape.control', { token: 'regexp.escape.control', next: '@regexrange' }]],
[/(\()(\?[:=!])/, ['@brackets.regexp.escape.control', 'regexp.escape.control']],
[/\(\?#/, { token: 'regexp.escape.control', next: '@regexpcomment' }],
[/[()]/, '@brackets.regexp.escape.control'],
[/@regexpctl/, 'regexp.escape.control'],
[/\\$/, 'regexp.escape'],
[/@regexpesc/, 'regexp.escape'],
[/\\\./, 'regexp.invalid'],
[/#/, 'regexp.escape', '@interpolated'],
],
regexrange: [
[/-/, 'regexp.escape.control'],
[/\^/, 'regexp.invalid'],
[/\\$/, 'regexp.escape'],
[/@regexpesc/, 'regexp.escape'],
[/[^\]]/, 'regexp'],
[/\]/, '@brackets.regexp.escape.control', '@pop'],
],
regexpcomment: [
[/[^)]+/, 'comment'],
[/\)/, { token: 'regexp.escape.control', next: '@pop' }]
],
// % quoted strings
// A bit repetitive since we need to often special case the kind of ending delimiter
pstring: [
[/%([qws])\(/, { token: 'string.$1.delim', switchTo: '@qstring.$1.(.)' }],
[/%([qws])\[/, { token: 'string.$1.delim', switchTo: '@qstring.$1.[.]' }],
[/%([qws])\{/, { token: 'string.$1.delim', switchTo: '@qstring.$1.{.}' }],
[/%([qws])</, { token: 'string.$1.delim', switchTo: '@qstring.$1.<.>' }],
[/%([qws])(@delim)/, { token: 'string.$1.delim', switchTo: '@qstring.$1.$2.$2' }],
[/%r\(/, { token: 'regexp.delim', switchTo: '@pregexp.(.)' }],
[/%r\[/, { token: 'regexp.delim', switchTo: '@pregexp.[.]' }],
[/%r\{/, { token: 'regexp.delim', switchTo: '@pregexp.{.}' }],
[/%r</, { token: 'regexp.delim', switchTo: '@pregexp.<.>' }],
[/%r(@delim)/, { token: 'regexp.delim', switchTo: '@pregexp.$1.$1' }],
[/%(x|W|Q?)\(/, { token: 'string.$1.delim', switchTo: '@qqstring.$1.(.)' }],
[/%(x|W|Q?)\[/, { token: 'string.$1.delim', switchTo: '@qqstring.$1.[.]' }],
[/%(x|W|Q?)\{/, { token: 'string.$1.delim', switchTo: '@qqstring.$1.{.}' }],
[/%(x|W|Q?)</, { token: 'string.$1.delim', switchTo: '@qqstring.$1.<.>' }],
[/%(x|W|Q?)(@delim)/, { token: 'string.$1.delim', switchTo: '@qqstring.$1.$2.$2' }],
[/%([rqwsxW]|Q?)./, { token: 'invalid', next: '@pop' }], // recover
[/./, { token: 'invalid', next: '@pop' }], // recover
],
// non-expanded quoted string.
// qstring.<kind>.<open>.<close>
// kind = q|w|s (single quote, array, symbol)
// open = open delimiter
// close = close delimiter
qstring: [
[/\\$/, 'string.$S2.escape'],
[/\\./, 'string.$S2.escape'],
[/./, {
cases: {
'$#==$S4': { token: 'string.$S2.delim', next: '@pop' },
'$#==$S3': { token: 'string.$S2.delim', next: '@push' }, // nested delimiters are allowed..
'@default': 'string.$S2'
}
}],
],
// expanded quoted string.
// qqstring.<kind>.<open>.<close>
// kind = Q|W|x (double quote, array, command)
// open = open delimiter
// close = close delimiter
qqstring: [
[/#/, 'string.$S2.escape', '@interpolated'],
{ include: '@qstring' }
],
// whitespace & comments
whitespace: [
[/[ \t\r\n]+/, ''],
[/^\s*=begin\b/, 'comment', '@comment'],
[/#.*$/, 'comment'],
],
comment: [
[/[^=]+/, 'comment'],
[/^\s*=begin\b/, 'comment.invalid'], // nested comment
[/^\s*=end\b.*/, 'comment', '@pop'],
[/[=]/, 'comment']
],
}
};
Last edited by dreammanor; Jun 8th, 2019 at 12:40 AM.
-
Jun 8th, 2019, 12:44 AM
#17
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by LaVolpe
I see. Ignore the following if it doesn't apply...
Not sure how many strings you are talking about, i.e., parsing entire documents? If so, it may be much faster to use an overlay array and loop thru the array elements. The advantages can be significant:
- The array is an overlay; you don't do myArray()=theString. It requires CopyMemory and SafeArray structures, and the result is that no data is copied, which would otherwise be a speed hit.
- The string characters and array data share the same binary information. You would be comparing numbers rather than string characters when looping. Ultimately, you would still use InStr() for the comparison, but loop over the array data. The speed hit when looping with string characters comes from the temporary strings created by Mid$(...), AscW(Mid$(...)), etc.
Code:
For x = 1 To Len(String1)
If arrInts(x) = 34 Then
End If
Next
will be faster than
Code:
For x = 1 To Len(String1)
If AscW(Mid$(String1, x, 1)) = 34 Then
End If
Next
The usage of arrays requires more work, but can really improve speed when parsing KBs or MBs of text.
Yes, I need to parse the entire document. After I have completed the entire parsing algorithm, I'll try to use CopyMemory and SafeArray structures to further improve the software performance. Thank you very much, LaVolpe.
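A minimal sketch of the overlay LaVolpe describes, assuming 32-bit VB6. The SAFEARRAY1D type and the CountQuoteChars function are illustrative names, not code from this thread:

```vb
Option Explicit

' One-dimensional SAFEARRAY header (32-bit layout).
Private Type SAFEARRAY1D
    cDims As Integer
    fFeatures As Integer
    cbElements As Long
    cLocks As Long
    pvData As Long
    cElements As Long
    lLbound As Long
End Type

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)
Private Declare Function VarPtrArray Lib "msvbvm60.dll" Alias "VarPtr" _
    (Ptr() As Any) As Long

' Counts ", ' and ` characters by reading the string through an Integer array.
Public Function CountQuoteChars(ByRef Text As String) As Long
    Dim sa As SAFEARRAY1D
    Dim arrInts() As Integer
    Dim x As Long

    If Len(Text) = 0 Then Exit Function

    sa.cDims = 1
    sa.cbElements = 2            ' one UTF-16 code unit per element
    sa.pvData = StrPtr(Text)     ' point at the string's own buffer, no copy
    sa.cElements = Len(Text)
    sa.lLbound = 1               ' index like Mid$: first character is element 1

    ' Make arrInts use our header (copies the 4-byte pointer only).
    CopyMemory ByVal VarPtrArray(arrInts), VarPtr(sa), 4

    For x = 1 To sa.cElements
        Select Case arrInts(x)
            Case 34, 39, 96      ' ", ', `
                CountQuoteChars = CountQuoteChars + 1
        End Select
    Next

    ' Detach before arrInts goes out of scope, or VB will try to free the string's memory.
    CopyMemory ByVal VarPtrArray(arrInts), 0&, 4
End Function
```

The important part is the final CopyMemory call: the array never owns the data, so the overlay has to be removed on every exit path.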
-
Jun 8th, 2019, 07:31 AM
#18
Fanatic Member
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
InStrB is slightly faster than InStr(vbBinaryCompare). In the test case in post #5, I got 0.4 Seconds, so it's good for searching for quotes, but the position you get is in Bytes.
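A small sketch of how the byte position can be mapped back to a character position, assuming both arguments are ordinary (UTF-16) VB strings. InStrChars is a hypothetical helper name:

```vb
' Binary search via InStrB that returns a 1-based character position.
Public Function InStrChars(ByVal Start As Long, ByRef S1 As String, ByRef S2 As String) As Long
    Dim bytePos As Long

    If Start < 1 Or Len(S2) = 0 Then Exit Function

    ' Character N begins at byte 2 * N - 1.
    bytePos = InStrB(Start * 2 - 1, S1, S2, vbBinaryCompare)
    Do While bytePos > 0
        If (bytePos And 1) = 1 Then
            ' Odd byte offset: the hit is aligned on a character boundary.
            InStrChars = (bytePos + 1) \ 2
            Exit Function
        End If
        ' Even offset: the hit straddles two characters, so keep looking.
        bytePos = InStrB(bytePos + 1, S1, S2, vbBinaryCompare)
    Loop
End Function
```

The alignment check matters for non-ASCII text, where the low byte of one character and the high byte of the next can happen to match the search bytes.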
-
Jun 8th, 2019, 09:24 AM
#19
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
There is one potential issue in my code that you will need to be aware of, and you may need to fix/change the behaviour. It has to do with setting the Start parameter to a value inside a quoted run of text. This can produce unexpected or undesired results.
Consider the following example:
Code:
Debug.Print MyInstr(2, "'Harry' is 'cool'", "Harry")
That will return 2 (though you may expect 0) because my routine does no back checking to see if it is in a string - it only scans in the forward direction and the scanning has been instructed to begin after the opening apostrophe.
Likewise, you might expect Debug.Print MyInstr(2, "'Harry' is 'cool'", "is") to return 9, but it returns 0.
I don't have a solution for this right now, just wanted to bring it to your attention. I think you'll always have to start the scan at the beginning of the string and only start looking for matches once the Start parameter value has been passed and you are outside a quote block.
Also I agree with LaVolpe that mapping the string to an array (as discussed in an earlier thread of yours) would be a good optimization. I didn't go that far with my example because I wanted to take a quick hack at the logic.
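The "always scan from the beginning" idea can be sketched as a small helper that reports whether a position lies inside a quoted run. OpenQuoteBefore is a hypothetical name, and the helper has not been tested against the full routine:

```vb
' Returns 0 if position Pos starts outside any quoted run,
' otherwise the character code (34, 39 or 96) of the quote that is still open.
Public Function OpenQuoteBefore(ByRef S1 As String, ByVal Pos As Long) As Integer
    Dim ii As Long
    Dim ch As Integer
    Dim inQuote As Integer

    If Pos > Len(S1) + 1 Then Pos = Len(S1) + 1   ' never read past the end

    For ii = 1 To Pos - 1
        ch = AscW(Mid$(S1, ii, 1))
        If inQuote = 0 Then
            ' Outside quotes: any of the three quote characters opens a run.
            If ch = 34 Or ch = 39 Or ch = 96 Then inQuote = ch
        ElseIf ch = inQuote Then
            ' Inside quotes: only the same character closes the run.
            inQuote = 0
        End If
    Next

    OpenQuoteBefore = inQuote
End Function
```

A search routine could call this once when Start > 1 and seed its "in quote" state with the result before scanning forward from Start.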
-
Jun 8th, 2019, 09:49 AM
#20
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
@jpbro & dreammanor. The logic can get more complicated when special characters are not matched/paired,
i.e. "Harry's car is cool"
-
Jun 9th, 2019, 03:31 AM
#21
Thread Starter
PowerPoster
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
Originally Posted by qvb6
InStrB is slightly faster than InStr(vbBinaryCompare). In the test case in post #5, I got 0.4 Seconds, so it's good for searching for quotes, but the position you get is in Bytes.
If InStrB is used, handling Chinese and other Unicode characters becomes complicated, because the position it returns is a byte offset rather than a character index.
Originally Posted by LaVolpe
@jpbro & dreammanor. The logic can get more complicated when special characters are not matched/paired,
i.e. "Harry's car is cool"
Yes, I've made some additions and enhancements to jpbro's code, and the logic of the code has become a bit complicated, but it is still faster than my original method (replacing the contents of the quotes with Chr(0)).
Last edited by dreammanor; Jun 9th, 2019 at 06:55 PM.
-
Jun 9th, 2019, 03:33 AM
#22
Thread Starter
PowerPoster
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
Originally Posted by jpbro
There is one potential issue in my code that you will need to be aware of, and you may need to fix/change the behaviour. It has to do with setting the Start parameter to a value inside a quoted run of text. This can produce unexpected or undesired results.
Consider the following example:
Code:
Debug.Print MyInstr(2, "'Harry' is 'cool'", "Harry")
That will return 2 (though you may expect 0) because my routine does no back checking to see if it is in a string - it only scans in the forward direction and the scanning has been instructed to begin after the opening apostrophe.
Likewise, you might expect Debug.Print MyInstr(2, "'Harry' is 'cool'", "is") to return 9, but it returns 0.
I don't have a solution for this right now, just wanted to bring it to your attention. I think you'll always have to start the scan at the beginning of the string and only start looking for matches once the Start parameter value has been passed and you are outside a quote block.
Also I agree with LaVolpe that mapping the string to an array (as discussed in an earlier thread of yours) would be a good optimization. I didn't go that far with my example because I wanted to take a quick hack at the logic.
Hi jpbro, I modified your code and MyInstr now returns the correct results. As LaVolpe said, the logic has become a bit more complicated, but it is still faster than my original method (replacing the contents of the quotes with Chr(0)). Thank you very much.
Debug.Print MyInstr(2, "'Harry' is 'cool'", "Harry") ==> 0
Debug.Print MyInstr(2, "'Harry' is 'cool'", "is") ==> 9
Debug.Print MyInstr(1, "Harry's car is cool", "Harry") ==> 1
Code:
Public Function MyInstr(ByVal Start As Long, _
ByVal S1 As String, _
ByVal S2 As String, _
Optional ByVal Cmp As VBA.VbCompareMethod = vbBinaryCompare, _
Optional ByVal SearchQuotedContent As Boolean = False) As Long
Dim ii As Long
Dim l1 As Long
Dim l2 As Long
Dim l_Char As Integer
Dim l_InQuote As Integer
Dim l_QuoteEnd As Long
Dim l_FirstChar As Integer
Dim l_S3 As String
Dim l_Pos As Long
l1 = Len(S1)
If l1 = 0 Then Exit Function ' Can't match empty string
l2 = Len(S2)
If l2 = 0 Then Exit Function ' Can't match empty string
If l1 < l2 Then Exit Function ' Can't find a longer string in a smaller string
If Start > l1 - l2 + 1 Then Exit Function ' Can't find if start is after end of string1 less the length of string2
'--- DreamManor Added on 2019-06-08 -------------------------------------------
l_FirstChar = AscW(Left$(S2, 1))
If Cmp <> vbBinaryCompare Then
l_S3 = UCase(S2)
End If
If Start > 1 And Not SearchQuotedContent Then
l_Pos = MyInstr(1, S1, S2, Cmp, SearchQuotedContent)
If l_Pos = 0 Then Exit Function
Do While l_Pos < Start
l_Pos = MyInstr(l_Pos + 1, S1, S2, Cmp, SearchQuotedContent)
If l_Pos = 0 Then Exit Function
Loop
MyInstr = l_Pos
Exit Function
End If
'---------------------------------------------------------------------------------
l_QuoteEnd = Start ' Assume everything before Start is in quotes so we don't check it
If Not SearchQuotedContent Then
For ii = Start To l1
l_Char = AscW(Mid$(S1, ii, 1))
Select Case l_Char
Case 34, 39, 96 ' ", ', `
' Found a quote character
If l_InQuote Then
' We are already within a quoted block of text
If l_InQuote = l_Char Then
' and in a matching quote character
' So close off the quoted content run and remember the starting position of the unquoted run to come
l_InQuote = 0
l_QuoteEnd = ii + 1
End If
Else
' Entering quote - check previous non-quoted chunk to see if we have a match
l_InQuote = l_Char
If ii - l_QuoteEnd >= l2 Then
' The previous unquoted run is long enough for a possible match
MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, ii - l_QuoteEnd), S2, Cmp)
If MyInstr > 0 Then
' We found a match so short-circuit
Exit For
End If
End If
End If
Case l_FirstChar
'--- DreamManor Added on 2019-06-08 -----------------
If l_InQuote = 0 Then
If Cmp = vbBinaryCompare Then
If Mid$(S1, ii, l2) = S2 Then
l_QuoteEnd = 0: MyInstr = ii: Exit For
End If
Else
If UCase(Mid$(S1, ii, l2)) = l_S3 Then
l_QuoteEnd = 0: MyInstr = ii: Exit For
End If
End If
End If
'-------------------------------------------------------
End Select
Next ii
End If
If MyInstr = 0 Then
' No match so far
If l_InQuote = 0 Then
' We're not currently in a quoted run at the end of the string, so check the remaining characters
If l1 - l_QuoteEnd + 1 >= l2 Then
' There are enough remaining characters for a possible match
MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, l1 - l_QuoteEnd + 1), S2, Cmp)
End If
End If
End If
If MyInstr > 0 Then
If l_QuoteEnd > 0 Then
' Add position of closing quote to the matches starting character position
MyInstr = MyInstr + l_QuoteEnd - 1
End If
End If
End Function
Edit:
Sorry, I missed an important parameter, CheckPreviousContent. The corrected code is in post #24.
Last edited by dreammanor; Jun 11th, 2019 at 05:48 AM.
-
Jun 10th, 2019, 01:33 AM
#23
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Hi,
I don't know what the text or text file you want to search looks like; perhaps splitting
the problem into smaller parts is an option.
I'm just guessing here.
I created a text file like this:
Code:
Hi there "The title of the book is 'Harry Potter'. " see some "Harry" movie
start in Cinema "in quotes!", world , "more words" bar
Then use a regex to separate the text in quotes:
Code:
Option Explicit
Private pRegEx As Object
Public Property Get oRegex() As Object
If (pRegEx Is Nothing) Then
Set pRegEx = CreateObject("Vbscript.Regexp")
End If
Set oRegex = pRegEx
End Property
Public Function ReadFile(ByRef Path As String) As String
Dim FileNr As Long
On Error Resume Next
If FileLen(Path) = 0 Then Exit Function
On Error GoTo 0
FileNr = FreeFile
Open Path For Binary As #FileNr
ReadFile = Space$(LOF(FileNr))
Get #FileNr, , ReadFile
Close #FileNr
End Function
Private Sub Command1_Click()
Dim cMatches As Object
Dim m As Object
With oRegex
.Pattern = "\""(.+?)\""" 'get all Text between "..."
.Global = True
.MultiLine = True
Set cMatches = .Execute(ReadFile("E:\zTestq.txt"))
For Each m In cMatches
Debug.Print m
'the output:
'"The title of the book is 'Harry Potter'. "
'"Harry"
'"in quotes!"
'"more words"
Next
End With
Set m = Nothing
Set cMatches = Nothing
End Sub
Private Sub Command2_Click()
Dim cMatches As Object
Dim m As Object
With oRegex
.Pattern = "\""(.+?)\""|\s(\w+)" 'get Text outside double quotes
.Global = True
.MultiLine = True
Set cMatches = .Execute(ReadFile("E:\zTestq.txt"))
For Each m In cMatches
Debug.Print m.submatches(1)
'the output:
'Hi
'there
'
'see
'Some
'
'movie
'start
'in
'Cinema
'
'world
'
'bar
Next
End With
Set m = Nothing
Set cMatches = Nothing
End Sub
Write the output to new files and perform the search/count there.
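As an alternative to writing new files, the same regex idea can blank out the quoted text in memory so that a plain InStr still returns positions in the original string. This is only a sketch that handles double quotes; MaskDoubleQuoted is a hypothetical name:

```vb
' Replaces every "..." run (including the quotes) with spaces of the same length,
' so character positions in the result match the original text.
Public Function MaskDoubleQuoted(ByVal Text As String) As String
    Dim re As Object
    Dim m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """[^""]*"""    ' a quote, any non-quote characters, a quote
    re.Global = True

    For Each m In re.Execute(Text)
        ' FirstIndex is 0-based; the Mid$ statement overwrites in place.
        Mid$(Text, m.FirstIndex + 1, m.Length) = Space$(m.Length)
    Next

    MaskDoubleQuoted = Text
End Function
```

A search then becomes InStr(1, MaskDoubleQuoted(S1), S2, Cmp). Single quotes and back-quotes would need their own patterns, and an unmatched quote is left untouched.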
hth
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.
-
Jun 11th, 2019, 05:24 AM
#24
Thread Starter
PowerPoster
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
Correction to the code in #22: I missed an important parameter, CheckPreviousContent.
Code:
Public Function MyInstr(ByVal Start As Long, _
ByVal S1 As String, _
ByVal S2 As String, _
Optional ByVal Cmp As VBA.VbCompareMethod = vbBinaryCompare, _
Optional ByVal SearchQuotedContent As Boolean = False, _
Optional ByVal CheckPreviousContent As Boolean = True) As Long
Dim ii As Long
Dim l1 As Long
Dim l2 As Long
Dim l_Char As Integer
Dim l_InQuote As Integer
Dim l_QuoteEnd As Long
Dim l_FirstChar As Integer
Dim l_S3 As String
Dim l_Pos As Long
l1 = Len(S1)
If l1 = 0 Then Exit Function ' Can't match empty string
l2 = Len(S2)
If l2 = 0 Then Exit Function ' Can't match empty string
If l1 < l2 Then Exit Function ' Can't find a longer string in a smaller string
If Start > l1 - l2 + 1 Then Exit Function ' Can't find if start is after end of string1 less the length of string2
'--- DreamManor Added on 2019-06-08 -------------------------------------------
l_FirstChar = AscW(Left$(S2, 1))
If Cmp <> vbBinaryCompare Then
l_S3 = UCase(S2)
End If
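If Start > 1 And Not SearchQuotedContent And CheckPreviousContent Then ' the inner calls pass CheckPreviousContent:=False so they do not recurse again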
l_Pos = MyInstr(1, S1, S2, Cmp, SearchQuotedContent)
If l_Pos = 0 Then Exit Function
Do While l_Pos < Start
l_Pos = MyInstr(l_Pos + 1, S1, S2, Cmp, SearchQuotedContent, CheckPreviousContent:= False)
If l_Pos = 0 Then Exit Function
Loop
MyInstr = l_Pos
Exit Function
End If
'---------------------------------------------------------------------------------
l_QuoteEnd = Start ' Assume everything before Start is in quotes so we don't check it
If Not SearchQuotedContent Then
For ii = Start To l1
l_Char = AscW(Mid$(S1, ii, 1))
Select Case l_Char
Case 34, 39, 96 ' ", ', `
' Found a quote character
If l_InQuote Then
' We are already within a quoted block of text
If l_InQuote = l_Char Then
' and in a matching quote character
' So close off the quoted content run and remember the starting position of the unquoted run to come
l_InQuote = 0
l_QuoteEnd = ii + 1
End If
Else
' Entering quote - check previous non-quoted chunk to see if we have a match
l_InQuote = l_Char
If ii - l_QuoteEnd >= l2 Then
' The previous unquoted run is long enough for a possible match
MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, ii - l_QuoteEnd), S2, Cmp)
If MyInstr > 0 Then
' We found a match so short-circuit
Exit For
End If
End If
End If
Case l_FirstChar
'--- DreamManor Added on 2019-06-08 -----------------
If l_InQuote = 0 Then
If Cmp = vbBinaryCompare Then
If Mid$(S1, ii, l2) = S2 Then
l_QuoteEnd = 0: MyInstr = ii: Exit For
End If
Else
If UCase(Mid$(S1, ii, l2)) = l_S3 Then
l_QuoteEnd = 0: MyInstr = ii: Exit For
End If
End If
End If
'-------------------------------------------------------
End Select
Next ii
End If
If MyInstr = 0 Then
' No match so far
If l_InQuote = 0 Then
' We're not currently in a quoted run at the end of the string, so check the remaining characters
If l1 - l_QuoteEnd + 1 >= l2 Then
' There are enough remaining characters for a possible match
MyInstr = InStr(1, Mid$(S1, l_QuoteEnd, l1 - l_QuoteEnd + 1), S2, Cmp)
End If
End If
End If
If MyInstr > 0 Then
If l_QuoteEnd > 0 Then
' Add position of closing quote to the matches starting character position
MyInstr = MyInstr + l_QuoteEnd - 1
End If
End If
End Function
Last edited by dreammanor; Jun 11th, 2019 at 05:48 AM.
-
Jun 11th, 2019, 05:41 AM
#25
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Hi ChrisE, thank you for your code.
I need to search for some characters from HTML, CSS, JavaScript or TypeScript, for example:
(1) Search for the first "{" and the last "}" of the following code block
Code:
/**
* Simple example: search the first `{` and the last `}`
*/
const enum JSONTokenType {
UNKNOWN = 0,
STRING = 1,
LEFT_SQUARE_BRACKET = 2, // [
LEFT_CURLY_BRACKET = 3, // {
RIGHT_SQUARE_BRACKET = 4, // ]
RIGHT_CURLY_BRACKET = 5, // }
COLON = 6, // :
COMMA = 7, // ,
NULL = 8,
TRUE = 9,
FALSE = 10,
NUMBER = 11
}
(2) Search for the start symbol "{" and the end symbol "}" of a TypeScript function body
Code:
/**
* Complex example: search for the left curly brace ("{") of the function code block and the corresponding right curly brace ("}")
*/
function testMatchers<T>(selector: string, matchesName: (names: string[], matcherInput: T) => { return someObject }): MatcherWithPriority<T>[] {
var results = <MatcherWithPriority<T>[]> [];
var tokenizer = newTokenizer(selector);
var token = tokenizer.next();
while (token !== null) {
let priority : -1 | 0 | 1 = 0;
if (token.length === 2 && token.charAt(1) === ':') {
switch (token.charAt(0)) {
case 'R': priority = 1; break;
case 'L': priority = -1; break;
case '{': priority = 1; break; // {
case '}': priority = -1; break; // }
default:
console.log(`Unknown priority ${token} in scope selector`);
}
token = tokenizer.next();
}
let matcher = parseConjunction();
if (matcher) {
results.push({ matcher, priority });
}
if (token !== '}') {
break;
}
token = tokenizer.next();
}
return results;
}
How can the above goals be achieved with regular expressions? Thanks!
-
Jun 13th, 2019, 04:26 AM
#26
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Hi ChrisE, thank you for your code.
How can the above goals be achieved with regular expressions? Thanks!
Regex is the wrong tool for the above. It would work up to a certain point, but you will have to write your own parser for that search.
-
Jun 14th, 2019, 02:43 AM
#27
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by ChrisE
Regex is the wrong tool for the above. It would work up to a certain point, but you will have to write your own parser for that search.
Yes, you are right. Currently, jpbro's approach seems to be the most feasible.
-
Jun 14th, 2019, 02:55 AM
#28
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Yes, you are right. Currently, jpbro's approach seems to be the most feasible.
You'll need a lexer that tokenizes the input for a non-palliative solution. jpbro's approach is pretty unextendable and falls flat with string literals and open/close brackets inside block/line comments, for instance.
IMO you don't need a full language parser, just a lexer to implement keyword/string/number highlighting and/or "match opening/closing bracket" functionality.
Btw, PEG parsers combine lexer/parser (i.e. they don't have a separate lexer) but I'm positive VbPeg can be used to implement a JS/TS lexer that returns an array of (token_type, offset+size) tuples from an input string. It's the nesting of the { } that a JS/TS parser would handle, while the lexer just marks these as OPEN_BRACKET/CLOSE_BRACKET types only, w/ no nesting level tracked.
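To illustrate the point about strings and comments without VbPeg, here is a hand-written sketch of a brace matcher that skips string literals and block/line comments. FindMatchingBrace is a hypothetical name. It is not a real JS/TS lexer: regex literals and template expressions that themselves contain back-quotes are not handled.

```vb
' Given the position of a "{" in Src, returns the position of its matching "}",
' or 0 if there is none. Strings and comments are skipped.
Public Function FindMatchingBrace(ByRef Src As String, ByVal OpenPos As Long) As Long
    Dim ii As Long, n As Long, depth As Long
    Dim ch As String, nxt As String, q As String

    n = Len(Src)
    If OpenPos < 1 Or OpenPos > n Then Exit Function
    If Mid$(Src, OpenPos, 1) <> "{" Then Exit Function

    ii = OpenPos
    Do While ii <= n
        ch = Mid$(Src, ii, 1)
        nxt = Mid$(Src, ii + 1, 1)
        If ch = "/" And nxt = "/" Then
            ' Line comment: jump to the end of the line.
            ii = InStr(ii, Src, vbLf)
            If ii = 0 Then Exit Function
        ElseIf ch = "/" And nxt = "*" Then
            ' Block comment: jump to the closing "*/".
            ii = InStr(ii + 2, Src, "*/")
            If ii = 0 Then Exit Function
            ii = ii + 1                   ' now on the closing "/"
        ElseIf ch = """" Or ch = "'" Or ch = "`" Then
            ' String literal: skip to the matching quote, honouring "\" escapes.
            q = ch
            ii = ii + 1
            Do While ii <= n
                ch = Mid$(Src, ii, 1)
                If ch = "\" Then
                    ii = ii + 1           ' skip the escaped character
                ElseIf ch = q Then
                    Exit Do
                End If
                ii = ii + 1
            Loop
        ElseIf ch = "{" Then
            depth = depth + 1
        ElseIf ch = "}" Then
            depth = depth - 1
            If depth = 0 Then
                FindMatchingBrace = ii
                Exit Function
            End If
        End If
        ii = ii + 1
    Loop
End Function
```

Unlike a lexer that only emits tokens, this sketch tracks the nesting depth itself, which is the part a parser would normally own.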
cheers,
</wqw>
-
Jun 14th, 2019, 05:19 AM
#29
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Yes, you are right. Currently, jpbro's approach seems to be the most feasible.
I tried your two samples for the search out of interest.
Here are the results. You'll see that the first seems to work correctly, but the second doesn't,
so with regex it 'kinda' works a little.
Code:
Private Sub Command3_Click()
Dim cMatches As Object
Dim m As Object
With oRegex
'get first { and ignore any closing } brackets in between
'go to the last closing bracket }
.Pattern = "\{[^()]*\}*"
.Global = True
.MultiLine = True
Set cMatches = .Execute(ReadFile("E:\zSearch.txt"))
For Each m In cMatches
Debug.Print m
''output from Textfile zSearch.txt:
'{
' UNKNOWN = 0,
' STRING = 1,
' LEFT_SQUARE_BRACKET = 2, // [
' LEFT_CURLY_BRACKET = 3, // {
' RIGHT_SQUARE_BRACKET = 4, // ]
' RIGHT_CURLY_BRACKET = 5, // }
' COLON = 6, // :
' COMMA = 7, // ,
' NULL = 8,
' TRUE = 9,
' FALSE = 10,
' Number = 11
'}
'output other textfile zSearch2.txt:
'{ return someObject }
'{
' var results = <MatcherWithPriority<T>[]> [];
' Var tokenizer = newTokenizer
'{
' let priority : -1 | 0 | 1 = 0;
' if
'{
' Switch
'{
' case 'R': priority = 1; break;
' case 'L': priority = -1; break;
' case '{': priority = 1; break; // {
' case '}': priority = -1; break; // }
'default:
' console.Log
'{token} in scope selector`
'{
' results.push
'{ matcher, priority }
'{
' break;
' }
' token = tokenizer.Next
Next
End With
Set m = Nothing
Set cMatches = Nothing
End Sub
Last edited by ChrisE; Jun 14th, 2019 at 05:49 AM.
-
Jun 14th, 2019, 08:00 AM
#30
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by wqweto
You'll need a lexer that tokenizes the input for a non-palliative solution. jpbro's approach is pretty unextendable and falls flat with string literals and open/close brackets inside block/line comments, for instance.
Agreed - my approach was only intended as a response to the original question for an InStr replacement that ignores text within various "quotes". Even then it was only posted as a nudge in a possible direction as I wrote it in a few minutes and didn't test it much at all. So anyone using it please beware - it's not polished/production-ready code! If the ultimate need is for a lexer, then my approach is not appropriate.
-
Jun 15th, 2019, 10:29 PM
#31
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by jpbro
Agreed - my approach was only intended as a response to the original question for an InStr replacement that ignores text within various "quotes". Even then it was only posted as a nudge in a possible direction as I wrote it in a few minutes and didn't test it much at all. So anyone using it please beware - it's not polished/production-ready code! If the ultimate need is for a lexer, then my approach is not appropriate.
Yes, I need not only a lexer but also a full-language parser. But your MyInstr code is still very valuable to me. I'll improve it further and develop MySplit based on it; these functions can be used to search for strings in HTML and CSS. Thank you, jpbro.
-
Jun 15th, 2019, 10:30 PM
#32
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by ChrisE
I tried your two samples for the search out of interest.
Here are the results. You'll see that the first seems to work correctly, but the second doesn't,
so with regex it 'kinda' works a little.
Thank you, ChrisE. Is it possible to accomplish some very complex search logic with multiple RegExp patterns?
-
Jun 15th, 2019, 10:45 PM
#33
Thread Starter
PowerPoster
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by wqweto
You'll need a lexer that tokenizes the input for a non-palliative solution. jpbro's approach is pretty unextendable and falls flat with string literals and open/close brackets inside block/line comments, for instance.
IMO you don't need a full language parser, just a lexer to implement keyword/string/number highlighting and/or "match opening/closing bracket" functionality.
Btw, PEG parsers combine lexer/parser (i.e. they don't have a separate lexer) but I'm positive VbPeg can be used to implement a JS/TS lexer that returns an array of (token_type, offset+size) tuples from an input string. It's the nesting of the { } that a JS/TS parser would handle, while the lexer just marks these as OPEN_BRACKET/CLOSE_BRACKET types only, w/ no nesting level tracked.
cheers,
</wqw>
Hi wqweto, I've been learning about PEG for a few days, but obviously I still need to spend more time studying. Could you explain the technical difference between your VbPEG and Gold Parser and PEG.js? Thank you.
In addition, I'd like to know if VbPEG can convert between different languages. If you could demonstrate how to convert a small piece of kscope code into VB code, that would be great.
Edit:
When I execute "VbPeg.exe VbPeg.peg -tree" and "VbPeg.exe VbPeg.peg -ir", the result displayed in the console is the content of cParser.cls.
Last edited by dreammanor; Jun 15th, 2019 at 10:51 PM.
-
Jun 16th, 2019, 12:01 AM
#34
Re: (String search algorithm) Skip the contents of the quotes to find a substring
Originally Posted by dreammanor
Thank you, ChrisE. Is it possible to accomplish some very complex search logic with multiple RegExp patterns?
Like I said, it 'kinda' works. Take the advice from wqweto:
regex is the wrong tool for this.
-
Jun 16th, 2019, 01:57 AM
#35
Addicted Member
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
Originally Posted by dreammanor
Correction to the code in #22: I missed an important parameter, CheckPreviousContent.
This code has a problem, and I don't know if it has a solution. You can try:
Code:
Debug.Print MyInstr(1, "Mark O'Brian, Tim O'Sullivan", "Tim")
Returns 0 but I think the correct answer is 15
-
Jun 16th, 2019, 06:57 AM
#36
Thread Starter
PowerPoster
Re: [RESOLVED] (String search algorithm) Skip the contents of the quotes to find a su
Originally Posted by gilman
This code has a problem, and I don't know if it has a solution. You can try:
Code:
Debug.Print MyInstr(1, "Mark O'Brian, Tim O'Sullivan", "Tim")
Returns 0 but I think the correct answer is 15
Hi Gilman, the correct return value should be 0, because the two apostrophes are treated as a quoted block and "Tim" lies inside it.
In addition, you can add handling for the escape character ("\") to MyInstr. In that case, if you want "Tim" to be found, you can put an escape character ("\") to the left of each single quote.
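A minimal sketch of such an escape check, assuming a backslash escapes the character that follows it. IsEscaped is a hypothetical helper and is not part of the MyInstr code posted above:

```vb
' True if the character at Pos is preceded by an odd number of backslashes.
Public Function IsEscaped(ByRef S1 As String, ByVal Pos As Long) As Boolean
    Dim n As Long

    ' Count the consecutive backslashes immediately to the left of Pos.
    Do While Pos - n > 1
        If Mid$(S1, Pos - n - 1, 1) <> "\" Then Exit Do
        n = n + 1
    Loop

    ' An even count means the backslashes escape each other, not the character at Pos.
    IsEscaped = (n And 1) = 1
End Function
```

In MyInstr, the body of the `Case 34, 39, 96` branch could be wrapped in `If Not IsEscaped(S1, ii) Then ... End If` so that an escaped quote neither opens nor closes a quoted run. Note that the added backslashes shift the character positions in the string being searched.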
Last edited by dreammanor; Jun 16th, 2019 at 07:08 AM.