I can't do LCase$ to an array Besides doing LCase$ to data that might be even 10 MB or more... you get the picture. It is super slow. That is probably what InStr does in TextCompare and that's the reason it is so slow.
Also, arrays must be passed ByRef. They can't be passed ByVal. ByRef is also faster, because there is no need to make a copy of the variable.
And for the last words, doing binary comparison is super fast. Also... my current version of InBArr is almost as fast in TextCompare as it is in BinaryCompare. I've used a few tricks in the code to strip down unrequired calculations. Doing fast code means that the code must be long as well.
This is a boosted version using the Boyer-Moore string searching algorithm. It doesn't work as fast in all cases, but it beats InStr very well under many circumtances. It is perfect for searching a lot of text with a longer keyword (suprisingly, the longer the keyword, the faster it searches [except if there is a lot of content like the keyword]). It doesn't need to make as much work as InStr which works with brute force logic.
It isn't very hard to change it to count strings: just instead of exit function you count up and continue from where you were.
Oh, and due to the nature of Boyer-Moore, there is a keyword optimization. It checks if the keyword has changed or not and if the settings have changed. I used a pretty weird approach, it is within a Do Loop which isn't actually a loop. I noticed that if I used Goto, the function slowed down for some reason.
Good stuff, I was about to post my results which were very unfavourable when run from the vbProject window, but when run from the compiled optimized exe it's very quick, good work! I've given you 5 stars on PSC!! I had to substitute your timer module for a timer class I've got as I keep getting a "calculation too complex" error with yours.
I notice too that you've changed it to return the same position as InStr() which I think is a good thing
Cheers Al
Last edited by aconybeare; Jul 18th, 2005 at 04:22 AM.
There is only a code that changes InStr and InBArr values to be identical in all cases, so yuu can easily compare the results with eachother. I give you a few examples why InBArr gives a different position value:
VB Code:
Dim Temp As String
' set a string
Temp = "abc"
' return the letter b if it exists in the string (we get an error if it doesn't)
MsgBox Mid$(Temp, InStr(Temp, "b", 1))
This code returns a string with text B in it.
VB Code:
Dim Temp() As Byte, PosValue As Long
Dim Temp2() As Byte
' the array will be filled with UTF-16 character data (2 bytes/character)
Temp = "abc"
' get position in array
PosValue = InBArrBM(Temp, "b")
' reserve enough memory for another temporary array
ReDim Temp2(1)
' fill the character into this array
Temp2(0) = Temp(PosValue)
Temp2(1) = Temp(PosValue + 1)
' convert the array to a string and display in a messagebox
MsgBox CStr(Temp2)
This isn't the best possible example, but it clearly shows why it doesn't give the same value than InStr. I didn't code a support for Option Base 1 because that would be a killer to the performance. Besides handling Option Base 0 (default) is much easier for a programmer as well once he gets used to it.
InStr is very slow especially when you use TextCompare. InStr doesn't even use any special technique to perform the search, it does it the hard brute force way (ie. checks every single byte). So that is why I coded something that is faster than InStr in many normal search situatations; and knowing InStr is most likely coded in ASM, it isn't a bad result. Though the main aim with my function was to provide a search function for byte arrays; beating InStr was a secondary goal.
Btw, there are some people who have made ThunderVB, which allows adding ASM code into a VB6 app. Afaik they've made InStr replacement using Boyer-Moore, which should beat the native InStr in about every possible case.
Let me know if you can improve upon what's going on here.
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
The codes seem to lack Unicode support which gives invalid results when some special character languages are used. They also lack TextCompare support. Full codes would be nice so I didn't need to guess all the variables and API declarations used out-of-the-function.
The codes seem to lack Unicode support which gives invalid results when some special character languages are used. They also lack TextCompare support. Full codes would be nice so I didn't need to guess all the variables and API declarations used out-of-the-function.
Wrong thread?
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
Well, if you're planning to make a proper InStrCount, you need to have Unicode and TextCompare support. Both your and penagate's codes there seem to check for every other byte. penagate's code also crashes if the keyword is longer than one character.
What is the real goal with the functions? They seem more like an ANSI character counters than functions looking for the number of strings within a string.
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
Hmm, so which code snippet I should actually be looking at? The one you directly linked to or the ones that were optimized? I started from the end of the thread.
Public Function Sisic4(pStr As Long, pFind As Long, lenStr As Long, lenFind As Long) As Long
Dim i As Long
Dim j As Long
Dim Flag As Long
CopyMemory BufStr(0), ByVal pStr, lenStr
CopyMemory BufFind(0), ByVal pFind, lenFind
i = lenStr - 1
If lenFind = 2 Then
Do Until i < lenFind
If BufStr(i - 1) = BufFind(0) Then
Sisic4 = Sisic4 + 1
End If
i = i - 2
Loop
Else
Do Until i < lenFind
Flag = 0
j = lenFind - 1
Do Until j < 0
If Not (BufStr(i - (lenFind - j)) = BufFind(j - 1)) Then
Flag = -1
Exit Do
End If
j = j - 2
Loop
If Flag = 0 Then
Sisic4 = Sisic4 + 1
End If
i = i - 2
Loop
End If
End Function
Here's the stable version of my effort: penegate has some concerns over my final effort.
You will need to define arrays outside of the function. If you change the '-2' to '-1' you should have a unicode compliant function.
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
BTW: SISIC stands for "String In String Instance Count"
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
Your function didn't find "Now" as a keyword. It also fails with longer strings, because the byte array is always 512 bytes long... guess I'll increase the length manually. But that really isn't a great way to do it, imo.
As for whether or not it's a good or bad way? Depends on how you write your code, I guess.
I nearly always allocate 2k of heap for a process at application start up for stuff like this, so it makes no odds about the arrays (I copy the string to the heap and modify the pvData in the SAFEARRAY descriptor to the right place on the heap (as well as modifying other parts of the descriptor)
In terms of efficiency, both peneage, and myself believe that this is probably the fastest way of doing the job.
As you know speed nearly always compromises readibility
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
Here you have something that can take a very big string and still be fast (the ones by you and penagate keep getting slower and slower with bigger strings much more easily):
Code:
Option Explicit
Private Declare Sub RtlMoveMemory Lib "ntdll.dll" (ByRef lpvDest As Any, ByRef lpvSrc As Any, ByVal cbLen As Long)
'Private Declare Function VarPtrArray Lib "msvbvm50.dll" Alias "VarPtr" (Var() As Any) As Long
Private Declare Function VarPtrArray Lib "msvbvm60.dll" Alias "VarPtr" (Var() As Any) As Long
Private BufStrHeader(5) As Long
Private BufFindHeader(5) As Long
Private BufStr() As Integer
Private BufFind() As Integer
Private OldStr As Long
Private OldFind As Long
Public Sub SisicInitialize()
BufStrHeader(0) = 1
BufStrHeader(1) = 2
BufStrHeader(4) = &H7FFFFFFF
BufFindHeader(0) = 1
BufFindHeader(1) = 2
BufFindHeader(4) = &H7FFFFFFF
OldStr = 0
OldFind = 0
End Sub
Public Sub SisicTerminate()
RtlMoveMemory ByVal VarPtrArray(BufStr), 0&, 4
RtlMoveMemory ByVal VarPtrArray(BufFind), 0&, 4
End Sub
Public Function SisicM(pStr As Long, pFind As Long, lenStr As Long, lenFind As Long) As Long
Dim i As Long
Dim j As Long
Dim k As Long
Dim l As Long
Dim Flag As Long
If OldStr <> pStr Then
BufStrHeader(3) = pStr
RtlMoveMemory ByVal VarPtrArray(BufStr), VarPtr(BufStrHeader(0)), 4
OldStr = pStr
End If
If OldFind <> pFind Then
BufFindHeader(3) = pFind
RtlMoveMemory ByVal VarPtrArray(BufFind), VarPtr(BufFindHeader(0)), 4
OldFind = pFind
End If
If lenFind = 1 Then
j = BufFind(0)
For i = lenStr - 1 To 0 Step -1
k = BufStr(i)
If k = j Then SisicM = SisicM + 1
Next i
Else
lenFind = lenFind - 1
For i = lenStr - 1 To lenFind Step -1
For j = lenFind To 0 Step -1
k = BufFind(j)
l = BufStr(i - (lenFind - j))
If Not (k = l) Then Flag = 1: Exit For
Next j
If Flag = 0 Then SisicM = SisicM + 1 Else Flag = 0
Next i
End If
End Function
Here you have something that can take a very big string and still be fast (the ones by you and penagate keep getting slower and slower with bigger strings much more easily):
Code:
Option Explicit
Private Declare Sub RtlMoveMemory Lib "ntdll.dll" (ByRef lpvDest As Any, ByRef lpvSrc As Any, ByVal cbLen As Long)
'Private Declare Function VarPtrArray Lib "msvbvm50.dll" Alias "VarPtr" (Var() As Any) As Long
Private Declare Function VarPtrArray Lib "msvbvm60.dll" Alias "VarPtr" (Var() As Any) As Long
Private BufStrHeader(5) As Long
Private BufFindHeader(5) As Long
Private BufStr() As Integer
Private BufFind() As Integer
Private OldStr As Long
Private OldFind As Long
Public Sub SisicInitialize()
BufStrHeader(0) = 1
BufStrHeader(1) = 2
BufStrHeader(4) = &H7FFFFFFF
BufFindHeader(0) = 1
BufFindHeader(1) = 2
BufFindHeader(4) = &H7FFFFFFF
OldStr = 0
OldFind = 0
End Sub
Public Sub SisicTerminate()
RtlMoveMemory ByVal VarPtrArray(BufStr), 0&, 4
RtlMoveMemory ByVal VarPtrArray(BufFind), 0&, 4
End Sub
Public Function SisicM(pStr As Long, pFind As Long, lenStr As Long, lenFind As Long) As Long
Dim i As Long
Dim j As Long
Dim k As Long
Dim l As Long
Dim Flag As Long
If OldStr <> pStr Then
BufStrHeader(3) = pStr
RtlMoveMemory ByVal VarPtrArray(BufStr), VarPtr(BufStrHeader(0)), 4
OldStr = pStr
End If
If OldFind <> pFind Then
BufFindHeader(3) = pFind
RtlMoveMemory ByVal VarPtrArray(BufFind), VarPtr(BufFindHeader(0)), 4
OldFind = pFind
End If
If lenFind = 1 Then
j = BufFind(0)
For i = lenStr - 1 To 0 Step -1
k = BufStr(i)
If k = j Then SisicM = SisicM + 1
Next i
Else
lenFind = lenFind - 1
For i = lenStr - 1 To lenFind Step -1
For j = lenFind To 0 Step -1
k = BufFind(j)
l = BufStr(i - (lenFind - j))
If Not (k = l) Then Flag = 1: Exit For
Next j
If Flag = 0 Then SisicM = SisicM + 1 Else Flag = 0
Next i
End If
End Function
I think you'll find that your algorithm (which is almost identical to penegates, and mine) has just the same linear performance problem associated with longer strings.
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
No, because your versions used CopyMemory to copy the whole string in memory from place A to B. This doesn't copy anything, it just moves an array starting point to point to the string and start working from there. If you think about it: is it first faster to copy 50 MB of string data to byte array or start directly handling the 50 MB?
The algorithm itself was almost identical because I based it on your code. I just wanted to get it working. The next step would be to apply some nice optimizations and probably try adding Boyer-Moore and a support for TextCompare.
Hmmm. Pretty good stuff.
I wouldn't use the Boyer-Moore algorithm, though. This is good for long strings - especially ones that have multiple searches applied.
But for short ones which are only going to be searched once, then the creation of the distance table make the use of the algorithm inefficient.
I suppose you could optimise further by having some sort of threshold where the Boyer-Moore would take over . . .
(BTW - Nice touch avoiding the copy!)
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
Boyer-Moore for keywords longer than four or five characters might be ok. I also made quickly a InBArrBMCount (ridiculous function name, but oh well) and it gets faster in keywords of that length.
Edit Oh... and Boyer-Moore for TextCompare is all good starting from three characters long keywords, I guess you can figure out why.
All you need to do now is to slot the algorithm into a fully inherited object hierarchy and you too can lose all of that perfomance
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
I think someone's already tried that. If I recall correctly they called it MSIL . . .
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein
I love power failures. They have the magical touch of striking right before you have saved the last major changes you've done. Looks like I won't work out Boyer-Moore after all, doing it the second time today would be bothersome.