|
-
Mar 3rd, 2006, 06:07 PM
#1
Thread Starter
Fanatic Member
Need ideas for a string function optimization
I have a function for finding the longest matching substring in two string that is in need of some help. If you pass it two different strings, it will return the longest string fragment they have in common. I.e., if you had these two strings,
"This is a test string."
"What is a test object."
it would return " is a test ". Usually calls to it will be much more complex . A good baseline call would be a string with 2000 characters, and the same string with 10 arbitrary inserts and 10 arbitrary deletes. This takes an average of about 5 seconds on my machine. Currently, my code is spending about 99% percent of processing time in this function, so it's in obvious need of any optimizations I can come up with. I tried API functions to replace Mid$, but they were a bit slower--I also might try using byte arrays.
Here's the main function:
VB Code:
Private Function SeekMatch(ByVal sOne As String, ByVal sTwo As String) As String
'This function finds the longest substring of sTwo that exists in sOne, and returns it. No matches
'returns a vbNullString.
Dim sTest As String, lSize As Long, lStart As Long, lResult As Long, lLength As Long
On Error GoTo ErrHand
lLength = Len(sTwo)
For lSize = lLength - 1 To 1 Step -1 'Loop through all posible substring lengths, descending.
For lStart = lLength - lSize To 1 Step -1 'Loop through all possible substring positions.
sTest = Mid$(sTwo, lStart, lSize) 'Get the substring from the Buffer string.
If InStrB(1, sOne, sTest) <> 0 Then 'If there is a match,
SeekMatch = sTest 'Return and exit.
Exit Function
End If
Next lStart
Next lSize
ErrHand:
If Err.Number <> 0 Then
Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
"in SeekMatch.", vbCritical)
End If
End Function
Here are a couple test functions I use to run benchmarks:
VB Code:
Private Declare Function GetTickCount& Lib "kernel32" ()
Private Sub BenchMark()
Dim sString1 As String, sString2 As String, lStart As Long, lEnd As Long, sOut As String
sString1 = RandomString(2000)
sString2 = RandomChanges(sString1, 10, 10, 5)
Debug.Print sString1
Debug.Print sString2
lStart = GetTickCount
sOut = SeekMatch(sString1, sString2)
lEnd = GetTickCount
Debug.Print sOut
Debug.Print lEnd - lStart
End Sub
Private Function RandomString(lChars As Long) As String
Dim lCount As Long, bTemp() As Byte
ReDim bTemp(lChars)
For lCount = LBound(bTemp) To UBound(bTemp)
bTemp(lCount) = Rand(33, 126)
Next lCount
RandomString = StrConv(bTemp, vbUnicode)
End Function
Private Function RandomChanges(ByVal sString As String, lInserts As Long, lDeletes As Long, _
Optional lBoundry As Long = 10) As String
Dim lCount As Long, sOut As String, lStart As Long, lEnd As Long, sInsert As String, lLen As Long
Dim sLeft As String, sRight As String, lTemp As Long
For lCount = 1 To lInserts
lLen = Len(sString)
lStart = Rand(1, lLen)
sInsert = RandomString(Rand(1, lBoundry))
If lStart = 0 Then
sString = sInsert & sString
ElseIf lStart = lLen Then
sString = sString & sInsert
Else
sLeft = Left$(sString, lStart)
sRight = Right$(sString, lLen - lStart)
sString = sLeft & sInsert & sRight
End If
Next lCount
For lCount = 1 To lDeletes
lLen = Len(sString)
lStart = Rand(1, lLen)
If lStart + lBoundry > lLen Then
lEnd = lLen
Else
lEnd = Rand(lStart, lStart + lBoundry)
End If
If lStart = 1 And lEnd = lLen Then
sString = vbNullString
Exit For
End If
If lEnd = lLen Then
sString = Left$(sString, lStart - 1)
ElseIf lStart = 1 Then
sString = Right$(sString, lLen - lEnd)
Else
sLeft = Left$(sString, lStart - 1)
sRight = Right$(sString, lLen - lEnd)
sString = sLeft & sRight
End If
Next lCount
RandomChanges = sString
End Function
Private Function Rand(lLow As Long, lHigh As Long) As Long
Rand = Int((lHigh - lLow + 1) * Rnd + lLow)
End Function
-
Mar 3rd, 2006, 07:02 PM
#2
Hyperactive Member
Re: Need ideas for a string function optimization
The following tested faster on longer (2000-character) strings, but slower on shorter ones:
VB Code:
Private Function MySeekMatch(ByVal sOne As String, ByVal sTwo As String) As String
'This function finds the longest substring of sTwo that exists in sOne, and returns it. No matches
'returns a vbNullString.
Dim iLongest As String 'Index into sTwo of longest match found.
Dim lLongest As Long 'Length of longest match found.
Dim i As Long
Dim j As Long
Dim k As Long
On Error GoTo ErrHand
iLongest = 0
lLongest = 0
i = 1
Do Until i > (Len(sOne) - lLongest)
j = 1
Do Until j > (Len(sTwo) - lLongest)
'Only look for substrings longer than the current longest substring.
For k = lLongest + 1 To Len(sTwo) - j + 1
If Mid$(sOne, i, k) <> Mid$(sTwo, j, k) Then Exit For
iLongest = j
lLongest = k
Next
j = j + 1
Loop
i = i + 1
Loop
If lLongest > 0 Then MySeekMatch = Mid$(sTwo, iLongest, lLongest)
ErrHand:
If Err.Number <> 0 Then
Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
"in SeekMatch.", vbCritical)
End If
End Function
But you're right, using byte arrays would probably yield a significant increase in speed.
Somewhere, I have Sedgewick's book of algorithms...
Last edited by bpd; Mar 3rd, 2006 at 07:09 PM.
-
Mar 3rd, 2006, 07:21 PM
#3
Hyperactive Member
Re: Need ideas for a string function optimization
BTW, your function does not appear to be correct in the case where the largest substring is found at the end of the strings. For example, if you give it two identical, 10-character strings, your function returns only the first 9 characters as the match.
-
Mar 3rd, 2006, 08:29 PM
#4
Thread Starter
Fanatic Member
Re: Need ideas for a string function optimization
 Originally Posted by bpd
BTW, your function does not appear to be correct in the case where the largest substring is found at the end of the strings. For example, if you give it two identical, 10-character strings, your function returns only the first 9 characters as the match.
No, that's intentional. The current implementation strips all matching characters from the right and the left before the function is called.
-
Mar 3rd, 2006, 10:46 PM
#5
Thread Starter
Fanatic Member
Re: Need ideas for a string function optimization
OK, byte array searches are several orders of magnitude faster, but I can't get them to always be as accurate. Ultimately it won't matter, because the result it returns is always a subset of the truest long match, so I can fix it outside the function. I can change code elsewhere to account for the loss in accuracy in exchange for a hundred fold speed increase .
VB Code:
Private Function SeekMatchB(ByVal sOne As String, ByVal sTwo As String) As String
'This function finds the longest substring of sTwo that exists in sOne, and returns it. No matches
'returns a vbNullString.
Dim sTest As String, lSize As Long, lSeek As Long, lResult As Long, lBoundTwo As Long, lPos As Long
Dim bOne() As Byte, bTwo() As Byte, lStart As Long, lLength As Long, lBoundOne As Long
Dim lFloor As Long
On Error GoTo ErrHand
bOne = StrConv(sOne, vbFromUnicode) 'Write sOne into an array.
bTwo = StrConv(sTwo, vbFromUnicode) 'Write sTwo into an array.
lBoundTwo = UBound(bTwo) 'Store the UBound of bTwo.
lBoundOne = UBound(bOne) 'Store the UBound of bOne.
lFloor = LBound(bTwo) 'Both arrays LBound will be the same.
For lSeek = lFloor To lBoundTwo
lPos = lFloor 'Reset the string one seek to the start.
Do Until bTwo(lSeek) = bOne(lPos) 'Seek forward in sOne until there is a match.
lPos = lPos + 1 'Increment the seek counter.
If lPos > lBoundOne Then 'Check for overbound.
lPos = lBoundOne 'Decrement it again.
Exit Do 'And bail out of the loop.
End If
Loop
lSize = lSeek 'Initialize with the current start index.
Do Until bTwo(lSize) <> bOne(lPos) 'Concurrently seek forward in both until a mismatch.
lSize = lSize + 1 'Increment the sTwo index.
lPos = lPos + 1 'Increment the sOne index.
If lPos > lBoundOne Then 'Check for overbound.
lPos = lBoundOne 'Decrement it again.
lSize = lSize - 1 'Same with the other indicator.
Exit Do 'Bail.
End If
If lSize > lBoundTwo Then 'Ditto.
lSize = lBoundTwo
lPos = lPos - 1
Exit Do
End If
Loop
lResult = lSize - lSeek 'Calculate how long the match was.
If lResult > lLength Then 'Check against the last found match.
lStart = lSeek 'Update the found starting position.
lLength = lResult 'Update the longest find.
End If
If lLength + lSeek > lBoundTwo Then Exit For 'Test for early exit. No possible matches left.
Next lSeek
If lLength <> 0 Then 'Make sure there was a match found.
SeekMatchB = Mid$(sTwo, lStart + 1, lLength) 'Return the resulting matched string.
End If
ErrHand:
If Err.Number <> 0 Then 'Check for errors.
Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
"in SeekMatchB.", vbCritical)
End If
End Function
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|