Results 1 to 5 of 5

Thread: Need ideas for a string function optimization

  1. #1

    Thread Starter
    Fanatic Member Comintern's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln, NE
    Posts
    826

    Need ideas for a string function optimization

    I have a function for finding the longest matching substring in two string that is in need of some help. If you pass it two different strings, it will return the longest string fragment they have in common. I.e., if you had these two strings,

    "This is a test string."
    "What is a test object."

    it would return " is a test ". Usually calls to it will be much more complex . A good baseline call would be a string with 2000 characters, and the same string with 10 arbitrary inserts and 10 arbitrary deletes. This takes an average of about 5 seconds on my machine. Currently, my code is spending about 99% percent of processing time in this function, so it's in obvious need of any optimizations I can come up with. I tried API functions to replace Mid$, but they were a bit slower--I also might try using byte arrays.

    Here's the main function:
    VB Code:
    1. Private Function SeekMatch(ByVal sOne As String, ByVal sTwo As String) As String
    2.  
    3.     'This function finds the longest substring of sTwo that exists in sOne, and returns it.  No matches
    4.     'returns a vbNullString.
    5.  
    6.     Dim sTest As String, lSize As Long, lStart As Long, lResult As Long, lLength As Long
    7.  
    8.     On Error GoTo ErrHand
    9.    
    10.     lLength = Len(sTwo)
    11.    
    12.     For lSize = lLength - 1 To 1 Step -1            'Loop through all posible substring lengths, descending.
    13.         For lStart = lLength - lSize To 1 Step -1   'Loop through all possible substring positions.
    14.             sTest = Mid$(sTwo, lStart, lSize)       'Get the substring from the Buffer string.
    15.             If InStrB(1, sOne, sTest) <> 0 Then     'If there is a match,
    16.                 SeekMatch = sTest                   'Return and exit.
    17.                 Exit Function
    18.             End If
    19.         Next lStart
    20.     Next lSize
    21.  
    22. ErrHand:
    23.     If Err.Number <> 0 Then                                
    24.         Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
    25.                     "in SeekMatch.", vbCritical)
    26.     End If
    27.    
    28. End Function
    Here are a couple test functions I use to run benchmarks:
    VB Code:
    1. Private Declare Function GetTickCount& Lib "kernel32" ()
    2.  
    3. Private Sub BenchMark()
    4.  
    5.     Dim sString1 As String, sString2 As String, lStart As Long, lEnd As Long, sOut As String
    6.    
    7.     sString1 = RandomString(2000)
    8.     sString2 = RandomChanges(sString1, 10, 10, 5)
    9.  
    10.     Debug.Print sString1
    11.     Debug.Print sString2
    12.    
    13.     lStart = GetTickCount
    14.     sOut = SeekMatch(sString1, sString2)
    15.     lEnd = GetTickCount
    16.  
    17.     Debug.Print sOut
    18.     Debug.Print lEnd - lStart
    19.    
    20. End Sub
    21.  
    22. Private Function RandomString(lChars As Long) As String
    23.  
    24.     Dim lCount As Long, bTemp() As Byte
    25.    
    26.     ReDim bTemp(lChars)
    27.     For lCount = LBound(bTemp) To UBound(bTemp)
    28.         bTemp(lCount) = Rand(33, 126)
    29.     Next lCount
    30.  
    31.     RandomString = StrConv(bTemp, vbUnicode)
    32.  
    33. End Function
    34.  
    35. Private Function RandomChanges(ByVal sString As String, lInserts As Long, lDeletes As Long, _
    36.                                Optional lBoundry As Long = 10) As String
    37.  
    38.     Dim lCount As Long, sOut As String, lStart As Long, lEnd As Long, sInsert As String, lLen As Long
    39.     Dim sLeft As String, sRight As String, lTemp As Long
    40.    
    41.     For lCount = 1 To lInserts
    42.         lLen = Len(sString)
    43.         lStart = Rand(1, lLen)
    44.         sInsert = RandomString(Rand(1, lBoundry))
    45.         If lStart = 0 Then
    46.             sString = sInsert & sString
    47.         ElseIf lStart = lLen Then
    48.             sString = sString & sInsert
    49.         Else
    50.             sLeft = Left$(sString, lStart)
    51.             sRight = Right$(sString, lLen - lStart)
    52.             sString = sLeft & sInsert & sRight
    53.         End If
    54.     Next lCount
    55.  
    56.     For lCount = 1 To lDeletes
    57.         lLen = Len(sString)
    58.         lStart = Rand(1, lLen)
    59.         If lStart + lBoundry > lLen Then
    60.             lEnd = lLen
    61.         Else
    62.             lEnd = Rand(lStart, lStart + lBoundry)
    63.         End If
    64.         If lStart = 1 And lEnd = lLen Then
    65.             sString = vbNullString
    66.             Exit For
    67.         End If
    68.         If lEnd = lLen Then
    69.             sString = Left$(sString, lStart - 1)
    70.         ElseIf lStart = 1 Then
    71.             sString = Right$(sString, lLen - lEnd)
    72.         Else
    73.             sLeft = Left$(sString, lStart - 1)
    74.             sRight = Right$(sString, lLen - lEnd)
    75.             sString = sLeft & sRight
    76.         End If
    77.     Next lCount
    78.  
    79.     RandomChanges = sString
    80.  
    81. End Function
    82.  
    83. Private Function Rand(lLow As Long, lHigh As Long) As Long
    84.  
    85.     Rand = Int((lHigh - lLow + 1) * Rnd + lLow)
    86.  
    87. End Function

  2. #2
    Hyperactive Member
    Join Date
    Jun 2004
    Posts
    468

    Re: Need ideas for a string function optimization

    The following tested faster on longer (2000-character) strings, but slower on shorter ones:
    VB Code:
    1. Private Function MySeekMatch(ByVal sOne As String, ByVal sTwo As String) As String
    2.  
    3.     'This function finds the longest substring of sTwo that exists in sOne, and returns it.  No matches
    4.     'returns a vbNullString.
    5.  
    6.     Dim iLongest As String  'Index into sTwo of longest match found.
    7.     Dim lLongest As Long    'Length of longest match found.
    8.  
    9.     Dim i As Long
    10.     Dim j As Long
    11.     Dim k As Long
    12.    
    13.     On Error GoTo ErrHand
    14.    
    15.     iLongest = 0
    16.     lLongest = 0
    17.    
    18.     i = 1
    19.     Do Until i > (Len(sOne) - lLongest)
    20.       j = 1
    21.       Do Until j > (Len(sTwo) - lLongest)
    22.         'Only look for substrings longer than the current longest substring.
    23.         For k = lLongest + 1 To Len(sTwo) - j + 1
    24.           If Mid$(sOne, i, k) <> Mid$(sTwo, j, k) Then Exit For
    25.          
    26.           iLongest = j
    27.           lLongest = k
    28.         Next
    29.        
    30.         j = j + 1
    31.       Loop
    32.      
    33.       i = i + 1
    34.     Loop
    35.    
    36.     If lLongest > 0 Then MySeekMatch = Mid$(sTwo, iLongest, lLongest)
    37.  
    38. ErrHand:
    39.     If Err.Number <> 0 Then
    40.         Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
    41.                     "in SeekMatch.", vbCritical)
    42.     End If
    43.    
    44. End Function
    But you're right, using byte arrays would probably yield a significant increase in speed.

    Somewhere, I have Sedgewick's book of algorithms...
    Last edited by bpd; Mar 3rd, 2006 at 07:09 PM.

  3. #3
    Hyperactive Member
    Join Date
    Jun 2004
    Posts
    468

    Re: Need ideas for a string function optimization

    BTW, your function does not appear to be correct in the case where the largest substring is found at the end of the strings. For example, if you give it two identical, 10-character strings, your function returns only the first 9 characters as the match.

  4. #4

    Thread Starter
    Fanatic Member Comintern's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln, NE
    Posts
    826

    Re: Need ideas for a string function optimization

    Quote Originally Posted by bpd
    BTW, your function does not appear to be correct in the case where the largest substring is found at the end of the strings. For example, if you give it two identical, 10-character strings, your function returns only the first 9 characters as the match.
    No, that's intentional. The current implementation strips all matching characters from the right and the left before the function is called.

  5. #5

    Thread Starter
    Fanatic Member Comintern's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln, NE
    Posts
    826

    Re: Need ideas for a string function optimization

    OK, byte array searches are several orders of magnitude faster, but I can't get them to always be as accurate. Ultimately it won't matter, because the result it returns is always a subset of the truest long match, so I can fix it outside the function. I can change code elsewhere to account for the loss in accuracy in exchange for a hundred fold speed increase .
    VB Code:
    1. Private Function SeekMatchB(ByVal sOne As String, ByVal sTwo As String) As String
    2.  
    3.     'This function finds the longest substring of sTwo that exists in sOne, and returns it.  No matches
    4.     'returns a vbNullString.
    5.  
    6.     Dim sTest As String, lSize As Long, lSeek As Long, lResult As Long, lBoundTwo As Long, lPos As Long
    7.     Dim bOne() As Byte, bTwo() As Byte, lStart As Long, lLength As Long, lBoundOne As Long
    8.     Dim lFloor As Long
    9.  
    10.     On Error GoTo ErrHand
    11.    
    12.     bOne = StrConv(sOne, vbFromUnicode)                 'Write sOne into an array.
    13.     bTwo = StrConv(sTwo, vbFromUnicode)                 'Write sTwo into an array.
    14.    
    15.     lBoundTwo = UBound(bTwo)                            'Store the UBound of bTwo.
    16.     lBoundOne = UBound(bOne)                            'Store the UBound of bOne.
    17.     lFloor = LBound(bTwo)                               'Both arrays LBound will be the same.
    18.    
    19.     For lSeek = lFloor To lBoundTwo
    20.         lPos = lFloor                                   'Reset the string one seek to the start.
    21.         Do Until bTwo(lSeek) = bOne(lPos)               'Seek forward in sOne until there is a match.
    22.             lPos = lPos + 1                             'Increment the seek counter.
    23.             If lPos > lBoundOne Then                    'Check for overbound.
    24.                 lPos = lBoundOne                        'Decrement it again.
    25.                 Exit Do                                 'And bail out of the loop.
    26.             End If
    27.         Loop
    28.         lSize = lSeek                                   'Initialize with the current start index.
    29.         Do Until bTwo(lSize) <> bOne(lPos)              'Concurrently seek forward in both until a mismatch.
    30.             lSize = lSize + 1                           'Increment the sTwo index.
    31.             lPos = lPos + 1                             'Increment the sOne index.
    32.             If lPos > lBoundOne Then                    'Check for overbound.
    33.                 lPos = lBoundOne                        'Decrement it again.
    34.                 lSize = lSize - 1                       'Same with the other indicator.
    35.                 Exit Do                                 'Bail.
    36.             End If
    37.             If lSize > lBoundTwo Then                   'Ditto.
    38.                 lSize = lBoundTwo
    39.                 lPos = lPos - 1
    40.                 Exit Do
    41.             End If
    42.         Loop
    43.         lResult = lSize - lSeek                         'Calculate how long the match was.
    44.         If lResult > lLength Then                       'Check against the last found match.
    45.             lStart = lSeek                              'Update the found starting position.
    46.             lLength = lResult                           'Update the longest find.
    47.         End If
    48.         If lLength + lSeek > lBoundTwo Then Exit For    'Test for early exit.  No possible matches left.
    49.     Next lSeek
    50.    
    51.     If lLength <> 0 Then                                'Make sure there was a match found.
    52.         SeekMatchB = Mid$(sTwo, lStart + 1, lLength)    'Return the resulting matched string.
    53.     End If
    54.    
    55. ErrHand:
    56.     If Err.Number <> 0 Then                             'Check for errors.
    57.         Call MsgBox("Error number " & Err.Number & vbCrLf & Err.Description & vbCrLf & _
    58.                     "in SeekMatchB.", vbCritical)
    59.     End If
    60.    
    61. End Function

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width