Results 1 to 11 of 11

Thread: String Quasi-Matching

  1. #1

    Thread Starter
    Fanatic Member FireXtol's Avatar
    Join Date
    Apr 2010
    Posts
    874

    String Quasi-Matching

    I'm looking for a way to compare two strings, and return two percentages for each string's match against the other.

    Code:
    Private Function quasiString(str1 , str2) as Boolean 'return true if both match => 50%, _
    or one matches => 66% and the other is not below a certain threshold%, _
    or if either is 100%
    
    'Like passing "The Simpsons", "Simpsons"
    'would be 73% match and 100% match, so the function returns true(ignoring spaces)
    'Or passing "The Simpsons", "Simpsons, The"
    'would be 100% for both, ignoring also punctuation
    But I don't just want a binary one to one.

    I was thinking of splitting the strings using space(" ") as the delimiter, and comparing each substr individually.

    I was also thinking, to reduce the number of comparisons required, to Instr the array of substrs against the other full string, but this would also match strings inside of other substrs("How to: Ship Building" ,"NARUTO Shippiden", 35% and 40% instead of the 'correct' 0% and 0%), so 2 additional checks would be required to ensure it has a space or nothing(start/end of the string) surrounding it.

    It'd be useful if it could also match slightly mis-spelled variations of a word. I'm working mostly with proper names, so spell checking is somewhat limited(maybe I could have my app Google it ).

    I'm open to your ideas! (Do functions like this already exist? Are they any good?)
    Last edited by FireXtol; May 11th, 2010 at 09:23 PM.

  2. #2

  3. #3

    Thread Starter
    Fanatic Member FireXtol's Avatar
    Join Date
    Apr 2010
    Posts
    874

    Re: String Quasi-Matching

    Single word palindromes, yes, multi-word palindromes, unlikely(but it's probably possible, given the 'right' string to compare it against). I'm not sure I fully grok you.

    My basic goal is to check each word in one string against each word in another string. Allowing for slight typos(not really sure how to accomplish this), grammatical errors, and indefinite white space.
    Last edited by FireXtol; May 12th, 2010 at 08:12 AM.

  4. #4
    Next Of Kin baja_yu's Avatar
    Join Date
    Aug 2002
    Location
    /dev/root
    Posts
    5,989

    Re: String Quasi-Matching

    Your check could consist of two parts (after both strings are split into by " "). First check each word against all words in the second string to see if they match and remember those that do.

    Then do another go through all unmatched words. This time you could count occurances of specific letters to see how similar they are. For example, Tool and Toot would have a 75% similarity. The downside to this is that words Tool and Loto would be a 100% match.

    You could also, for this second check, compare words character by character to see how many match at exact positions, something like:

    a1. T
    a2. o
    a3. o
    a4. l

    compared against

    b1. T
    b2. o
    b3. o
    b4. t

    You would compare a1 with b1, a2 with b2 etc. The problem here is words that are different in lenghts. You might have to compare the shorter word with all possible subsets of the longer. For example:

    a1. T
    a2. o
    a3. o
    a4. l

    b1. M
    b2. e

    Compare b1 with a1, b2 with a2; then, compare b1 with a2, b2 with a3, then b1 with a3 and finaly b2 with a4.

  5. #5
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,974

    Re: String Quasi-Matching

    In terms of allowing for slight typos, I would recommend this CodeBank thread:
    SoundEx and Levenshtein Distance Algorithms

  6. #6

    Thread Starter
    Fanatic Member FireXtol's Avatar
    Join Date
    Apr 2010
    Posts
    874

    Re: String Quasi-Matching

    Now that gives me an idea! I was thinking something along the lines of checking the first 'part' of a string, and the 'last part' of a string.

    So, basically, set a threshold on the difference between two Len(substr(n)), like maybe 2 characters(or just 1)

    Code:
    Dim substr(1) as String
    If abs(len(substr(1)) - len(substr(0))) =< 2 then
      left$(substr(0),1) = left$(substr(1),1) Then
        '...
      end if
    end if
    Perform a test for the second characters, third, etc... is this is greater than 66&#37; it matches. If that fails to match, attempt it from the right side.

    But this would also likely match opposites. Like(100%), unlike(66%); operative(100%), inoperative(82%). Perhaps enforce a minimum of 1 matching character on either side.

    Thank ya, baja!

  7. #7

    Thread Starter
    Fanatic Member FireXtol's Avatar
    Join Date
    Apr 2010
    Posts
    874

    Re: String Quasi-Matching

    Quote Originally Posted by si_the_geek View Post
    In terms of allowing for slight typos, I would recommend this CodeBank thread:
    SoundEx and Levenshtein Distance Algorithms
    Interesting, though it doesn't work consistently.... The Left$(workStr, 4) sometimes exacerbates this. It seems as the word gets longer, the characters eventually become completely insignificant.

    Code:
    Dim cP As New clsPhoneme
    Dim subStr(1) As String
    
    subStr(0) = cP.GetSoundexWord("character")
    subStr(1) = cP.GetSoundexWord("characteristic")
    
    Debug.Print subStr(0), subStr(1) 'these are the same...
    
    Debug.Print cP.GetLevenshteinDistance(subStr(0), subStr(1)) 'so this is 0, no difference
    Set cP = Nothing
    Perhaps I'm not understanding how to use these two functions together.

    EDIT: I removed all the replaceMask sets and checks(and removed the left$(str,4); in the Soundex function), and it appears to work much better now. I posted my changes to the Codebank thread, so hopefully others can benefit from this interesting class.
    Last edited by FireXtol; May 12th, 2010 at 09:15 AM.

  8. #8
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,974

    Re: String Quasi-Matching

    I personally haven't used Levenshtein Distance before (or really know what it is!), but others have recommended it for this kind of thing before.

    I don't know why SoundEx is limited to 4 digits by default, the full length seems much better to me!

  9. #9

    Thread Starter
    Fanatic Member FireXtol's Avatar
    Join Date
    Apr 2010
    Posts
    874

    Re: String Quasi-Matching

    I have this all in a single form, but you can put it wherever you like.

    vb Code:
    1. Private Enum eSearchMethod
    2.   StrictWord = 1
    3.   QuickWord = 2
    4.   Phonetic = 4
    5.   Exhaustive = 7 'DO ALL OF THEM!
    6. End Enum
    7.  
    8. Private Function quasiString(Str1 As String, Str2 As String, Optional Confidence1, Optional Confidence2, Optional Threshold As Double = 0.666, Optional Absolute As Boolean = False, Optional CmpMethod As eSearchMethod, Optional PhoneticThreshold As Long = 1) As Boolean
    9. 'accepts two strings to compare: Str1, Str2
    10. 'returns confidence (optional)
    11. 'use a specific threshold, default 66.6% = 2/3rds
    12. 'if Absolute, then both confidence MUST be 1(100%) ("American Dad", "American Idol") problems...
    13. 'if textual comparisons fail, Phonetic MAY(additonally optionally) performed,...
    14.  
    15. Dim asStr() As String ''1.a' substrs
    16. Dim bsStr() As String ''2.b' substrs I don't like multi-dimension arrays!
    17.  
    18. Dim cmpStrA As Variant
    19. Dim cmpStrB As Variant
    20.  
    21. Dim WordMatches(1) As Long 'how many matches
    22. Dim WordCompares(1) As Long 'how many searches/comparisons
    23. Dim C As Long
    24. Dim bSideGood(1) As Boolean 'is the character next to the quickword term 'space'?
    25. Dim Confidence(1) As Double
    26. If Len(Str1) And Len(Str2) Then
    27.  
    28.   'prepare the substr arrays
    29.   If InStr(Str1, " ") > 0 Then
    30.     asStr = Split(Str1, " ")
    31.   Else
    32.     ReDim asStr(0)
    33.     asStr(0) = Str1
    34.   End If
    35.   'and the other string
    36.   If InStr(Str2, " ") > 0 Then
    37.     bsStr = Split(Str2, " ")
    38.   Else
    39.     ReDim bsStr(0)
    40.     bsStr(0) = Str2
    41.   End If
    42.    
    43.   If (CmpMethod And StrictWord) = StrictWord Then
    44.    
    45.     For Each cmpStrB In bsStr
    46.      
    47.       For Each cmpStrA In asStr
    48.         'iterate the total
    49.         WordCompares(0) = WordCompares(0) + 1
    50.         WordCompares(1) = WordCompares(1) + 1
    51.                
    52.         If LCase$(cmpStrA) = LCase$(cmpStrB) Then
    53.           'store/iterate to calc hit %
    54.           WordMatches(0) = WordMatches(0) + 1
    55.           WordMatches(1) = WordMatches(1) + 1
    56.           Exit For
    57.         ElseIf (CmpMethod And Phonetic) = Phonetic Then
    58.           If ComparePhonetic(cmpStrA, cmpStrB) <= PhoneticThreshold Then
    59.             WordMatches(0) = WordMatches(0) + 1
    60.             WordMatches(1) = WordMatches(1) + 1
    61.             Exit For
    62.           End If
    63.         End If
    64.        
    65.       Next cmpStrA
    66.     Next cmpStrB
    67.    
    68.    
    69.    
    70.   End If
    71.  
    72.    cmpStrA = ""
    73.    cmpStrB = ""
    74.    
    75.   If (CmpMethod And quickWord) = quickWord Then
    76.     For Each cmpStrA In asStr
    77.       C = InStr(Str2, cmpStrA)
    78.       If C > 0 Then
    79.         'check left side for space/nothing
    80.         If C = 1 Then
    81.           bSideGood(0) = True
    82.         ElseIf Mid$(Str2, C - 1, 1) = " " Then
    83.           bSideGood(0) = True
    84.         End If
    85.        
    86.         'check right side for space/nothing
    87.         If C + Len(cmpStrA) >= Len(Str2) Then
    88.           bSideGood(1) = True
    89.         ElseIf Mid$(Str2, C + Len(cmpStrA), 1) = " " Then
    90.           bSideGood(1) = True
    91.         End If
    92.        
    93.         If bSideGood(0) And bSideGood(1) Then WordMatches(0) = WordMatches(0) + 1
    94.       End If
    95.       WordCompares(0) = WordCompares(0) + 1
    96.       bSideGood(0) = False
    97.       bSideGood(1) = False
    98.     Next cmpStrA
    99.    
    100.     For Each cmpStrA In bsStr
    101.       C = InStr(Str1, cmpStrA)
    102.       If C > 0 Then
    103.         'check the left side
    104.         If C = 1 Then
    105.           bSideGood(0) = True
    106.         ElseIf Mid$(Str1, C - 1, 1) = " " Then
    107.           bSideGood(0) = True
    108.         End If
    109.        
    110.         'check right side
    111.         If C + Len(cmpStrA) >= Len(Str1) Then
    112.           bSideGood(1) = True
    113.         ElseIf Mid$(Str1, C + Len(cmpStrA), 1) = " " Then
    114.           bSideGood(1) = True
    115.         End If
    116.        
    117.         If bSideGood(0) And bSideGood(1) Then WordMatches(1) = WordMatches(1) + 1
    118.       End If
    119.       WordCompares(1) = WordCompares(1) + 1
    120.       bSideGood(0) = False
    121.       bSideGood(1) = False
    122.     Next cmpStrA
    123.   End If
    124.      
    125.     Debug.Print "bleh"; WordMatches(0); WordCompares(0); WordMatches(1); WordCompares(1),
    126.    
    127.   If (CmpMethod And Exhaustive) = Exhaustive Then
    128.     Confidence(0) = WordMatches(0) / WordCompares(0)
    129.     Confidence(1) = WordMatches(1) / WordCompares(1)
    130.   Else
    131.     Confidence(0) = WordMatches(0) / (UBound(bsStr) + 1)
    132.     Confidence(1) = WordMatches(1) / (UBound(asStr) + 1)
    133.   End If
    134.  
    135.   If Confidence(0) > 1 Then
    136.     Confidence(1) = Confidence(1) / Confidence(0)
    137.     Confidence(0) = 1
    138.   ElseIf Confidence(1) > 1 Then
    139.     Confidence(0) = Confidence(0) / Confidence(1)
    140.     Confidence(1) = 1
    141.   End If
    142.  
    143.  
    144.     If Absolute And ((Confidence(0) = 1) And (Confidence(1) = 1)) Then
    145.       quasiString = True
    146.     ElseIf Not Absolute And (Confidence(0) > Threshold) And (Confidence(1) > Threshold) Then
    147.       quasiString = True
    148.     End If
    149.  
    150.    
    151.   If Not IsMissing(Confidence1) Then Confidence1 = Confidence(0)
    152.   If Not IsMissing(Confidence2) Then Confidence2 = Confidence(1)
    153.  
    154. End If
    155. End Function
    156.  
    157. Private Function Soundex(argWord As String)
    158. 'don't call directly
    159. Dim workStr As String, i As Long
    160.  
    161.     '// Capitalize it to remove ambiguity
    162.     argWord = UCase$(argWord)
    163.    
    164.     '// 1. Retain the first letter of the string
    165.     workStr = Left$(argWord, 1)
    166.    
    167.     '// 2. Replacement
    168.     '   [a, e, h, i, o, u, w, y] = 0
    169.     '   [b, f, p, v] = 1
    170.     '   [c, g, j, k, q, s, x, z] = 2
    171.     '   [d, t] = 3
    172.     '   [l] = 4
    173.     '   [m, n] = 5
    174.     '   [r] = 6
    175.    
    176.     For i = 2 To Len(argWord)
    177.         Select Case Mid$(argWord, i, 1)
    178.             Case "B", "F", "P", "V"
    179.                     workStr = workStr & Chr$(49) '// 1
    180.             Case "C", "G", "J", "K", "Q", "S", "X", "Z"
    181.                     workStr = workStr & Chr$(50) '// 2
    182.             Case "D", "T"
    183.                     workStr = workStr & Chr$(51) '// 3
    184.             Case "L"
    185.                     workStr = workStr & Chr$(52) '// 4
    186.             Case "M", "N"
    187.                     workStr = workStr & Chr$(53) '// 5
    188.             Case "R"
    189.                     workStr = workStr & Chr$(56) '// 6
    190.             '// A, E, H, I, O, U, W, Y do nothing
    191.         End Select
    192.     Next i
    193.    
    194.     '// 5. Return the first four bytes padded with 0.
    195.     If Len(workStr) > 4 Then
    196.         Soundex = workStr
    197.     Else
    198.         Soundex = workStr & Space$(4 - Len(workStr))
    199.     End If
    200. End Function
    201.  
    202. '// Returns the Minimum of 3 numbers
    203. Private Function min3(ByVal n1 As Long, ByVal n2 As Long, ByVal n3 As Long) As Long
    204. 'don't call directly
    205.     min3 = n1
    206.     If n2 < min3 Then min3 = n2
    207.     If n3 < min3 Then min3 = n3
    208. End Function
    209.  
    210. '// Returns the Levenshtein Distance between 2 strings.
    211. Private Function LevenshteinDistance(argStr1 As String, argStr2 As String) As Long
    212. 'don't call directly
    213. Dim m As Long, n As Long
    214. Dim editMatrix() As Long, i As Long, j As Long, cost As Long
    215. Dim str1_i As String, str2_j As String
    216. Dim p() As Long, q() As Long, r As Long
    217. Dim X As Long, y As Long
    218.  
    219.     n = Len(argStr1)
    220.     m = Len(argStr2)
    221.    
    222.     'If (n = 0) Or (m = 0) Then Exit Function
    223.     ReDim editMatrix(n, m) As Long
    224.    
    225.    
    226.     For i = 0 To n
    227.         editMatrix(i, 0) = i
    228.     Next
    229.    
    230.     For j = 0 To m
    231.         editMatrix(0, j) = j
    232.     Next
    233.    
    234.     For i = 1 To n
    235.         str1_i = Mid$(argStr1, i, 1)
    236.         For j = 1 To m
    237.             str2_j = Mid$(argStr2, j, 1)
    238.             If str1_i = str2_j Then
    239.                 cost = 0
    240.             Else
    241.                 cost = 1
    242.             End If
    243.            
    244.             editMatrix(i, j) = min3(editMatrix(i - 1, j) + 1, editMatrix(i, j - 1) + 1, editMatrix(i - 1, j - 1) + cost)
    245.         Next j
    246.     Next i
    247.            
    248.     LevenshteinDistance = editMatrix(n, m)
    249.     Erase editMatrix
    250. End Function
    251.  
    252. Private Function ComparePhonetic(ByVal inputStr1 As String, ByVal inputStr2 As String) As Long
    253. 'single interface function, use this
    254. inputStr1 = Soundex(inputStr1)
    255. inputStr2 = Soundex(inputStr2)
    256.  
    257.     If inputStr1 = vbNullString Then
    258.         ComparePhonetic = Len(inputStr2)
    259.     ElseIf inputStr2 = vbNullString Then
    260.         ComparePhonetic = Len(inputStr1)
    261.     Else
    262.         ComparePhonetic = LevenshteinDistance(inputStr1, inputStr2)
    263.     End If
    264. End Function

    Sample calls:
    vb Code:
    1. Dim dC(1) As Double
    2. Debug.Print quasiString("tool", "toot", dC(0), dC(1), , True, Exhaustive)
    3. Debug.Print dC(0), dC(1)
    4.  
    5. Debug.Print quasiString("tool", "toot", dC(0), dC(1), , True, Phonetic or StrictWord, 1) 'falls back on Phonetic comparison if a textual match is not found
    6. Debug.Print dC(0), dC(1)
    7.  
    8. Debug.Print quasiString("top gear", "gear top", dC(0), dC(1), , True, QuickWord)
    9. Debug.Print dC(0), dC(1)

    Comments, suggestions, improvements? Thanks everyone!

    EDIT: Absolute now checks for both Confidence values to be = 1. The elseif should of checked for Not Absolute... fixed.
    Last edited by FireXtol; May 12th, 2010 at 03:47 PM. Reason: Fixed some more logic....

  10. #10
    PowerPoster Spoo's Avatar
    Join Date
    Nov 2008
    Location
    Right Coast
    Posts
    2,656

    Re: String Quasi-Matching

    Fire

    Comment -- looks like you found this site

    Spoo

  11. #11

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width