|
-
May 11th, 2010, 09:19 PM
#1
Thread Starter
Fanatic Member
String Quasi-Matching
I'm looking for a way to compare two strings, and return two percentages for each string's match against the other.
Code:
Private Function quasiString(str1 , str2) as Boolean 'return true if both match => 50%, _
or one matches => 66% and the other is not below a certain threshold%, _
or if either is 100%
'Like passing "The Simpsons", "Simpsons"
'would be 73% match and 100% match, so the function returns true(ignoring spaces)
'Or passing "The Simpsons", "Simpsons, The"
'would be 100% for both, ignoring also punctuation
But I don't just want a binary one to one.
I was thinking of splitting the strings using space(" ") as the delimiter, and comparing each substr individually.
I was also thinking, to reduce the number of comparisons required, to Instr the array of substrs against the other full string, but this would also match strings inside of other substrs("How to: Ship Building" ,"NARUTO Shippiden", 35% and 40% instead of the 'correct' 0% and 0%), so 2 additional checks would be required to ensure it has a space or nothing(start/end of the string) surrounding it.
It'd be useful if it could also match slightly mis-spelled variations of a word. I'm working mostly with proper names, so spell checking is somewhat limited(maybe I could have my app Google it ).
I'm open to your ideas! (Do functions like this already exist? Are they any good?)
Last edited by FireXtol; May 11th, 2010 at 09:23 PM.
-
May 11th, 2010, 11:28 PM
#2
Re: String Quasi-Matching
Would palindromes be 100% matches or would they be 0% matches?
-
May 12th, 2010, 07:47 AM
#3
Thread Starter
Fanatic Member
Re: String Quasi-Matching
Single word palindromes, yes, multi-word palindromes, unlikely(but it's probably possible, given the 'right' string to compare it against). I'm not sure I fully grok you.
My basic goal is to check each word in one string against each word in another string. Allowing for slight typos(not really sure how to accomplish this), grammatical errors, and indefinite white space.
Last edited by FireXtol; May 12th, 2010 at 08:12 AM.
-
May 12th, 2010, 08:00 AM
#4
Re: String Quasi-Matching
Your check could consist of two parts (after both strings are split into by " "). First check each word against all words in the second string to see if they match and remember those that do.
Then do another go through all unmatched words. This time you could count occurances of specific letters to see how similar they are. For example, Tool and Toot would have a 75% similarity. The downside to this is that words Tool and Loto would be a 100% match.
You could also, for this second check, compare words character by character to see how many match at exact positions, something like:
a1. T
a2. o
a3. o
a4. l
compared against
b1. T
b2. o
b3. o
b4. t
You would compare a1 with b1, a2 with b2 etc. The problem here is words that are different in lenghts. You might have to compare the shorter word with all possible subsets of the longer. For example:
a1. T
a2. o
a3. o
a4. l
b1. M
b2. e
Compare b1 with a1, b2 with a2; then, compare b1 with a2, b2 with a3, then b1 with a3 and finaly b2 with a4.
-
May 12th, 2010, 08:18 AM
#5
Re: String Quasi-Matching
In terms of allowing for slight typos, I would recommend this CodeBank thread:
SoundEx and Levenshtein Distance Algorithms
-
May 12th, 2010, 08:25 AM
#6
Thread Starter
Fanatic Member
Re: String Quasi-Matching
Now that gives me an idea! I was thinking something along the lines of checking the first 'part' of a string, and the 'last part' of a string.
So, basically, set a threshold on the difference between two Len(substr(n)), like maybe 2 characters(or just 1)
Code:
Dim substr(1) as String
If abs(len(substr(1)) - len(substr(0))) =< 2 then
left$(substr(0),1) = left$(substr(1),1) Then
'...
end if
end if
Perform a test for the second characters, third, etc... is this is greater than 66% it matches. If that fails to match, attempt it from the right side.
But this would also likely match opposites. Like(100%), unlike(66%); operative(100%), inoperative(82%). Perhaps enforce a minimum of 1 matching character on either side.
Thank ya, baja!
-
May 12th, 2010, 09:03 AM
#7
Thread Starter
Fanatic Member
Re: String Quasi-Matching
 Originally Posted by si_the_geek
Interesting, though it doesn't work consistently.... The Left$(workStr, 4) sometimes exacerbates this. It seems as the word gets longer, the characters eventually become completely insignificant.
Code:
Dim cP As New clsPhoneme
Dim subStr(1) As String
subStr(0) = cP.GetSoundexWord("character")
subStr(1) = cP.GetSoundexWord("characteristic")
Debug.Print subStr(0), subStr(1) 'these are the same...
Debug.Print cP.GetLevenshteinDistance(subStr(0), subStr(1)) 'so this is 0, no difference
Set cP = Nothing
Perhaps I'm not understanding how to use these two functions together. 
EDIT: I removed all the replaceMask sets and checks(and removed the left$(str,4); in the Soundex function), and it appears to work much better now. I posted my changes to the Codebank thread, so hopefully others can benefit from this interesting class.
Last edited by FireXtol; May 12th, 2010 at 09:15 AM.
-
May 12th, 2010, 09:10 AM
#8
Re: String Quasi-Matching
I personally haven't used Levenshtein Distance before (or really know what it is!), but others have recommended it for this kind of thing before.
I don't know why SoundEx is limited to 4 digits by default, the full length seems much better to me!
-
May 12th, 2010, 01:30 PM
#9
Thread Starter
Fanatic Member
Re: String Quasi-Matching
I have this all in a single form, but you can put it wherever you like.
vb Code:
Private Enum eSearchMethod
StrictWord = 1
QuickWord = 2
Phonetic = 4
Exhaustive = 7 'DO ALL OF THEM!
End Enum
Private Function quasiString(Str1 As String, Str2 As String, Optional Confidence1, Optional Confidence2, Optional Threshold As Double = 0.666, Optional Absolute As Boolean = False, Optional CmpMethod As eSearchMethod, Optional PhoneticThreshold As Long = 1) As Boolean
'accepts two strings to compare: Str1, Str2
'returns confidence (optional)
'use a specific threshold, default 66.6% = 2/3rds
'if Absolute, then both confidence MUST be 1(100%) ("American Dad", "American Idol") problems...
'if textual comparisons fail, Phonetic MAY(additonally optionally) performed,...
Dim asStr() As String ''1.a' substrs
Dim bsStr() As String ''2.b' substrs I don't like multi-dimension arrays!
Dim cmpStrA As Variant
Dim cmpStrB As Variant
Dim WordMatches(1) As Long 'how many matches
Dim WordCompares(1) As Long 'how many searches/comparisons
Dim C As Long
Dim bSideGood(1) As Boolean 'is the character next to the quickword term 'space'?
Dim Confidence(1) As Double
If Len(Str1) And Len(Str2) Then
'prepare the substr arrays
If InStr(Str1, " ") > 0 Then
asStr = Split(Str1, " ")
Else
ReDim asStr(0)
asStr(0) = Str1
End If
'and the other string
If InStr(Str2, " ") > 0 Then
bsStr = Split(Str2, " ")
Else
ReDim bsStr(0)
bsStr(0) = Str2
End If
If (CmpMethod And StrictWord) = StrictWord Then
For Each cmpStrB In bsStr
For Each cmpStrA In asStr
'iterate the total
WordCompares(0) = WordCompares(0) + 1
WordCompares(1) = WordCompares(1) + 1
If LCase$(cmpStrA) = LCase$(cmpStrB) Then
'store/iterate to calc hit %
WordMatches(0) = WordMatches(0) + 1
WordMatches(1) = WordMatches(1) + 1
Exit For
ElseIf (CmpMethod And Phonetic) = Phonetic Then
If ComparePhonetic(cmpStrA, cmpStrB) <= PhoneticThreshold Then
WordMatches(0) = WordMatches(0) + 1
WordMatches(1) = WordMatches(1) + 1
Exit For
End If
End If
Next cmpStrA
Next cmpStrB
End If
cmpStrA = ""
cmpStrB = ""
If (CmpMethod And quickWord) = quickWord Then
For Each cmpStrA In asStr
C = InStr(Str2, cmpStrA)
If C > 0 Then
'check left side for space/nothing
If C = 1 Then
bSideGood(0) = True
ElseIf Mid$(Str2, C - 1, 1) = " " Then
bSideGood(0) = True
End If
'check right side for space/nothing
If C + Len(cmpStrA) >= Len(Str2) Then
bSideGood(1) = True
ElseIf Mid$(Str2, C + Len(cmpStrA), 1) = " " Then
bSideGood(1) = True
End If
If bSideGood(0) And bSideGood(1) Then WordMatches(0) = WordMatches(0) + 1
End If
WordCompares(0) = WordCompares(0) + 1
bSideGood(0) = False
bSideGood(1) = False
Next cmpStrA
For Each cmpStrA In bsStr
C = InStr(Str1, cmpStrA)
If C > 0 Then
'check the left side
If C = 1 Then
bSideGood(0) = True
ElseIf Mid$(Str1, C - 1, 1) = " " Then
bSideGood(0) = True
End If
'check right side
If C + Len(cmpStrA) >= Len(Str1) Then
bSideGood(1) = True
ElseIf Mid$(Str1, C + Len(cmpStrA), 1) = " " Then
bSideGood(1) = True
End If
If bSideGood(0) And bSideGood(1) Then WordMatches(1) = WordMatches(1) + 1
End If
WordCompares(1) = WordCompares(1) + 1
bSideGood(0) = False
bSideGood(1) = False
Next cmpStrA
End If
Debug.Print "bleh"; WordMatches(0); WordCompares(0); WordMatches(1); WordCompares(1),
If (CmpMethod And Exhaustive) = Exhaustive Then
Confidence(0) = WordMatches(0) / WordCompares(0)
Confidence(1) = WordMatches(1) / WordCompares(1)
Else
Confidence(0) = WordMatches(0) / (UBound(bsStr) + 1)
Confidence(1) = WordMatches(1) / (UBound(asStr) + 1)
End If
If Confidence(0) > 1 Then
Confidence(1) = Confidence(1) / Confidence(0)
Confidence(0) = 1
ElseIf Confidence(1) > 1 Then
Confidence(0) = Confidence(0) / Confidence(1)
Confidence(1) = 1
End If
If Absolute And ((Confidence(0) = 1) And (Confidence(1) = 1)) Then
quasiString = True
ElseIf Not Absolute And (Confidence(0) > Threshold) And (Confidence(1) > Threshold) Then
quasiString = True
End If
If Not IsMissing(Confidence1) Then Confidence1 = Confidence(0)
If Not IsMissing(Confidence2) Then Confidence2 = Confidence(1)
End If
End Function
Private Function Soundex(argWord As String)
'don't call directly
Dim workStr As String, i As Long
'// Capitalize it to remove ambiguity
argWord = UCase$(argWord)
'// 1. Retain the first letter of the string
workStr = Left$(argWord, 1)
'// 2. Replacement
' [a, e, h, i, o, u, w, y] = 0
' [b, f, p, v] = 1
' [c, g, j, k, q, s, x, z] = 2
' [d, t] = 3
' [l] = 4
' [m, n] = 5
' [r] = 6
For i = 2 To Len(argWord)
Select Case Mid$(argWord, i, 1)
Case "B", "F", "P", "V"
workStr = workStr & Chr$(49) '// 1
Case "C", "G", "J", "K", "Q", "S", "X", "Z"
workStr = workStr & Chr$(50) '// 2
Case "D", "T"
workStr = workStr & Chr$(51) '// 3
Case "L"
workStr = workStr & Chr$(52) '// 4
Case "M", "N"
workStr = workStr & Chr$(53) '// 5
Case "R"
workStr = workStr & Chr$(56) '// 6
'// A, E, H, I, O, U, W, Y do nothing
End Select
Next i
'// 5. Return the first four bytes padded with 0.
If Len(workStr) > 4 Then
Soundex = workStr
Else
Soundex = workStr & Space$(4 - Len(workStr))
End If
End Function
'// Returns the Minimum of 3 numbers
Private Function min3(ByVal n1 As Long, ByVal n2 As Long, ByVal n3 As Long) As Long
'don't call directly
min3 = n1
If n2 < min3 Then min3 = n2
If n3 < min3 Then min3 = n3
End Function
'// Returns the Levenshtein Distance between 2 strings.
Private Function LevenshteinDistance(argStr1 As String, argStr2 As String) As Long
'don't call directly
Dim m As Long, n As Long
Dim editMatrix() As Long, i As Long, j As Long, cost As Long
Dim str1_i As String, str2_j As String
Dim p() As Long, q() As Long, r As Long
Dim X As Long, y As Long
n = Len(argStr1)
m = Len(argStr2)
'If (n = 0) Or (m = 0) Then Exit Function
ReDim editMatrix(n, m) As Long
For i = 0 To n
editMatrix(i, 0) = i
Next
For j = 0 To m
editMatrix(0, j) = j
Next
For i = 1 To n
str1_i = Mid$(argStr1, i, 1)
For j = 1 To m
str2_j = Mid$(argStr2, j, 1)
If str1_i = str2_j Then
cost = 0
Else
cost = 1
End If
editMatrix(i, j) = min3(editMatrix(i - 1, j) + 1, editMatrix(i, j - 1) + 1, editMatrix(i - 1, j - 1) + cost)
Next j
Next i
LevenshteinDistance = editMatrix(n, m)
Erase editMatrix
End Function
Private Function ComparePhonetic(ByVal inputStr1 As String, ByVal inputStr2 As String) As Long
'single interface function, use this
inputStr1 = Soundex(inputStr1)
inputStr2 = Soundex(inputStr2)
If inputStr1 = vbNullString Then
ComparePhonetic = Len(inputStr2)
ElseIf inputStr2 = vbNullString Then
ComparePhonetic = Len(inputStr1)
Else
ComparePhonetic = LevenshteinDistance(inputStr1, inputStr2)
End If
End Function
Sample calls:
vb Code:
Dim dC(1) As Double
Debug.Print quasiString("tool", "toot", dC(0), dC(1), , True, Exhaustive)
Debug.Print dC(0), dC(1)
Debug.Print quasiString("tool", "toot", dC(0), dC(1), , True, Phonetic or StrictWord, 1) 'falls back on Phonetic comparison if a textual match is not found
Debug.Print dC(0), dC(1)
Debug.Print quasiString("top gear", "gear top", dC(0), dC(1), , True, QuickWord)
Debug.Print dC(0), dC(1)
Comments, suggestions, improvements? Thanks everyone!
EDIT: Absolute now checks for both Confidence values to be = 1. The elseif should of checked for Not Absolute... fixed.
Last edited by FireXtol; May 12th, 2010 at 03:47 PM.
Reason: Fixed some more logic....
-
May 12th, 2010, 02:27 PM
#10
Re: String Quasi-Matching
Fire
Comment -- looks like you found this site 
Spoo
-
May 12th, 2010, 02:57 PM
#11
Re: String Quasi-Matching
 Originally Posted by MartinLiss
Would palindromes be 100% matches or would they be 0% matches?
Sorry but that's probably the dumbest question I've ever asked.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|