Soundex
From Wikipedia, the free encyclopedia
Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm".
The Soundex code for a name consists of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants. Similar-sounding consonants share the same number, so, for example, the labial consonants (http://en.wikipedia.org/wiki/Labial) B, F, P, and V are all encoded as 1.
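To make the letter-plus-three-digits scheme concrete, here is a minimal sketch of the classic fixed-length code. The names ClassicSoundex and SoundexDigit are made up for this illustration and are not part of the class posted in this thread. It assumes the commonly described American Soundex rules: adjacent consonants with the same digit are collapsed, H and W do not separate them, and short codes are padded with zeros.

```vb
Option Explicit

' Hypothetical helper: maps one uppercase letter to its Soundex digit.
Private Function SoundexDigit(ByVal ch As String) As String
    Select Case ch
        Case "B", "F", "P", "V": SoundexDigit = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z": SoundexDigit = "2"
        Case "D", "T": SoundexDigit = "3"
        Case "L": SoundexDigit = "4"
        Case "M", "N": SoundexDigit = "5"
        Case "R": SoundexDigit = "6"
        Case Else: SoundexDigit = ""   ' vowels, H, W, Y and non-letters carry no digit
    End Select
End Function

' Hypothetical sketch of the classic four-character code (letter + three digits).
Public Function ClassicSoundex(ByVal argWord As String) As String
    Dim i As Long
    Dim ch As String, digit As String, lastDigit As String
    Dim code As String

    argWord = UCase$(argWord)
    If Len(argWord) = 0 Then Exit Function

    ' Keep the first letter, but remember its digit so a repeat is not emitted
    code = Left$(argWord, 1)
    lastDigit = SoundexDigit(code)

    For i = 2 To Len(argWord)
        ch = Mid$(argWord, i, 1)
        digit = SoundexDigit(ch)
        ' Emit a digit only if it differs from the previous one
        If Len(digit) > 0 And digit <> lastDigit Then
            code = code & digit
        End If
        ' Assumed rule: H and W do not break a run of equal digits; vowels do
        If ch <> "H" And ch <> "W" Then lastDigit = digit
        If Len(code) = 4 Then Exit For
    Next i

    ' Pad short codes with zeros to a fixed length of four
    ClassicSoundex = Left$(code & "000", 4)
End Function
```

With this sketch, "Robert" and "Rupert" both come out as R163, which is the kind of spelling-tolerant match the algorithm is meant to give.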
Levenshtein distance
From Wikipedia, the free encyclopedia
In information theory and computer science, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is useful in applications that need to determine how similar two strings are, such as spell checkers.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten → sitten (substitution of 's' for 'k')
sitten → sittin (substitution of 'i' for 'e')
sittin → sitting (insert 'g' at the end)
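The edit count in this example can be reproduced with the usual dynamic-programming table. The following is only a sketch: LevenshteinSketch is a hypothetical name, and it is not necessarily how the class's GetLevenshteinDistance is written. It compares characters case-sensitively, assuming the default Option Compare Binary.

```vb
Option Explicit

' Hypothetical sketch: classic dynamic-programming edit distance.
Public Function LevenshteinSketch(ByVal s As String, ByVal t As String) As Long
    Dim d() As Long
    Dim i As Long, j As Long
    Dim cost As Long, best As Long

    ' d(i, j) = distance between the first i chars of s and the first j chars of t
    ReDim d(0 To Len(s), 0 To Len(t))

    ' Turning a prefix into the empty string takes one deletion per character,
    ' and building a prefix from the empty string takes one insertion per character
    For i = 0 To Len(s)
        d(i, 0) = i
    Next i
    For j = 0 To Len(t)
        d(0, j) = j
    Next j

    For i = 1 To Len(s)
        For j = 1 To Len(t)
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then
                cost = 0
            Else
                cost = 1
            End If

            best = d(i - 1, j) + 1                      ' deletion
            If d(i, j - 1) + 1 < best Then
                best = d(i, j - 1) + 1                  ' insertion
            End If
            If d(i - 1, j - 1) + cost < best Then
                best = d(i - 1, j - 1) + cost           ' substitution (or match)
            End If
            d(i, j) = best
        Next j
    Next i

    LevenshteinSketch = d(Len(s), Len(t))
End Function
```

Calling LevenshteinSketch("kitten", "sitting") gives 3, matching the three edits listed above. The table needs (Len(s) + 1) * (Len(t) + 1) cells, which is fine for single words but worth keeping in mind for long strings.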
This is a class so using it will be very easy.
Where do we use these algorithms? In spell checkers and similar software, where words have to be converted into some kind of value for proximity calculations, etc.
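I was having issues with long words (the standard four-character code cuts them off, so words that differ only near the end get the same code), so I modified this function in the class: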
vb Code:
Private Function Soundex(ByVal argWord As String) As String
    Dim workStr As String, i As Long

    '// Capitalize it to remove ambiguity
    argWord = UCase$(argWord)

    '// 1. Retain the first letter of the string
    workStr = Left$(argWord, 1)

    '// 2. Replacement
    ' [a, e, h, i, o, u, w, y] = 0 (skipped, nothing is appended)
    ' [b, f, p, v] = 1
    ' [c, g, j, k, q, s, x, z] = 2
    ' [d, t] = 3
    ' [l] = 4
    ' [m, n] = 5
    ' [r] = 6
    For i = 2 To Len(argWord)
        Select Case Mid$(argWord, i, 1)
            Case "B", "F", "P", "V"
                workStr = workStr & Chr$(49) '// 1
            Case "C", "G", "J", "K", "Q", "S", "X", "Z"
                workStr = workStr & Chr$(50) '// 2
            Case "D", "T"
                workStr = workStr & Chr$(51) '// 3
            Case "L"
                workStr = workStr & Chr$(52) '// 4
            Case "M", "N"
                workStr = workStr & Chr$(53) '// 5
            Case "R"
                workStr = workStr & Chr$(54) '// 6
            '// A, E, H, I, O, U, W, Y do nothing
        End Select
    Next i

    '// 5. Return the whole code (the original returned the first four bytes padded with 0)
    'fix: for long string compatible, do not return only the first four bytes, but all of them
    'fix2: removed padding, seemed like it did not make any difference to the GetLevenshteinDistance function
    Soundex = workStr
End Function
It seems to work much better for long words, but it may have unintended side effects. Additionally, the padding of zeros seems unnecessary, so that was removed, too.
Here's an example of how to use this (as far as I know):
vb Code:
Dim cP As New clsPhoneme
Dim subStr(1) As String
subStr(0) = cP.GetSoundexWord("electromagnet")
subStr(1) = cP.GetSoundexWord("electromagnetic")
Debug.Print subStr(0), subStr(1)
Debug.Print cP.GetLevenshteinDistance(subStr(0), subStr(1)) 'should return 1 if you used my modified Soundex function, otherwise it'll be zero
Set cP = Nothing
Last edited by FireXtol; May 12th, 2010 at 09:33 AM.