Matching Names - Levenshtein / Fuzzy Match
Hello all,
Is there a component out there that does fuzzy matching, especially for names? I've found some and coded some myself, but I'm looking for a more advanced one that takes into account, for example, a missing middle name. Most I've seen only score a 'match' on the number of matching characters. I'm looking for a free one, since this is not a commercial project and the investment is (probably) not worth it.
Regards, Jape
Re: Matching Names - Levenshtein / Fuzzy Match
OK, after some searching I finally figured out that it was the Levenshtein algorithm I was looking for. It works great. But what is the best way to match up names when they come in on a single line, in different formats, for example:
John Smith
Smith, John
SMITH John
Harry Chris Donaldson
Donaldson, Chris Harry
Donaldson Chris Harry etc. etc.
I have to match around 40,000 names, formatted like the ones above, against an SQL DB with a couple of thousand names in it. Not all of the names will be in the SQL DB. I have already done this for sports team names, as described below, but 'human' names will require a different approach. I'm especially trying to tackle the problem of double first names (I can't do a simple .lastindexof(" ") to switch first and last name, for example) while keeping the code as efficient as possible.
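One way to sidestep the name-order problem is to stop guessing which token is the last name and compare the set of tokens instead. The following is only a minimal sketch in Python, not tied to any particular library; the function names `name_key` and `same_person_candidate` are made up for illustration, and the rule "at least two shared tokens" is an assumption that would need tuning against real data.

```python
import re

def name_key(raw):
    """Return an order-independent key: lowercase tokens, sorted."""
    # Commas and other punctuation act as separators,
    # so "Smith, John" splits the same way as "Smith John".
    tokens = re.split(r"[^\w]+", raw.lower())
    return " ".join(sorted(t for t in tokens if t))

def same_person_candidate(a, b, min_shared=2):
    """True if one name's tokens are contained in the other's.

    This tolerates a missing middle name. min_shared is a
    hypothetical threshold that keeps "John" alone from matching
    every name that contains "John".
    """
    ta, tb = set(name_key(a).split()), set(name_key(b).split())
    if len(ta & tb) < min_shared:
        return False
    return ta <= tb or tb <= ta
```

With this, "John Smith", "Smith, John" and "SMITH John" all produce the key "john smith", so an exact lookup on the key already covers the pure reordering cases. Note that hyphens and apostrophes also split tokens here ("Jean-Pierre" becomes two tokens), which may or may not be what you want.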
The way I'm doing sports team names at the moment is like this (the code is a little long to post):
1 - Trim any leading or trailing spaces and replace any whitespace and dividing characters (-) with wildcards
2 - Try a [SELECT Teamname FROM Teams WHERE Teamname LIKE '%MyCleanedTeamName%'] (very fast)
--> IF NumberOFResults = 0 or > 1
3 - Check for custom/user entered Synonyms for the names (Another Select.....)
--> IF NumberOFResults = 0 or > 1
4 - Retrieve all names from the SQL DB (SELECT Teamname FROM Teams)
5 - Remove any unnecessary whitespace, dividing characters ("-") and common words ("TEAM", "THE", "OF") etc.
6 - Get Levenshtein Distance of name against every record in DB (Very slow!)
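For reference, step 6 can be sketched roughly as follows. This is a generic Python illustration of the Levenshtein distance with a two-row table, not the code actually used in the project; `best_match` and its `max_distance` parameter are hypothetical names. The only optimisation shown is skipping candidates whose length differs too much, since the length difference is a lower bound on the edit distance.

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def best_match(query, candidates, max_distance=2):
    """Return the closest candidate within max_distance, or None."""
    best, best_d = None, max_distance + 1
    for c in candidates:
        # The length difference is a lower bound on the distance,
        # so this candidate cannot beat the current best.
        if abs(len(c) - len(query)) >= best_d:
            continue
        d = levenshtein(query, c)
        if d < best_d:
            best, best_d = c, d
    return best
```

Every comparison costs time proportional to the product of the two string lengths, which is why running it against every record for every lookup gets slow.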
What would be an efficient way of doing this with names, where there are about 25 times more occurrences to look up, which makes it prone to slowing the program down a lot?
[VS2005/MSSQL 2005]
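One common way to cut down the number of Levenshtein comparisons is to load the database names once and index them by token, so that each incoming name is only compared against records that share at least one token with it. The sketch below is in Python rather than .NET and is only an illustration under that assumption; `tokens`, `build_index` and `candidates` are invented names, and the distance function to run on the reduced candidate set is left open.

```python
import re
from collections import defaultdict

def tokens(raw):
    """Lowercase word tokens of a name, punctuation ignored."""
    return [t for t in re.split(r"[^\w]+", raw.lower()) if t]

def build_index(db_names):
    """Map each token to the set of database names containing it."""
    index = defaultdict(set)
    for name in db_names:
        for t in tokens(name):
            index[t].add(name)
    return index

def candidates(query, index, min_shared=1):
    """Database names sharing at least min_shared tokens with the query."""
    counts = defaultdict(int)
    for t in set(tokens(query)):
        for name in index.get(t, ()):
            counts[name] += 1
    return {name for name, n in counts.items() if n >= min_shared}
```

Usage would look something like this:

```python
index = build_index(["John Smith", "Harry Chris Donaldson", "Jane Smithers"])
print(candidates("SMITH John", index))
print(candidates("Donaldson, Harry", index, min_shared=2))
```

The limitation is that a name with a typo in every token shares no token with its intended record and would still need the slow full scan as a last resort.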