Thread: Matching Names - Levenshtein / Fuzzy Match

  1. #1

    Thread Starter
    Member
    Join Date
    Feb 2007
    Location
    Netherlands
    Posts
    43

    Matching Names - Levenshtein / Fuzzy Match

    Hello all,

    Is there any component out there that does fuzzy matching, especially on names? I've found some and coded some myself, but I'm looking for a more advanced one that takes into account, for example, a missing middle name. Most of the ones I've seen only score a 'match' on the number of matching characters. I'm looking for a free one, since this is not a commercial project and the investment is (probably) not worth it.

    Regards, Jape
    Last edited by Jape; Mar 18th, 2007 at 12:20 AM.

  2. #2

    Re: Matching Names - Levenshtein / Fuzzy Match

    OK, after some searching I finally figured out that the Levenshtein algorithm was what I was looking for. Works great. But what is the best way to match up names when the names provided come in a single line, in different formats, for example:

    John Smith
    Smith, John
    SMITH John
    Harry Chris Donaldson
    Donaldson, Chris Harry
    Donaldson Chris Harry etc. etc.

    I have to match around 40,000 names, formatted like the ones above, against an SQL DB with a couple of thousand names in it. Not all of the names will be in the SQL DB. I have already done this for sports team names as described below, but 'human' names will require a different approach. In particular I'm trying to tackle the double first name problem (I can't do a simple .lastindexof(" ") to swap first and last name, for example) while keeping the code as efficient as possible.
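
    One possible way to sidestep the first/last name ordering problem is to compare names as unordered sets of words instead of trying to work out which word is the surname. The sketch below is in Python rather than .NET purely for brevity; `name_tokens`, `name_key` and `same_person_candidate` are hypothetical helper names, and the rule "at least two shared words, one name contained in the other" is an assumption for tolerating a missing middle name, not an established standard.

```python
import re

def name_tokens(raw):
    """Lowercase a raw name and split it into words, ignoring commas and extra spaces."""
    # [^\W\d_]+ matches runs of letters only, so "Smith, John" -> ["smith", "john"].
    return re.findall(r"[^\W\d_]+", raw.lower())

def name_key(raw):
    """Order-insensitive key: the sorted words joined by single spaces."""
    return " ".join(sorted(name_tokens(raw)))

def same_person_candidate(a, b):
    """True when the names share at least two words and one is contained in the other.

    Assumption: a missing middle name shows up as one word set being a
    subset of the other.
    """
    ta, tb = set(name_tokens(a)), set(name_tokens(b))
    return len(ta & tb) >= 2 and (ta <= tb or tb <= ta)

if __name__ == "__main__":
    for raw in ["John Smith", "Smith, John", "SMITH John"]:
        print(repr(raw), "->", repr(name_key(raw)))
    print(same_person_candidate("Harry Donaldson", "Donaldson, Chris Harry"))
```

    With this key, "John Smith", "Smith, John" and "SMITH John" all collapse to the same string, so an exact comparison is enough before any fuzzy matching is tried. Hyphenated or apostrophised names (split into separate words here) would need extra handling.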

    The way I'm doing sports team names at the moment is like this (the code is a little long to post):
    1 - Trim any leading or trailing spaces and replace any whitespace and dividing characters (-) with wildcards
    2 - Try a [SELECT Teamname FROM Teams WHERE Teamname LIKE '%MyCleanedTeamName%'] (very fast)

    --> If the number of results is 0 or more than 1:

    3 - Check for custom/user-entered synonyms for the names (another SELECT...)

    --> If the number of results is 0 or more than 1:

    4 - Retrieve all names from the SQL DB (SELECT Teamname FROM Teams)
    5 - Remove any unnecessary whitespace, dividing characters ("-") and common words ("TEAM", "THE", "OF", etc.)
    6 - Get the Levenshtein distance of the name against every record in the DB (very slow!)
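
    The steps above can be sketched roughly as follows. This is an illustration only, not the actual code: it is written in Python with an in-memory SQLite database standing in for MSSQL 2005, the synonym lookup of step 3 is left out, and `like_pattern`, `simplify`, `find_team` and the `max_distance` cut-off are hypothetical names and values.

```python
import re
import sqlite3

STOP_WORDS = {"TEAM", "THE", "OF"}  # the common words named in step 5

def levenshtein(a, b):
    """Edit distance using the classic two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def like_pattern(name):
    """Steps 1-2: trim, then turn runs of whitespace and '-' into LIKE wildcards."""
    return "%" + re.sub(r"[\s\-]+", "%", name.strip()) + "%"

def simplify(name):
    """Step 5: uppercase, collapse dividers, drop common words."""
    words = re.sub(r"[\s\-]+", " ", name.upper()).split()
    return " ".join(w for w in words if w not in STOP_WORDS)

def find_team(conn, name, max_distance=3):
    """Cheap LIKE query first; fall back to Levenshtein over every row."""
    rows = conn.execute(
        "SELECT Teamname FROM Teams WHERE Teamname LIKE ?", (like_pattern(name),)
    ).fetchall()
    if len(rows) == 1:
        return rows[0][0]
    # Step 3 (user-entered synonyms) would go here; omitted in this sketch.
    target = simplify(name)
    best_name, best_dist = None, max_distance + 1
    for (candidate,) in conn.execute("SELECT Teamname FROM Teams"):  # step 4
        dist = levenshtein(target, simplify(candidate))              # step 6
        if dist < best_dist:
            best_name, best_dist = candidate, dist
    return best_name  # None when nothing is within max_distance

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Teams (Teamname TEXT)")
    conn.executemany("INSERT INTO Teams VALUES (?)",
                     [("Ajax Amsterdam",), ("PSV Eindhoven",)])
    print(find_team(conn, " ajax-amsterdam "))
```

    The fallback loop is the expensive part: every unmatched name costs one distance computation per row in the table.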

    What would be an efficient way of doing this with names, where there are about 25 times more occurrences to look up, which makes it prone to slowing the program down a lot?
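
    One direction that might cut the cost, sketched here in Python under some assumptions: load the couple of thousand DB names into memory once, try an exact lookup on a normalised key first, and only run Levenshtein against candidates whose key length is close enough. The length filter relies on the fact that the Levenshtein distance between two strings is never smaller than the difference in their lengths. `NameIndex`, `normalise` and the `max_distance` default are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Edit distance using a two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def normalise(raw):
    """Lowercase, drop commas, and sort the words so word order does not matter."""
    return " ".join(sorted(raw.lower().replace(",", " ").split()))

class NameIndex:
    """In-memory index over the DB names, built once and reused for every lookup."""

    def __init__(self, names):
        self.exact = {}                      # normalised key -> original name
        self.by_length = defaultdict(list)   # key length -> [(key, original name)]
        for name in names:
            key = normalise(name)
            self.exact.setdefault(key, name)
            self.by_length[len(key)].append((key, name))

    def lookup(self, raw, max_distance=2):
        key = normalise(raw)
        if key in self.exact:                # exact hit, no distance needed
            return self.exact[key]
        best, best_dist = None, max_distance + 1
        # Keys whose length differs by more than max_distance cannot be within it.
        for length in range(len(key) - max_distance, len(key) + max_distance + 1):
            for cand_key, name in self.by_length.get(length, ()):
                dist = levenshtein(key, cand_key)
                if dist < best_dist:
                    best, best_dist = name, dist
        return best                          # None when nothing is close enough

if __name__ == "__main__":
    index = NameIndex(["John Smith", "Donaldson, Chris Harry"])
    for raw in ["SMITH John", "Jon Smith", "Harry Chris Donaldsen"]:
        print(repr(raw), "->", index.lookup(raw))
```

    How much this helps depends on the data: names that differ only in word order or punctuation never reach the distance step, while a typo in the first letter of a word can change the sort order and make the keys look further apart than they are.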

    [VS2005/MSSQL 2005]
    Last edited by Jape; Mar 18th, 2007 at 01:07 AM.
