Feb 22nd, 2010, 02:30 PM
#7
Re: String metrics performance
This sounds like a good idea:
 Originally Posted by stlaural
What I thought I could do next is order the strings alphabetically and compare a string only with the other strings that start with the same character. I might lose a couple of matches, but as these are all company names it shouldn't be so bad, as most of the time people don't make typos on the first character.
...but there is definitely room for error there. For example, there are some words that start with S but sound like they start with C.
I would feel more comfortable not just comparing the same first letter, but also any others that sound similar. To do that you would need to create a list for each of the 26 letters to say which others it can sound like.
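Here is a rough sketch of that idea in Python (just for illustration, the same logic ports to any language). The letter table and the function names are my own made-up examples, not a complete phonetic mapping, so you would need to fill in the table for your own data:

```python
from itertools import combinations

# Hypothetical "sounds like" table: only a few letters are filled in here.
# Each key maps to the set of other first letters it can be confused with.
SOUNDS_LIKE = {
    "C": {"K", "S"},
    "K": {"C", "Q"},
    "F": {"P"},   # e.g. "Ph..." vs "F..."
    "G": {"J"},
}

def first_letters_compatible(a, b):
    """True if the two names start with the same or a similar-sounding letter."""
    x, y = a[:1].upper(), b[:1].upper()
    # Check the table in both directions so it does not have to be symmetric.
    return (x == y
            or y in SOUNDS_LIKE.get(x, set())
            or x in SOUNDS_LIKE.get(y, set()))

def candidate_pairs(names):
    """Pairs that are worth passing to the (expensive) string metric."""
    return [(a, b) for a, b in combinations(names, 2)
            if first_letters_compatible(a, b)]
```

This still looks at every pair, but the first-letter test is very cheap compared to the string metric itself, so only the surviving pairs cost real time.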
A similar thing would be to check the lengths of the two strings, and skip the comparison if the difference is too big (e.g. 15 characters vs 5).
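The length check could look something like this (Python again, purely as a sketch). The function name and the cut-off of 5 are assumptions on my part; you would want to pick the threshold by looking at your real data:

```python
def lengths_compatible(a, b, max_diff=5):
    """True if the two strings are close enough in length to be worth comparing.

    max_diff is an assumed threshold, not a recommended value.
    """
    return abs(len(a) - len(b)) <= max_diff
```

With that cut-off, a 15 character name and a 5 character name differ by 10, so the pair is skipped before the string metric is ever called.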
One problem with this (and probably any other method) is that it is likely to miss duplicates where one of them uses words like "the" and "of", but the other doesn't.
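One possible way to reduce that problem is to strip those filler words before doing any of the checks. A minimal sketch in Python, where the word list and the function name are only examples I have made up:

```python
# Hypothetical filler-word list: extend it to suit your company names.
STOP_WORDS = {"the", "of"}

def normalize(name):
    """Lower-case the name and drop filler words before comparing."""
    words = name.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)
```

For example, "The Bank of Scotland" and "Bank Scotland" both become "bank scotland". It also means a leading "The" no longer decides which first letter the name gets grouped under. Note that it does not deal with punctuation, so "The," with a comma would not be removed.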