-
Feb 22nd, 2010, 07:26 AM
#1
Thread Starter
Fanatic Member
[RESOLVED] String metrics performance
Admins : please move the thread if you think its not in the right forum. thanks.
Hi all, I am having a performance issue so I thought I could just open this thread to get advices.
I am currently building an application that compares multiple strings (mainly company names) from tables in databases that can be written differently. I'am using Damerau–Levenshtein algorithm to do so.
The problem I have is that I can have 25000 strings to compares with each other so it represents 625 000 000 operations, each one running the algorithm.
lets say its quite long, like, WAY TOO LONG
I will for sure use multithread with a minimum of 2 threads a max of 4 I guess to make it faster.
So the question is simple : Does anyone has any other idea/method to optimize the performance of that kind of operation ?
Thanks a lot for any suggestion.
Last edited by stlaural; Feb 22nd, 2010 at 09:09 AM.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 22nd, 2010, 09:29 AM
#2
Re: String metrics performance
-
Feb 22nd, 2010, 09:34 AM
#3
Thread Starter
Fanatic Member
Re: String metrics performance
Yup, sorry I forgot to mention, its in C#.net actually but I didn't want to post exclusively on C# forum as this kind of optimisation could interest anyone.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 22nd, 2010, 09:37 AM
#4
Re: String metrics performance
Originally Posted by stlaural
Yup, sorry I forgot to mention, its in C#.net actually but I didn't want to post exclusively on C# forum as this kind of optimisation could interest anyone.
Yea, thats cool. Will leave here for a bit.
Thanks
Pino
-
Feb 22nd, 2010, 11:01 AM
#5
Re: String metrics performance
String comparisons are slower than pretty nearly any other, and you are doing a pretty large number of comparisons. You might benefit from profiling different parts of the algorithm to look for any glaring inefficiencies, but the link you provided shows a fairly tight comparison, so I wouldn't expect that the same results could be obtained via a much faster method. That would leave only two areas that appear worthy of exploration:
1) How you load the data. If you do this poorly, then performance will suffer, but there may not be a good way to go faster.
2) Any benefits that might be derived from the comparisons chosen. For example, if A is very similar to B, and C is very dissimilar to A, then C will be very dissimilar to B. If you need to know the exact level of dissimilarity, then this observation would do you no good. If, however, you want some threshold, then perhaps this could be used to pare down the number of comparisons.
My usual boring signature: Nothing
-
Feb 22nd, 2010, 01:11 PM
#6
Thread Starter
Fanatic Member
Re: String metrics performance
thanks for the answer Shaggy.
I know that String comparisons are the slowest one, but I thind it is possible to do what I want in a matter of minutes. I don't mind waiting 15min if it can save me hours in the end.
To answer your first point, I first load the data from a simple SELECT DISTINCT query into a listbox. Then for each entry of my list box, I normalize the company name (I take away accents, ponctuations etc.) and add it to an arraylist, which I think is fast enough. What is really long is the next step which is the comparison using the algorithm.
For now what I did to minimize the number of operations is that I never compare the same two string, so the first one is compared with the 24999 other, the second one with the 24998 others.... So I basically cut the number of operations in two.
What I thought I could do next is order the strings alphabeticaly and compare a string only with the other strings that start with the same character. I might lose a couple of matches but as these are all company names it shouldn't be so bad as Most of the time people don't do typo on first character
thanks again.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 22nd, 2010, 01:33 PM
#7
Re: String metrics performance
The listbox could be a point of concern, but if you are not writing to it very often, it would be a one time cost, which would be forgettable. A List would be superior to an ArrayList as long as you are using something from 2005 or later (List (of T) was only introduced with 2005). The performance won't matter much, if any, though, it is more an ease of use issue.
What is the ultimate goal? That algorithm is looking for similarity between names, and it sounds like you might be doing something considerably less demanding.
My usual boring signature: Nothing
-
Feb 22nd, 2010, 01:55 PM
#8
Thread Starter
Fanatic Member
Re: String metrics performance
Ultimately, this application will allow us to link data from multiple database, based on the company name field. The problem were having here where I work is that we have a lot of data in different system/database that are not linked by any field. We want to be able to do so using the company name field but it is not always entered the same way. This tool will allow us to build the Master company name list. The master will contain the Official company name with all the known ways that it has been entered in our different databases.
For example :
Code:
Company #1 inc. (Master name)
company 1 inc.
compani #1 incorporated.
COMPANY #1
Companu #1 inc
etc.
Then once we are ready to regroup our data and clean it, this list will allow us to do so.
the listbox is definitely not a problem for me, it take between 1 ans 4 seconds to fill and it's a one shot deal. I'll sure have a look at the List instead of ArrayList. Even if the gain is not noticable It will at least make my code better.
thanks a lot for the suggestions.
If anyone as other succestions, feel free to post.
Last edited by stlaural; Feb 22nd, 2010 at 02:05 PM.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 22nd, 2010, 02:30 PM
#9
Re: String metrics performance
This sounds like a good idea:
Originally Posted by stlaural
What I thought I could do next is order the strings alphabeticaly and compare a string only with the other strings that start with the same character. I might lose a couple of matches but as these are all company names it shouldn't be so bad as Most of the time people don't do typo on first character
...but there is definitely room for error there. For example, there are some words that start with S but sound like they start with C.
I would feel more comfortable not just comparing the same first letter, but also any others that sound similar. To do that you would need to create a list for each of the 26 letters to say which others it can sound like.
A similar thing would be to check the length of the strings, and skip the check if the difference is too big (eg: 15 characters vs 5).
One problem with this (and probably any other method) is that it is likely to ignore duplicates where one of them uses words like "the" an "of", but the other doesn't.
-
Feb 22nd, 2010, 02:54 PM
#10
Thread Starter
Fanatic Member
Re: String metrics performance
Originally Posted by si_the_geek
I would feel more comfortable not just comparing the same first letter, but also any others that sound similar. To do that you would need to create a list for each of the 26 letters to say which others it can sound like.
Right, that would help minimize the possible missed matches. I'll try with and without to see the differences in result.
Originally Posted by si_the_geek
A similar thing would be to check the length of the strings, and skip the check if the difference is too big (eg: 15 characters vs 5).
That also will help accelerate the process as the "distance" between those strings would be too high anyway to be considered a match. I'll also set the maximum difference in length based on the maximum allowed "distance" between strings to be considered a match.
Thanks a million guys, with this I should be able to reduce the process time significantly and keep a good level of accuracy.
I'll mark thread as resolved when everything is tested in a couple of days, meanwhile I'll see if anyone has other ideas and post mine if I have some new one.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 24th, 2010, 08:19 AM
#11
Thread Starter
Fanatic Member
Re: String metrics performance
Well, I finally tested most of the suggested modifications and my comparison process now runs in about 6 minutes on a dual core and should run much faster on a quad as it will start more threads.
Thanks to you all for your suggestions.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 24th, 2010, 09:56 AM
#12
Re: [RESOLVED] String metrics performance
Out of curiosity, how are you setting the number of threads to start? Is that something you add to the code, or are you detecting the number of processors?
My usual boring signature: Nothing
-
Feb 24th, 2010, 10:53 AM
#13
Thread Starter
Fanatic Member
Re: [RESOLVED] String metrics performance
I used this small piece of code :
vb.net Code:
Environment.ProcessorCount
then I fill a list with BackgroundWorkers base on the number that was returned.
I still have to test it on my personnal computer on which I have an Intel i7 920, I'm curious to see if it will retrun 4, as it have 4 cores or 8 as it uses Hyperthreading. Based on what I found on the web it should return the number of logical processors so on my i7 it should return 8.
Last edited by stlaural; Feb 24th, 2010 at 10:58 AM.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 24th, 2010, 11:31 AM
#14
Re: [RESOLVED] String metrics performance
Hunh. I was looking around for something like that, but what I was finding on line seemed to suggest that it would be innacurate. I guess I should give it a try, though.
My usual boring signature: Nothing
-
Feb 24th, 2010, 01:21 PM
#15
Thread Starter
Fanatic Member
Re: [RESOLVED] String metrics performance
Until now I have to say it works pretty fine, I handled the concurrency and tested everything on my laptop, which has an Intel Centrino dual core and the application starts 2 Backgroundworkers as expected as there is no Hyper Threading. I'll keep you posted after testing with my i7 this weekend.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Feb 27th, 2010, 02:49 PM
#16
Thread Starter
Fanatic Member
Re: [RESOLVED] String metrics performance
Tested it on my personnal PC with i7 and Environment.ProcessorCount returns 8 as expected.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Mar 3rd, 2010, 02:36 PM
#17
Thread Starter
Fanatic Member
Re: [RESOLVED] String metrics performance
FYI
I thought that comparing only strings that start with the same character would same me some time, as it would eliminate a lot of loops but after doing some searches and test I figured out that the substring method was way too time consuming. When I use it, comparing 25000 strings with each other took about 8 to 9 minutes on a dual core laptop which is not too bad compared to the first attemped i made, but when I don't use it it takes about 8 secondes !
I think I reached a good performance level.
thanks to anyone who helped.
Alex
.NET developer
"No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)
Things to consider before posting.
Don't forget to rate the posts if they helped and mark thread as resolved when they are.
.Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
My fresh new blog : writingthecode, even if I don't post much.
System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0
-
Mar 3rd, 2010, 04:14 PM
#18
Re: [RESOLVED] String metrics performance
String operations are about as slow as anything you can do. Unfortunate thing, that, as we tend to do them plenty.
My usual boring signature: Nothing
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|