Results 1 to 18 of 18

Thread: [RESOLVED] String metrics performance

  1. #1

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Resolved [RESOLVED] String metrics performance

    Admins : please move the thread if you think its not in the right forum. thanks.

    Hi all, I am having a performance issue so I thought I could just open this thread to get advices.


    I am currently building an application that compares multiple strings (mainly company names) from tables in databases that can be written differently. I'am using Damerau–Levenshtein algorithm to do so.

    The problem I have is that I can have 25000 strings to compares with each other so it represents 625 000 000 operations, each one running the algorithm.

    lets say its quite long, like, WAY TOO LONG

    I will for sure use multithread with a minimum of 2 threads a max of 4 I guess to make it faster.

    So the question is simple : Does anyone has any other idea/method to optimize the performance of that kind of operation ?

    Thanks a lot for any suggestion.
    Last edited by stlaural; Feb 22nd, 2010 at 09:09 AM.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  2. #2
    Super Moderator
    Join Date
    Dec 2003
    Posts
    4,787

    Re: String metrics performance

    Alex,

    Is this in .net?

  3. #3

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: String metrics performance

    Yup, sorry I forgot to mention, its in C#.net actually but I didn't want to post exclusively on C# forum as this kind of optimisation could interest anyone.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  4. #4
    Super Moderator
    Join Date
    Dec 2003
    Posts
    4,787

    Re: String metrics performance

    Quote Originally Posted by stlaural View Post
    Yup, sorry I forgot to mention, its in C#.net actually but I didn't want to post exclusively on C# forum as this kind of optimisation could interest anyone.
    Yea, thats cool. Will leave here for a bit.

    Thanks

    Pino

  5. #5
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    36,576

    Re: String metrics performance

    String comparisons are slower than pretty nearly any other, and you are doing a pretty large number of comparisons. You might benefit from profiling different parts of the algorithm to look for any glaring inefficiencies, but the link you provided shows a fairly tight comparison, so I wouldn't expect that the same results could be obtained via a much faster method. That would leave only two areas that appear worthy of exploration:

    1) How you load the data. If you do this poorly, then performance will suffer, but there may not be a good way to go faster.

    2) Any benefits that might be derived from the comparisons chosen. For example, if A is very similar to B, and C is very dissimilar to A, then C will be very dissimilar to B. If you need to know the exact level of dissimilarity, then this observation would do you no good. If, however, you want some threshold, then perhaps this could be used to pare down the number of comparisons.
    My usual boring signature: Nothing

  6. #6

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: String metrics performance

    thanks for the answer Shaggy.

    I know that String comparisons are the slowest one, but I thind it is possible to do what I want in a matter of minutes. I don't mind waiting 15min if it can save me hours in the end.

    To answer your first point, I first load the data from a simple SELECT DISTINCT query into a listbox. Then for each entry of my list box, I normalize the company name (I take away accents, ponctuations etc.) and add it to an arraylist, which I think is fast enough. What is really long is the next step which is the comparison using the algorithm.

    For now what I did to minimize the number of operations is that I never compare the same two string, so the first one is compared with the 24999 other, the second one with the 24998 others.... So I basically cut the number of operations in two.

    What I thought I could do next is order the strings alphabeticaly and compare a string only with the other strings that start with the same character. I might lose a couple of matches but as these are all company names it shouldn't be so bad as Most of the time people don't do typo on first character

    thanks again.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  7. #7
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    36,576

    Re: String metrics performance

    The listbox could be a point of concern, but if you are not writing to it very often, it would be a one time cost, which would be forgettable. A List would be superior to an ArrayList as long as you are using something from 2005 or later (List (of T) was only introduced with 2005). The performance won't matter much, if any, though, it is more an ease of use issue.

    What is the ultimate goal? That algorithm is looking for similarity between names, and it sounds like you might be doing something considerably less demanding.
    My usual boring signature: Nothing

  8. #8

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: String metrics performance

    Ultimately, this application will allow us to link data from multiple database, based on the company name field. The problem were having here where I work is that we have a lot of data in different system/database that are not linked by any field. We want to be able to do so using the company name field but it is not always entered the same way. This tool will allow us to build the Master company name list. The master will contain the Official company name with all the known ways that it has been entered in our different databases.

    For example :
    Code:
    Company #1 inc. (Master name)
          company 1 inc.
          compani  #1 incorporated.
          COMPANY #1
          Companu #1 inc
          etc.
    Then once we are ready to regroup our data and clean it, this list will allow us to do so.

    the listbox is definitely not a problem for me, it take between 1 ans 4 seconds to fill and it's a one shot deal. I'll sure have a look at the List instead of ArrayList. Even if the gain is not noticable It will at least make my code better.

    thanks a lot for the suggestions.

    If anyone as other succestions, feel free to post.
    Last edited by stlaural; Feb 22nd, 2010 at 02:05 PM.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  9. #9
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    41,727

    Re: String metrics performance

    This sounds like a good idea:
    Quote Originally Posted by stlaural View Post
    What I thought I could do next is order the strings alphabeticaly and compare a string only with the other strings that start with the same character. I might lose a couple of matches but as these are all company names it shouldn't be so bad as Most of the time people don't do typo on first character
    ...but there is definitely room for error there. For example, there are some words that start with S but sound like they start with C.

    I would feel more comfortable not just comparing the same first letter, but also any others that sound similar. To do that you would need to create a list for each of the 26 letters to say which others it can sound like.


    A similar thing would be to check the length of the strings, and skip the check if the difference is too big (eg: 15 characters vs 5).

    One problem with this (and probably any other method) is that it is likely to ignore duplicates where one of them uses words like "the" an "of", but the other doesn't.

  10. #10

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: String metrics performance

    Quote Originally Posted by si_the_geek View Post
    I would feel more comfortable not just comparing the same first letter, but also any others that sound similar. To do that you would need to create a list for each of the 26 letters to say which others it can sound like.
    Right, that would help minimize the possible missed matches. I'll try with and without to see the differences in result.

    Quote Originally Posted by si_the_geek View Post
    A similar thing would be to check the length of the strings, and skip the check if the difference is too big (eg: 15 characters vs 5).
    That also will help accelerate the process as the "distance" between those strings would be too high anyway to be considered a match. I'll also set the maximum difference in length based on the maximum allowed "distance" between strings to be considered a match.

    Thanks a million guys, with this I should be able to reduce the process time significantly and keep a good level of accuracy.
    I'll mark thread as resolved when everything is tested in a couple of days, meanwhile I'll see if anyone has other ideas and post mine if I have some new one.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  11. #11

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: String metrics performance

    Well, I finally tested most of the suggested modifications and my comparison process now runs in about 6 minutes on a dual core and should run much faster on a quad as it will start more threads.

    Thanks to you all for your suggestions.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  12. #12
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    36,576

    Re: [RESOLVED] String metrics performance

    Out of curiosity, how are you setting the number of threads to start? Is that something you add to the code, or are you detecting the number of processors?
    My usual boring signature: Nothing

  13. #13

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: [RESOLVED] String metrics performance

    I used this small piece of code :
    vb.net Code:
    1. Environment.ProcessorCount
    then I fill a list with BackgroundWorkers base on the number that was returned.

    I still have to test it on my personnal computer on which I have an Intel i7 920, I'm curious to see if it will retrun 4, as it have 4 cores or 8 as it uses Hyperthreading. Based on what I found on the web it should return the number of logical processors so on my i7 it should return 8.
    Last edited by stlaural; Feb 24th, 2010 at 10:58 AM.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  14. #14
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    36,576

    Re: [RESOLVED] String metrics performance

    Hunh. I was looking around for something like that, but what I was finding on line seemed to suggest that it would be innacurate. I guess I should give it a try, though.
    My usual boring signature: Nothing

  15. #15

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: [RESOLVED] String metrics performance

    Until now I have to say it works pretty fine, I handled the concurrency and tested everything on my laptop, which has an Intel Centrino dual core and the application starts 2 Backgroundworkers as expected as there is no Hyper Threading. I'll keep you posted after testing with my i7 this weekend.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  16. #16

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: [RESOLVED] String metrics performance

    Tested it on my personnal PC with i7 and Environment.ProcessorCount returns 8 as expected.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  17. #17

    Thread Starter
    Fanatic Member stlaural's Avatar
    Join Date
    Sep 2007
    Location
    Quebec, Canada
    Posts
    683

    Re: [RESOLVED] String metrics performance

    FYI

    I thought that comparing only strings that start with the same character would same me some time, as it would eliminate a lot of loops but after doing some searches and test I figured out that the substring method was way too time consuming. When I use it, comparing 25000 strings with each other took about 8 to 9 minutes on a dual core laptop which is not too bad compared to the first attemped i made, but when I don't use it it takes about 8 secondes !

    I think I reached a good performance level.

    thanks to anyone who helped.
    Alex
    .NET developer
    "No. Not even in the face of Armageddon. Never compromise." (Walter Kovacs/Rorschach)

    Things to consider before posting.
    Don't forget to rate the posts if they helped and mark thread as resolved when they are.


    .Net Regex Syntax (Scripting) | .Net Regex Language Element | .Net Regex Class | DateTime format | Framework 4.0: what's new
    My fresh new blog : writingthecode, even if I don't post much.

    System: Intel i7 920, Kingston SSDNow V100 64gig, HDD WD Caviar Black 1TB, External WD "My Book" 500GB, XFX Radeon 4890 XT 1GB, 12 GBs Tri-Channel RAM, 1x27" and 1x23" LCDs, Windows 10 x64, ]VS2015, Framework 3.5 and 4.0

  18. #18
    Super Moderator Shaggy Hiker's Avatar
    Join Date
    Aug 2002
    Location
    Idaho
    Posts
    36,576

    Re: [RESOLVED] String metrics performance

    String operations are about as slow as anything you can do. Unfortunate thing, that, as we tend to do them plenty.
    My usual boring signature: Nothing

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width