Results 1 to 29 of 29

Thread: Fastest method of removing duplicates from an array or text file?

  1. #1

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Fastest method of removing duplicates from an array or text file?

    What's the fastest method of removing duplicates from an array or text file?

    The method I'm using right now is reading the file then adding what isn't found in a string into that string using InstrB.

    With very few duplicates (like.. 10), it's very fast, in a 3.5 million list, it takes less than 3 seconds on my computer.

    With a lot of duplicates, it just dies, freezes.
    While a have a program written in C or something that demolishes it in ~10 seconds..

    Having InstrB function read through large amounts of data in a loop, looking for every word doesn't work out well with a lot of dupes.


    There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?
    ......

  2. #2
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by ICESTORM View Post
    There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?
    This is the way to go.

    Instead of sorting the list directly, you'd create an index array and sort that. You could then use the index array to identify which dups get removed.

    All told it would be pretty darn quick. Is your array just a simple single-dimension string array? Does case matter? ("A" = "a" or "A" <> "a") Let me know the answers to these questions and I can help you out.

  3. #3

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    Single dimension.
    Can there be an option of Case-matter on/off?

    Thank you.
    ......

  4. #4
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by ICESTORM View Post
    Single dimension.
    Can there be an option of Case-matter on/off?
    It's a whole lot easier to just use Option Compare Text to make it case-insensitive, but if you want it to be a selectable option I can set it up as a parameter.

  5. #5

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    Uh, Easier way then.
    ......

  6. #6
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Actually, how fast is a collection for this? It's super simple, but I suspect it may be quite slow.
    Code:
    Public Sub Test()
        Const Text = "Now is the time for all good men to come to the aid of their country"
        Dim col As Collection
        Dim strTest() As String
        Dim i As Long
        
        strTest = Split(Text, " ")
        Set col = New Collection
        On Error Resume Next
        For i = 0 To UBound(strTest)
            col.Add strTest(i), strTest(i)
        Next
        On Error GoTo 0
        ReDim strTest(col.Count - 1)
        For i = 1 To col.Count
            strTest(i - 1) = col(i)
        Next
        Set col = Nothing
        Debug.Print Text
        Debug.Print Join(strTest, " ")
    End Sub
    Give it a try and let me know.

  7. #7

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    O.o Should've tried that before, used it to remove dups from a listbox.
    Better than what I had before.

    Split() function is slow so I used MidB$ and InstrB$ to parse out each item file the file instead.
    Merri's QuickSplit is a lot faster also

    With the list I tested, it's 25 times slower than another app.
    Though, 90&#37; of that time is spent outputting to a textfile..
    Putting the Collection items back into the array took about the same amount of time as writing to a file.

    Takes 80 seconds for a 75MB text file just to add items to a collection.
    Other app counts, reads, sorts, remove dupes, unsorts, and writes over file in 10 seconds.
    Last edited by ICESTORM; May 10th, 2009 at 08:26 PM.
    ......

  8. #8
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Actually, that collection technique is a non-starter. It's pretty peppy with lots of duplicates, but with few duplicates it takes a year and a day.

    I'm finishing up a faster alternative.

  9. #9
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by ICESTORM View Post
    Merri's QuickSort is a lot faster also
    What do you mean?

  10. #10

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    Oops, meant the "QuickSplit" function that's meant to replace Split()
    ......

  11. #11
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    I ran into problems with the index array approach. I had it all set up perfectly, but it wasn't sorting properly. After staring at it for a while I gave up and rewrote it using a different approach.

    This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array. Go through the original array, looking up each entry in the sorted array using a binary search. Use the boolean array to ensure that each element is only used once. All in all it's around 25-35 for seconds for 3.5 million elements on my machine, depending on how many dups there are. (The more dups there are the faster it is.)

    There are many ways this exact approach can be improved; I use very slow techniques throughout. I've no doubt someone like Merri could supercharge it pretty effectively. In the meantime, I'll take another stab at the indexed array approach, which really should be faster than 3.5 million binary searches. Maybe Logophobic could set it up if I fail.

    I've attached a benchmark program with my solution in case anyone wants to implement their own technique or optimize mine.
    Attached Files Attached Files

  12. #12
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    One way that approach can be improved is how I handle removing duplicates. I used the very simple method Merri posted at the end of this thread, which is much simpler and slower than other solutions in that thread.

    Another is how I copy string arrays using simple assignment:

    strDest = strSource

    IIRC, that's a slow method.

  13. #13

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    bump.

    I think your remove duplicate from array might be ignoring the first item in the array. It compares the 2nd item to the 1st, the 1st item never gets a chance to be added into the new array without duplicates, even if it is unique.
    ......

  14. #14
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Ellis, et al., why not sort the array first, as fast as possible. (Shell, Comb, or Jump sorts are the easiest and fastest.)

    Then just examine each item sequentially and remove the next element from the array if it matches the current one?

    IceStorm, would you like to see simple code that will do this rather painlessly?
    Last edited by Code Doc; May 15th, 2009 at 07:28 PM.
    Doctor Ed

  15. #15

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    Is there some way to get a reference/create a ref/index
    Create an array of indexes that's sorted.

    List
    1. B
    2. A
    3. C
    4. E
    5. F
    6. D
    7. H
    8. G

    Sorted
    1. A (Index 2)
    2. B (Index 1)
    3. C (Index 3)
    4. D (Index 6)
    5. E (Index 4)
    6. F (Index 5)
    7. G (Index 8)
    8. H (Index 7)

    Then use indexes to reset the order.



    Other option would maybe be a hash table.
    Lookup Index from item...
    Wonder if it's going to be slow though.
    Last edited by ICESTORM; May 15th, 2009 at 08:48 PM.
    ......

  16. #16
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed.

    However, this is not your original post question, which asked for eliminating duplicates elements within an array.
    Doctor Ed

  17. #17

    Thread Starter
    Addicted Member
    Join Date
    Jul 2004
    Location
    cali
    Posts
    243

    Re: Fastest method of removing duplicates from an array or text file?

    "Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed."
    How do I do that?

    Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.
    ......

  18. #18
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by ICESTORM View Post
    How do I do that?

    Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.
    When you remove the element of the sorted array and repack, you can tag the corresponding element of the original referenced array (or convert it to a null or zero--something unique). Then, when it is time to display or save the referenced array, the tagged or nulled out elements are removed.

    BTW, it is a bit confusing. I wrote code to do all this about 12 years ago and it did drive me bonkers. I was actually tracking about 10 variables simultaneously and removing elements from them all based on the components of one variable. Then the original elements of all the variables had to be saved back in the order in which they were assembled.
    Doctor Ed

  19. #19
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by Code Doc View Post
    Ellis, et al., why not sort the array first, as fast as possible. (Shell, Comb, or Jump sorts are the easiest and fastest.)

    Then just examine each item sequentially and remove the next element from the array if it matches the current one?
    ???

    That's exactly what I do in the posted solution. To quote: "This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array."

    I went with quicksort because it's (much) faster then the algorithms you mentioned.

    The real trick is returning to the original order, which is what the boolean array is for.

  20. #20
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Ellis said, "I went with quicksort because it's (much) faster then the algorithms you mentioned.

    The real trick is returning to the original order, which is what the boolean array is for."
    ---------------
    Agreed, I read back that he has a huge number of items to sort. So, the QuickSort would be the best in that case. Carrying the indexed array along for the ride is the part that is tricky.

    I suppose we should have provided some sample code using a small list of random data to get OP started, but I got sidetracked.
    Doctor Ed

  21. #21
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Ellis, you will probably shoot me for this. Here's a way of removing duplicates without sorting them first:
    Code:
    Const ListSize = 9000
    Private Sub Command1_Click()
    Dim MyData(ListSize) As String, NoDups(ListSize) As String, RefArr(ListSize) As Byte, Pointer As Long
    ' Generate Sample Data
    For I = 1 To ListSize
        MyData(I) = Format$(CStr(Int(Rnd * ListSize) + 1), "0000")
        RefArr(I) = 1
    Next
    ' Build Non-Duplicate Array
    For I = 1 To ListSize - 1
        If RefArr(I) Then
            Pointer = Pointer + 1
            NoDups(Pointer) = MyData(I)
            For J = I + 1 To ListSize
                If MyData(I) = MyData(J) Then RefArr(J) = 0
            Next
        End If
    Next
    End Sub
    I imagine it would be slow for millions of strings, but for less than 10,000 it runs rather well. Note that the comparisons are reduced using the byte array check, and that increases the speed somewhat, especially if there are numerous matches. The nun-duplicate array is produced in the same sequence as the original data array, and that could be important.
    Doctor Ed

  22. #22
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    With the OP's spec of 3.5 million items, if there are no dups your method will have to do...

    3,500,000 + 3,499,999 + 3,499,998 + ... + 1 = 3,062,501,750,000

    ...comparisons. Three trillion string comparisons is probably not the quickest way to go.

    On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)

  23. #23
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by Ellis Dee View Post
    With the OP's spec of 3.5 million items, if there are no dups your method will have to do...

    3,500,000 + 3,499,999 + 3,499,998 + ... + 1 = 3,062,501,750,000

    ...comparisons. Three trillion string comparisons is probably not the quickest way to go.

    On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)
    Correct, I am aware of that. On the other hand, if there are lots of dups, the routine runs rather efficiently--the more the merrier. I'm starting to think that there is no super fast way of doing this, especially with a huge array and very few dups.
    Doctor Ed

  24. #24
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: Fastest method of removing duplicates from an array or text file?

    This project will interest you: MergeSort optimization.

    It is a class module (StringSort.cls) that features MergeSort and CombSort. CombSort sorts the given array. MergeSort instead returns an index array, thus leaving the original array as it is. You can then use the index to work with this duplication business you have here.

    Internally the class module includes several optimizations, including a very fast ASM binary comparison function for strings. String array is faked to Long array for fast access to string pointers. It pretty much avoids everything that normally would slow things down behind your back.
    Last edited by Merri; May 22nd, 2009 at 07:54 PM.

  25. #25
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: Fastest method of removing duplicates from an array or text file?

    Double post, but since I went ahead and added my own new duplicate remover to Ellis Dee's benchmarker I thought to post the results.

    With millions of duplicates FilterDuplicates is two times faster than Collection method. I don't know how fast the Array method is in comparison since I removed the benchmarking code for that. Collection 18.5 seconds and FilterDuplicates 9 seconds.

    For no duplicates Array method gave me some 42 seconds, while FilterDuplicates runs at 6 seconds.

    Here is the function (part of StringSort.cls so it does not run standalone):
    Code:
    Public Sub FilterDuplicates(SA() As String, Result() As String, Optional Compare As VbSortMethod)
        Dim lngA As Long, lngB As Long, lngIndex() As Long, lngPtr As Long, lngBounds(1) As Long
        lngIndex = MergeSort(SA, Compare)
        If Compare = vbBinaryCompare Then
            For lngA = 1 To UBound(lngIndex)
                lngPtr = StrPtr(SA(lngIndex(lngB)))
                If BinaryCompare(lngPtr, StrPtr(SA(lngIndex(lngA)))) < 0 Then
                    lngB = lngB + 1
                    If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
                End If
            Next lngA
        ElseIf Compare = vbCaseSensitive Then
            For lngA = 1 To UBound(lngIndex)
                If lstrcmpW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
                    lngB = lngB + 1
                    If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
                End If
            Next lngA
        Else
            For lngA = 1 To UBound(lngIndex)
                If lstrcmpiW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
                    lngB = lngB + 1
                    If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
                End If
            Next lngA
        End If
        If lngB < UBound(lngIndex) Then
            ' sort lngIndex
            QuickSortLong lngIndex, 0, lngB
            ' create and fill the final results array
            ReDim Result(lngB)
            For lngA = 0 To lngB
                Result(lngA) = SA(lngIndex(lngA))
            Next lngA
        Else
            Result = SA
        End If
    End Sub
    As you can see, it does not involve high level of optimization. I could go ahead and optimize this function to run at maximal speed and work on the input array only instead of creating a new one. Right now I just wanted to see the concept of sorting Ellis Dee was talking about.

    You can find the project at my VB6 folder or simply by downloading the attachment below.



    Edit!
    Also note that I did not use QuickSort as I got the image that QuickSort does not preserve order. Thus I bothered to see all the trouble to code & optimize MergeSort.
    Attached Files Attached Files
    Last edited by Merri; May 22nd, 2009 at 08:00 PM. Reason: Forgot to save and close project before zipping, reposting attachment

  26. #26
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Fastest method of removing duplicates from an array or text file?

    Yeah, that's the stuff.

  27. #27
    Cumbrian Milk's Avatar
    Join Date
    Jan 2007
    Location
    0xDEADBEEF
    Posts
    2,448

    Re: Fastest method of removing duplicates from an array or text file?

    Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilised as the indexes themselves indicate the original order.

    Good job everyone BTW.

  28. #28
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Re: Fastest method of removing duplicates from an array or text file?

    Milk: wouldn't that involve making comparisons of every item against the next one one more time, because when sorting there will be no information preserved on whether the comparison was also equal to neighboring items or not? Thus there is no information on how many equals there are in a row nor whether the previous or next item is equal. You just have a bunch of indexes in a sorted order. As a result QuickSort would be likely to end up being slower than MergeSort that already has preserved the order of indexes, unless I have missed something obvious.


    Edit!
    I guess I did miss something obvious, but I'll leave this post as it is. (Read: those comparisons will be done anyway, just also look for the smallest index).
    Last edited by Merri; May 24th, 2009 at 12:22 PM.

  29. #29
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Fastest method of removing duplicates from an array or text file?

    Quote Originally Posted by Milk View Post
    Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilized. The indexes themselves indicate the original order.

    Good job everyone BTW.
    Thanks, Milk, for your applause and I agree. The QuickSort has often been accused of being a memory hog, and I just plumb forgot about the indexes of the QuickSort preserving the original order.
    Doctor Ed

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width