Fastest method of removing duplicates from an array or text file?

**ICESTORM** · May 10th, 2009, 04:46 PM

What's the fastest method of removing duplicates from an array or text file?

The method I'm using right now is reading the file then adding what isn't found in a string into that string using InstrB.

With very few duplicates (like.. 10), it's very fast, in a 3.5 million list, it takes less than 3 seconds on my computer.

With a lot of duplicates, it just dies, freezes.

While a have a program written in C or something that demolishes it in ~10 seconds..

Having InstrB function read through large amounts of data in a loop, looking for every word doesn't work out well with a lot of dupes.

There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?

**Ellis Dee** · May 10th, 2009, 04:55 PM

Originally Posted by ICESTORM

There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?

This is the way to go.

Instead of sorting the list directly, you'd create an index array and sort that. You could then use the index array to identify which dups get removed.

All told it would be pretty darn quick. Is your array just a simple single-dimension string array? Does case matter? ("A" = "a" or "A" <> "a") Let me know the answers to these questions and I can help you out.

**ICESTORM** · May 10th, 2009, 05:29 PM

Single dimension.
Can there be an option of Case-matter on/off?

Thank you.

**Ellis Dee** · May 10th, 2009, 05:31 PM

Originally Posted by ICESTORM

Single dimension.
Can there be an option of Case-matter on/off?

It's a whole lot easier to just use Option Compare Text to make it case-insensitive, but if you want it to be a selectable option I can set it up as a parameter.

**ICESTORM** · May 10th, 2009, 05:53 PM

Uh, Easier way then.

**Ellis Dee** · May 10th, 2009, 06:36 PM

Actually, how fast is a collection for this? It's super simple, but I suspect it may be quite slow.

Code:

Public Sub Test()
    Const Text = "Now is the time for all good men to come to the aid of their country"
    Dim col As Collection
    Dim strTest() As String
    Dim i As Long
    
    strTest = Split(Text, " ")
    Set col = New Collection
    On Error Resume Next
    For i = 0 To UBound(strTest)
        col.Add strTest(i), strTest(i)
    Next
    On Error GoTo 0
    ReDim strTest(col.Count - 1)
    For i = 1 To col.Count
        strTest(i - 1) = col(i)
    Next
    Set col = Nothing
    Debug.Print Text
    Debug.Print Join(strTest, " ")
End Sub

Give it a try and let me know.

**ICESTORM** · May 10th, 2009, 08:10 PM

O.o Should've tried that before, used it to remove dups from a listbox.
Better than what I had before.

Split() function is slow so I used MidB$ and InstrB$ to parse out each item file the file instead.
Merri's QuickSplit is a lot faster also

With the list I tested, it's 25 times slower than another app.
Though, 90% of that time is spent outputting to a textfile..
Putting the Collection items back into the array took about the same amount of time as writing to a file.

Takes 80 seconds for a 75MB text file just to add items to a collection.
Other app counts, reads, sorts, remove dupes, unsorts, and writes over file in 10 seconds.

**Ellis Dee** · May 10th, 2009, 08:23 PM

Actually, that collection technique is a non-starter. It's pretty peppy with lots of duplicates, but with few duplicates it takes a year and a day.

I'm finishing up a faster alternative.

**Ellis Dee** · May 10th, 2009, 08:25 PM

Originally Posted by ICESTORM

Merri's QuickSort is a lot faster also

What do you mean?

**ICESTORM** · May 10th, 2009, 08:27 PM

Oops, meant the "QuickSplit" function that's meant to replace Split()

**Ellis Dee** · May 10th, 2009, 09:48 PM

I ran into problems with the index array approach. I had it all set up perfectly, but it wasn't sorting properly. After staring at it for a while I gave up and rewrote it using a different approach.

This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array. Go through the original array, looking up each entry in the sorted array using a binary search. Use the boolean array to ensure that each element is only used once. All in all it's around 25-35 for seconds for 3.5 million elements on my machine, depending on how many dups there are. (The more dups there are the faster it is.)

There are many ways this exact approach can be improved; I use very slow techniques throughout. I've no doubt someone like Merri could supercharge it pretty effectively. In the meantime, I'll take another stab at the indexed array approach, which really should be faster than 3.5 million binary searches. Maybe Logophobic could set it up if I fail.

I've attached a benchmark program with my solution in case anyone wants to implement their own technique or optimize mine.

**Ellis Dee** · May 10th, 2009, 10:07 PM

One way that approach can be improved is how I handle removing duplicates. I used the very simple method Merri posted at the end of this thread, which is much simpler and slower than other solutions in that thread.

Another is how I copy string arrays using simple assignment:

strDest = strSource

IIRC, that's a slow method.

**ICESTORM** · May 15th, 2009, 06:09 PM

bump.

I think your remove duplicate from array might be ignoring the first item in the array. It compares the 2nd item to the 1st, the 1st item never gets a chance to be added into the new array without duplicates, even if it is unique.

**Code Doc** · May 15th, 2009, 07:07 PM

Ellis, et al., why not sort the array first, as fast as possible. (Shell, Comb, or Jump sorts are the easiest and fastest.)

Then just examine each item sequentially and remove the next element from the array if it matches the current one?

IceStorm, would you like to see simple code that will do this rather painlessly?

**ICESTORM** · May 15th, 2009, 07:52 PM

Is there some way to get a reference/create a ref/index
Create an array of indexes that's sorted.

List
1. B
2. A
3. C
4. E
5. F
6. D
7. H
8. G

Sorted
1. A (Index 2)
2. B (Index 1)
3. C (Index 3)
4. D (Index 6)
5. E (Index 4)
6. F (Index 5)
7. G (Index 8)
8. H (Index 7)

Then use indexes to reset the order.

Other option would maybe be a hash table.
Lookup Index from item...
Wonder if it's going to be slow though.

**Code Doc** · May 15th, 2009, 09:54 PM

Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed.

However, this is not your original post question, which asked for eliminating duplicates elements within an array.

**ICESTORM** · May 15th, 2009, 09:56 PM

"Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed."

How do I do that?

Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.

**Code Doc** · May 17th, 2009, 08:31 AM

Originally Posted by ICESTORM

How do I do that?

Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.

When you remove the element of the sorted array and repack, you can tag the corresponding element of the original referenced array (or convert it to a null or zero--something unique). Then, when it is time to display or save the referenced array, the tagged or nulled out elements are removed.

BTW, it is a bit confusing. I wrote code to do all this about 12 years ago and it did drive me bonkers. I was actually tracking about 10 variables simultaneously and removing elements from them all based on the components of one variable. Then the original elements of all the variables had to be saved back in the order in which they were assembled.

**Ellis Dee** · May 21st, 2009, 09:37 PM

Originally Posted by Code Doc

Ellis, et al., why not sort the array first, as fast as possible. (Shell, Comb, or Jump sorts are the easiest and fastest.)

Then just examine each item sequentially and remove the next element from the array if it matches the current one?

???

That's exactly what I do in the posted solution. To quote: "This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array."

I went with quicksort because it's (much) faster then the algorithms you mentioned.

The real trick is returning to the original order, which is what the boolean array is for.

**Code Doc** · May 22nd, 2009, 06:21 AM

Ellis said, "I went with quicksort because it's (much) faster then the algorithms you mentioned.

The real trick is returning to the original order, which is what the boolean array is for."
---------------
Agreed, I read back that he has a huge number of items to sort. So, the QuickSort would be the best in that case. Carrying the indexed array along for the ride is the part that is tricky.

I suppose we should have provided some sample code using a small list of random data to get OP started, but I got sidetracked.

**Code Doc** · May 22nd, 2009, 08:41 AM

Ellis, you will probably shoot me for this. Here's a way of removing duplicates without sorting them first:

Code:

Const ListSize = 9000
Private Sub Command1_Click()
Dim MyData(ListSize) As String, NoDups(ListSize) As String, RefArr(ListSize) As Byte, Pointer As Long
' Generate Sample Data
For I = 1 To ListSize
    MyData(I) = Format$(CStr(Int(Rnd * ListSize) + 1), "0000")
    RefArr(I) = 1
Next
' Build Non-Duplicate Array
For I = 1 To ListSize - 1
    If RefArr(I) Then
        Pointer = Pointer + 1
        NoDups(Pointer) = MyData(I)
        For J = I + 1 To ListSize
            If MyData(I) = MyData(J) Then RefArr(J) = 0
        Next
    End If
Next
End Sub

I imagine it would be slow for millions of strings, but for less than 10,000 it runs rather well. Note that the comparisons are reduced using the byte array check, and that increases the speed somewhat, especially if there are numerous matches. The nun-duplicate array is produced in the same sequence as the original data array, and that could be important.

**Ellis Dee** · May 22nd, 2009, 11:59 AM

With the OP's spec of 3.5 million items, if there are no dups your method will have to do...

3,500,000 + 3,499,999 + 3,499,998 + ... + 1 = 3,062,501,750,000

...comparisons. Three trillion string comparisons is probably not the quickest way to go.

On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)

**Code Doc** · May 22nd, 2009, 12:26 PM

Originally Posted by Ellis Dee

With the OP's spec of 3.5 million items, if there are no dups your method will have to do...

3,500,000 + 3,499,999 + 3,499,998 + ... + 1 = 3,062,501,750,000

...comparisons. Three trillion string comparisons is probably not the quickest way to go.

On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)

Correct, I am aware of that. On the other hand, if there are lots of dups, the routine runs rather efficiently--the more the merrier. I'm starting to think that there is no super fast way of doing this, especially with a huge array and very few dups.

**Merri** · May 22nd, 2009, 04:55 PM

This project will interest you: MergeSort optimization.

It is a class module (StringSort.cls) that features MergeSort and CombSort. CombSort sorts the given array. MergeSort instead returns an index array, thus leaving the original array as it is. You can then use the index to work with this duplication business you have here.

Internally the class module includes several optimizations, including a very fast ASM binary comparison function for strings. String array is faked to Long array for fast access to string pointers. It pretty much avoids everything that normally would slow things down behind your back.

**Merri** · May 22nd, 2009, 07:44 PM

Double post, but since I went ahead and added my own new duplicate remover to Ellis Dee's benchmarker I thought to post the results.

With millions of duplicates FilterDuplicates is two times faster than Collection method. I don't know how fast the Array method is in comparison since I removed the benchmarking code for that. Collection 18.5 seconds and FilterDuplicates 9 seconds.

For no duplicates Array method gave me some 42 seconds, while FilterDuplicates runs at 6 seconds.

Here is the function (part of StringSort.cls so it does not run standalone):

Code:

Public Sub FilterDuplicates(SA() As String, Result() As String, Optional Compare As VbSortMethod)
    Dim lngA As Long, lngB As Long, lngIndex() As Long, lngPtr As Long, lngBounds(1) As Long
    lngIndex = MergeSort(SA, Compare)
    If Compare = vbBinaryCompare Then
        For lngA = 1 To UBound(lngIndex)
            lngPtr = StrPtr(SA(lngIndex(lngB)))
            If BinaryCompare(lngPtr, StrPtr(SA(lngIndex(lngA)))) < 0 Then
                lngB = lngB + 1
                If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
            End If
        Next lngA
    ElseIf Compare = vbCaseSensitive Then
        For lngA = 1 To UBound(lngIndex)
            If lstrcmpW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
                lngB = lngB + 1
                If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
            End If
        Next lngA
    Else
        For lngA = 1 To UBound(lngIndex)
            If lstrcmpiW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
                lngB = lngB + 1
                If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
            End If
        Next lngA
    End If
    If lngB < UBound(lngIndex) Then
        ' sort lngIndex
        QuickSortLong lngIndex, 0, lngB
        ' create and fill the final results array
        ReDim Result(lngB)
        For lngA = 0 To lngB
            Result(lngA) = SA(lngIndex(lngA))
        Next lngA
    Else
        Result = SA
    End If
End Sub

As you can see, it does not involve high level of optimization. I could go ahead and optimize this function to run at maximal speed and work on the input array only instead of creating a new one. Right now I just wanted to see the concept of sorting Ellis Dee was talking about.

You can find the project at my VB6 folder or simply by downloading the attachment below.

Edit!
Also note that I did not use QuickSort as I got the image that QuickSort does not preserve order. Thus I bothered to see all the trouble to code & optimize MergeSort.

**Ellis Dee** · May 24th, 2009, 09:20 AM

Yeah, that's the stuff.

**Milk** · May 24th, 2009, 10:55 AM

Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilised as the indexes themselves indicate the original order.

Good job everyone BTW.

**Merri** · May 24th, 2009, 12:14 PM

Milk: wouldn't that involve making comparisons of every item against the next one one more time, because when sorting there will be no information preserved on whether the comparison was also equal to neighboring items or not? Thus there is no information on how many equals there are in a row nor whether the previous or next item is equal. You just have a bunch of indexes in a sorted order. As a result QuickSort would be likely to end up being slower than MergeSort that already has preserved the order of indexes, unless I have missed something obvious.

Edit!
I guess I did miss something obvious, but I'll leave this post as it is. (Read: those comparisons will be done anyway, just also look for the smallest index).

**Code Doc** · May 24th, 2009, 07:12 PM

Originally Posted by Milk

Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilized. The indexes themselves indicate the original order.

Good job everyone BTW.

Thanks, Milk, for your applause and I agree. The QuickSort has often been accused of being a memory hog, and I just plumb forgot about the indexes of the QuickSort preserving the original order.

Thread: Fastest method of removing duplicates from an array or text file?

Thread Tools

Display

Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Re: Fastest method of removing duplicates from an array or text file?

Posting Permissions