Fastest method of removing duplicates from an array or text file?
What's the fastest method of removing duplicates from an array or text file?
The method I'm using right now is reading the file then adding what isn't found in a string into that string using InstrB.
With very few duplicates (like.. 10), it's very fast, in a 3.5 million list, it takes less than 3 seconds on my computer.
With a lot of duplicates, it just dies, freezes.
While a have a program written in C or something that demolishes it in ~10 seconds..
Having InstrB function read through large amounts of data in a loop, looking for every word doesn't work out well with a lot of dupes.
There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?
Re: Fastest method of removing duplicates from an array or text file?
Originally Posted by ICESTORM
There's also sorting the list first then comparing each one with the next, removing dupes in 1 pass, but how would you rearrange the sorted list back into the previous order?
This is the way to go.
Instead of sorting the list directly, you'd create an index array and sort that. You could then use the index array to identify which dups get removed.
All told it would be pretty darn quick. Is your array just a simple single-dimension string array? Does case matter? ("A" = "a" or "A" <> "a") Let me know the answers to these questions and I can help you out.
Re: Fastest method of removing duplicates from an array or text file?
Originally Posted by ICESTORM
Single dimension.
Can there be an option of Case-matter on/off?
It's a whole lot easier to just use Option Compare Text to make it case-insensitive, but if you want it to be a selectable option I can set it up as a parameter.
Re: Fastest method of removing duplicates from an array or text file?
Actually, how fast is a collection for this? It's super simple, but I suspect it may be quite slow.
Code:
Public Sub Test()
Const Text = "Now is the time for all good men to come to the aid of their country"
Dim col As Collection
Dim strTest() As String
Dim i As Long
strTest = Split(Text, " ")
Set col = New Collection
On Error Resume Next
For i = 0 To UBound(strTest)
col.Add strTest(i), strTest(i)
Next
On Error GoTo 0
ReDim strTest(col.Count - 1)
For i = 1 To col.Count
strTest(i - 1) = col(i)
Next
Set col = Nothing
Debug.Print Text
Debug.Print Join(strTest, " ")
End Sub
Re: Fastest method of removing duplicates from an array or text file?
O.o Should've tried that before, used it to remove dups from a listbox.
Better than what I had before.
Split() function is slow so I used MidB$ and InstrB$ to parse out each item file the file instead.
Merri's QuickSplit is a lot faster also
With the list I tested, it's 25 times slower than another app.
Though, 90% of that time is spent outputting to a textfile..
Putting the Collection items back into the array took about the same amount of time as writing to a file.
Takes 80 seconds for a 75MB text file just to add items to a collection.
Other app counts, reads, sorts, remove dupes, unsorts, and writes over file in 10 seconds.
Last edited by ICESTORM; May 10th, 2009 at 08:26 PM.
Re: Fastest method of removing duplicates from an array or text file?
I ran into problems with the index array approach. I had it all set up perfectly, but it wasn't sorting properly. After staring at it for a while I gave up and rewrote it using a different approach.
This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array. Go through the original array, looking up each entry in the sorted array using a binary search. Use the boolean array to ensure that each element is only used once. All in all it's around 25-35 for seconds for 3.5 million elements on my machine, depending on how many dups there are. (The more dups there are the faster it is.)
There are many ways this exact approach can be improved; I use very slow techniques throughout. I've no doubt someone like Merri could supercharge it pretty effectively. In the meantime, I'll take another stab at the indexed array approach, which really should be faster than 3.5 million binary searches. Maybe Logophobic could set it up if I fail.
I've attached a benchmark program with my solution in case anyone wants to implement their own technique or optimize mine.
Re: Fastest method of removing duplicates from an array or text file?
One way that approach can be improved is how I handle removing duplicates. I used the very simple method Merri posted at the end of this thread, which is much simpler and slower than other solutions in that thread.
Another is how I copy string arrays using simple assignment:
Re: Fastest method of removing duplicates from an array or text file?
bump.
I think your remove duplicate from array might be ignoring the first item in the array. It compares the 2nd item to the 1st, the 1st item never gets a chance to be added into the new array without duplicates, even if it is unique.
Re: Fastest method of removing duplicates from an array or text file?
Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed.
However, this is not your original post question, which asked for eliminating duplicates elements within an array.
Re: Fastest method of removing duplicates from an array or text file?
"Yes. You can store two arrays, sort one array as needed, carry the other array in memory, kind of like a backup. Then just replay the other referenced array back out as needed."
How do I do that?
Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.
Re: Fastest method of removing duplicates from an array or text file?
Originally Posted by ICESTORM
How do I do that?
Resorting to original order is one of the problems to solve, since the the list is being sorted then items are being removed.
When you remove the element of the sorted array and repack, you can tag the corresponding element of the original referenced array (or convert it to a null or zero--something unique). Then, when it is time to display or save the referenced array, the tagged or nulled out elements are removed.
BTW, it is a bit confusing. I wrote code to do all this about 12 years ago and it did drive me bonkers. I was actually tracking about 10 variables simultaneously and removing elements from them all based on the components of one variable. Then the original elements of all the variables had to be saved back in the order in which they were assembled.
Re: Fastest method of removing duplicates from an array or text file?
Originally Posted by Code Doc
Ellis, et al., why not sort the array first, as fast as possible. (Shell, Comb, or Jump sorts are the easiest and fastest.)
Then just examine each item sequentially and remove the next element from the array if it matches the current one?
???
That's exactly what I do in the posted solution. To quote: "This new approach is to create a mirror array, sort it, remove the dups, then create a boolean array to match that sorted array."
I went with quicksort because it's (much) faster then the algorithms you mentioned.
The real trick is returning to the original order, which is what the boolean array is for.
Re: Fastest method of removing duplicates from an array or text file?
Ellis said, "I went with quicksort because it's (much) faster then the algorithms you mentioned.
The real trick is returning to the original order, which is what the boolean array is for."
---------------
Agreed, I read back that he has a huge number of items to sort. So, the QuickSort would be the best in that case. Carrying the indexed array along for the ride is the part that is tricky.
I suppose we should have provided some sample code using a small list of random data to get OP started, but I got sidetracked.
Re: Fastest method of removing duplicates from an array or text file?
Ellis, you will probably shoot me for this. Here's a way of removing duplicates without sorting them first:
Code:
Const ListSize = 9000
Private Sub Command1_Click()
Dim MyData(ListSize) As String, NoDups(ListSize) As String, RefArr(ListSize) As Byte, Pointer As Long
' Generate Sample Data
For I = 1 To ListSize
MyData(I) = Format$(CStr(Int(Rnd * ListSize) + 1), "0000")
RefArr(I) = 1
Next
' Build Non-Duplicate Array
For I = 1 To ListSize - 1
If RefArr(I) Then
Pointer = Pointer + 1
NoDups(Pointer) = MyData(I)
For J = I + 1 To ListSize
If MyData(I) = MyData(J) Then RefArr(J) = 0
Next
End If
Next
End Sub
I imagine it would be slow for millions of strings, but for less than 10,000 it runs rather well. Note that the comparisons are reduced using the byte array check, and that increases the speed somewhat, especially if there are numerous matches. The nun-duplicate array is produced in the same sequence as the original data array, and that could be important.
...comparisons. Three trillion string comparisons is probably not the quickest way to go.
On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)
...comparisons. Three trillion string comparisons is probably not the quickest way to go.
On a tangent, one of the problems with using a sorted index array (instead of sorting a mirror array) is that you end up needing to sort twice; first when you create the sorted index, then after removing the dups you have to sort the index array back to the original order. So I think sorting a mirror array and using a boolean lookup array might be the fastest way to go, especially if the routines get highly optimized. (My posted solution isn't optimized at all.)
Correct, I am aware of that. On the other hand, if there are lots of dups, the routine runs rather efficiently--the more the merrier. I'm starting to think that there is no super fast way of doing this, especially with a huge array and very few dups.
It is a class module (StringSort.cls) that features MergeSort and CombSort. CombSort sorts the given array. MergeSort instead returns an index array, thus leaving the original array as it is. You can then use the index to work with this duplication business you have here.
Internally the class module includes several optimizations, including a very fast ASM binary comparison function for strings. String array is faked to Long array for fast access to string pointers. It pretty much avoids everything that normally would slow things down behind your back.
Re: Fastest method of removing duplicates from an array or text file?
Double post, but since I went ahead and added my own new duplicate remover to Ellis Dee's benchmarker I thought to post the results.
With millions of duplicates FilterDuplicates is two times faster than Collection method. I don't know how fast the Array method is in comparison since I removed the benchmarking code for that. Collection 18.5 seconds and FilterDuplicates 9 seconds.
For no duplicates Array method gave me some 42 seconds, while FilterDuplicates runs at 6 seconds.
Here is the function (part of StringSort.cls so it does not run standalone):
Code:
Public Sub FilterDuplicates(SA() As String, Result() As String, Optional Compare As VbSortMethod)
Dim lngA As Long, lngB As Long, lngIndex() As Long, lngPtr As Long, lngBounds(1) As Long
lngIndex = MergeSort(SA, Compare)
If Compare = vbBinaryCompare Then
For lngA = 1 To UBound(lngIndex)
lngPtr = StrPtr(SA(lngIndex(lngB)))
If BinaryCompare(lngPtr, StrPtr(SA(lngIndex(lngA)))) < 0 Then
lngB = lngB + 1
If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
End If
Next lngA
ElseIf Compare = vbCaseSensitive Then
For lngA = 1 To UBound(lngIndex)
If lstrcmpW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
lngB = lngB + 1
If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
End If
Next lngA
Else
For lngA = 1 To UBound(lngIndex)
If lstrcmpiW(StrPtr(SA(lngIndex(lngB))), StrPtr(SA(lngIndex(lngA)))) < 0 Then
lngB = lngB + 1
If lngB < lngA Then lngIndex(lngB) = lngIndex(lngA)
End If
Next lngA
End If
If lngB < UBound(lngIndex) Then
' sort lngIndex
QuickSortLong lngIndex, 0, lngB
' create and fill the final results array
ReDim Result(lngB)
For lngA = 0 To lngB
Result(lngA) = SA(lngIndex(lngA))
Next lngA
Else
Result = SA
End If
End Sub
As you can see, it does not involve high level of optimization. I could go ahead and optimize this function to run at maximal speed and work on the input array only instead of creating a new one. Right now I just wanted to see the concept of sorting Ellis Dee was talking about.
You can find the project at my VB6 folder or simply by downloading the attachment below.
Edit!
Also note that I did not use QuickSort as I got the image that QuickSort does not preserve order. Thus I bothered to see all the trouble to code & optimize MergeSort.
Last edited by Merri; May 22nd, 2009 at 08:00 PM.
Reason: Forgot to save and close project before zipping, reposting attachment
Re: Fastest method of removing duplicates from an array or text file?
Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilised as the indexes themselves indicate the original order.
Re: Fastest method of removing duplicates from an array or text file?
Milk: wouldn't that involve making comparisons of every item against the next one one more time, because when sorting there will be no information preserved on whether the comparison was also equal to neighboring items or not? Thus there is no information on how many equals there are in a row nor whether the previous or next item is equal. You just have a bunch of indexes in a sorted order. As a result QuickSort would be likely to end up being slower than MergeSort that already has preserved the order of indexes, unless I have missed something obvious.
Edit!
I guess I did miss something obvious, but I'll leave this post as it is. (Read: those comparisons will be done anyway, just also look for the smallest index).
Re: Fastest method of removing duplicates from an array or text file?
Originally Posted by Milk
Just a quick one re QuickSort not preserving order. If it's being used to return an index array then the results can be easily and quickly stabilized. The indexes themselves indicate the original order.
Good job everyone BTW.
Thanks, Milk, for your applause and I agree. The QuickSort has often been accused of being a memory hog, and I just plumb forgot about the indexes of the QuickSort preserving the original order.