Fastest, most efficient way to create one unique list of strings from two other lists
I have two lists of names from which I need to make one master list. The one master list should contain all names in both of the two lists but only once (each name in master list is unique). What is the fastest, most efficient way (in terms of memory utilization) to filter out the duplicates? I recognize that these two goals may be mutually exclusive.
I'm currently using a collection to do this - I add the name from each list into the master collection object using the name as the key. The collection prevents a duplicate entry from being added because an error is raised when a duplicate key is attempted to be added.
This works OK but I've got two concerns: 1) The memory consumption of using a collection over a simpler data structure like an array when talking about hundreds of thousands of items is not insignificant. I've encountered out of memory errors in this routine at times. 2) The performance of adding items to a collection (with keys) and using error handling to filter out duplicates may not be the fastest way to achieve this.
Note: Each of two lists can contain hundreds of thousands of names - some of which may contain Unicode characters.
Any brainstorming ideas of alternate approaches are appreciated.
Re: Fastest, most efficient way to create one unique list of strings from two other l
Hundreds of thousands of strings per list? Maybe using a recordset may be more memory efficient, using the string as a primary key to prevent duplicates? Not sure what kind of speed hits you'll encounter.
If the lists weren't so large, I'd suggest binary sorting to quickly identify duplicates. But that does require keeping your lists sorted as you create/append to them and resizing arrays as needed. Binary sorting can rule out a duplicate in less than 20 iterations among a "collection" of 1 million entries.
You should get plenty of suggestions and ideas...
Edited. You might want to describe how you are getting those strings? Do they exist in a file or somewhere else? If so, there are ways to organize a file or creating/using indexes to quickly access the data from file vs. keeping it all in memory.
Last edited by LaVolpe; May 26th, 2018 at 10:54 AM.
Insomnia is just a byproduct of, "It can't be done"
Re: Fastest, most efficient way to create one unique list of strings from two other l
Originally Posted by AAraya
2) The performance of adding items to a collection (with keys) and using error handling to filter out duplicates may not be the fastest way to achieve this.
The performance penalty of using err handling comes from VB6 populating Error object on failure. I've been using a tweaked VBA.Collection interface that *does not* raise errors on all its methods but returns the HRESULT as Long. This way failure cases on searching (Item function) or adding (Add sub) are not penalized and the performance is orders of magnitude better.
Whether or not you are going to be using built-in Collection, still no other data structure can beat a good old hashtable in your use-case performancewise IMO.
Re: Fastest, most efficient way to create one unique list of strings from two other l
Well perhaps something like this hacked version of my HashTub class might work.
This one doesn't store separate Key and Item values, but just uses Item for both. Access is only by index but it does offer an Exists by Item value.
Here is how I created test items to add:
Code:
With New HashTubString
Population = 1000000
Prt "Population" & vbTab & Format$(Population, "#,##0") & " items"
Prt "Overflow chunk" & vbTab & Format$(.Chunk, "#,##0") & " items"
For I = 1 To Population
'Create item values with some but not too much uniqueness:
.Add MonthName((I Mod 12) + 1) & CStr(I Mod Population \ 4)
Next
Duplicates get eliminated. "Add" is the time it took to add items. "Count" is how many unique items got stored. "Enum" is the time it took to iterate over the list of items checking for adjacent duplicates (which shouldn't be there of course, but I wanted something within the loop).
Re: Fastest, most efficient way to create one unique list of strings from two other l
Originally Posted by dilettante
Well perhaps something like this hacked version of my HashTub class might work.
This one doesn't store separate Key and Item values, but just uses Item for both. Access is only by index but it does offer an Exists by Item value.
Here is how I created test items to add:
Code:
With New HashTubString
Population = 1000000
Prt "Population" & vbTab & Format$(Population, "#,##0") & " items"
Prt "Overflow chunk" & vbTab & Format$(.Chunk, "#,##0") & " items"
For I = 1 To Population
'Create item values with some but not too much uniqueness:
.Add MonthName((I Mod 12) + 1) & CStr(I Mod Population \ 4)
Next
Duplicates get eliminated. "Add" is the time it took to add items. "Count" is how many unique items got stored. "Enum" is the time it took to iterate over the list of items checking for adjacent duplicates (which shouldn't be there of course, but I wanted something within the loop).
It doesn't have Remove method.
Is there any implement similar to .NET SortedDictionary in VB6?
Last edited by Zvoni; Tomorrow at 31:69 PM.
----------------------------------------------------------------------------------------
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------------------
People call me crazy because i'm jumping out of perfectly fine airplanes.
---------------------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad
Re: Fastest, most efficient way to create one unique list of strings from two other l
I just might have had a monumentally stupid idea
Please bear with me.
Condition: Both Arrays are unique in itself (meaning: no duplicates in itself)
Algorythm as follows: (AIRCODE!!)
Code:
Dim arrMaster() As String
Dim sTemp As String
Dim i As Long
sTemp=Join(MyArray1,"@:@") 'Or use any unique Delimiter, that doesn't appear in any of both Arrays
For i=LBound(MyArray2) To UBound(MyArray2)
If Not InStr(1, sTemp, MyArray(i), vbBinaryCompare) Then 'or use vbTextCompare for case-insensitive
sTemp=sTemp & "@:@" & MyArray2(i)
End If
Next
arrMaster=Split(sTemp, "@:@")
'send arrMaster to QuickSort
Issues:
No ideas about performance (as i said: maybe a monumentally stupid Idea)
No idea about sTemp exceeding the 2GB-Limit
Now you can start laughing and ridiculing me...
Last edited by Zvoni; Tomorrow at 31:69 PM.
----------------------------------------------------------------------------------------
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------------------
People call me crazy because i'm jumping out of perfectly fine airplanes.
---------------------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad
Re: Fastest, most efficient way to create one unique list of strings from two other l
A couple of you mentioned sorting of the list... Well, I didn't mention it in the original post since I didn't think it relevant to my original question but yes, the master list does need to be sorted. I was simply going to take the master list compiled by this routine and sort it. But perhaps it is relevant to deciding the best data structures and algorithms to handle this entire task.
So, the complete task is:
Given two collections of file names
Create one master list of all file names with no duplicates (case-insensitive, i.e. "file.txt" = "FILE.TXT")
The master list should be sorted using the same logic as Explorer (using StrCmpLogicalW)
The solution should be as fast and as memory efficient as possible and scalable to hundreds of thousands of files - maybe even approaching 1 million files, if possible..
Last edited by AAraya; May 27th, 2018 at 07:19 PM.
I'm sorry but I don't read Russian. Can you explain what this post thread is about? If you're directing me to a code sample, is it the Add routine in your code that you are showing me?
Code:
Private Sub Add(Items() As HashItem, Value As String)
Dim I As Long
Dim N As Long
I = CalcHash(Value)
If Len(Items(I).Value) Then
If StrComp(Items(I).Value, Value, vbTextCompare) = 0 Then Exit Sub
Do While Items(I).Next
I = Items(I).Next
If StrComp(Items(I).Value, Value, vbTextCompare) = 0 Then Exit Sub
Loop
N = UBound(Items) + 1
ReDim Preserve Items(N)
Items(N).Value = Value
Items(I).Next = N
Else: Items(I).Value = Value
End If
End Sub
Re: Fastest, most efficient way to create one unique list of strings from two other l
Originally Posted by wqweto
The performance penalty of using err handling comes from VB6 populating Error object on failure. I've been using a tweaked VBA.Collection interface that *does not* raise errors on all its methods but returns the HRESULT as Long. This way failure cases on searching (Item function) or adding (Add sub) are not penalized and the performance is orders of magnitude better.
Whether or not you are going to be using built-in Collection, still no other data structure can beat a good old hashtable in your use-case performancewise IMO.
</wqw>
This was helpful info, thank you. I was already able to improve performance using some of the insights you provided here. Thank you!
Re: Fastest, most efficient way to create one unique list of strings from two other l
Originally Posted by AAraya
So, the complete task is:
Given two collections of file names
Create one master list of all file names with no duplicates (case-insensitive, i.e. "file.txt" = "FILE.TXT")
The master list should be sorted using the same logic as Explorer (using StrCmpLogicalW)
The solution should be as fast and as memory efficient as possible and scalable to hundreds of thousands of files - maybe even approaching 1 million files, if possible..
The RC5s cSortedDictionary can accomplish all that already at "Adding-Time" in one go (being always sorted at any time):
Code:
'construction
Set D = New_c.SortedDictionary
D.SetStringCompareFlags cmpLogical
'and later usage
If Not D.Exists(strNewCandidate) Then D.Add strNewCandidate
If you pass D along as a Parameter into your other routines (which formerly filled your two "Source-Collections"),
you don't need to populate these Extra-Containers at all - just the magenta-colored line above would be enough.
Be advised, that StrComp-Logical is an expensive Comparison (in a Sort-Algo, or a sorting collection).
Normal Case-Insensitive Comparisons/Sorts are about factor 3 faster, as the following example shows:
Code:
Option Explicit
Private Sub Form_Click()
AutoRedraw = True: Cls
Dim D As cSortedDictionary, i As Long
Const Count As Long = 500000
ReDim SArr(0 To Count - 1) As String
For i = 0 To Count - 1
SArr(i) = MonthName((i Mod 12) + 1) & CStr(i Mod Count \ 4)
Next
New_c.Timing True
Set D = New_c.SortedDictionary(TextCompare)
For i = 0 To Count - 1
If Not D.Exists(SArr(i)) Then D.Add SArr(i)
Next
Print "case-insensitive normal:", D.Count; New_c.Timing
Set D = Nothing 'clear the Dictionary
New_c.Timing True
Set D = New_c.SortedDictionary
D.SetStringCompareFlags cmpLogical
For i = 0 To Count - 1
If Not D.Exists(SArr(i)) Then D.Add SArr(i)
Next
Print "case-insensitive logical:", D.Count; New_c.Timing
End Sub
Re: Fastest, most efficient way to create one unique list of strings from two other l
For a long time I have these 2 routines for merging sorted string arrays.
Maybe they can be of any use to you.
Code:
Private Function MergeIntersection(List1() As Long, List2() As Long, lOutput() As Long) As Boolean
Dim l1 As Long, l2 As Long
Dim lU1 As Long, lU2 As Long
Dim lValue1 As Long, lValue2 As Long
Dim lCnt As Long
' If 1 of the arrays is empty then the output is also empty
If Not IsDimmed(List1) Then
Erase lOutput
Exit Function
End If
If Not IsDimmed(List2) Then
Erase lOutput
Exit Function
End If
lU1 = UBound(List1)
lU2 = UBound(List2)
ReDim lOutput(lU1)
Do
lValue1 = List1(l1)
lValue2 = List2(l2)
If lValue1 = lValue2 Then
lOutput(lCnt) = lValue1
lCnt = lCnt + 1
l1 = l1 + 1
l2 = l2 + 1
ElseIf lValue1 > lValue2 Then
l2 = l2 + 1
Else
l1 = l1 + 1
End If
Loop Until l1 > lU1 Or l2 > lU2
If lCnt = 0 Then
Erase lOutput
Else
ReDim Preserve lOutput(lCnt - 1)
MergeIntersection = True
End If
End Function
Private Function MergeUnion(List1() As Long, List2() As Long, lOutput() As Long) As Boolean
Dim l1 As Long, l2 As Long
Dim lU1 As Long, lU2 As Long
Dim lValue1 As Long, lValue2 As Long
Dim lCnt As Long, i As Long
Dim bDimmed1 As Boolean, bDimmed2 As Boolean
bDimmed1 = IsDimmed(List1)
bDimmed2 = IsDimmed(List2)
' If both arrays are empty then the result is also empty
If Not bDimmed1 And Not bDimmed2 Then
Erase lOutput
Exit Function
End If
If Not bDimmed1 Then
lOutput = List2
MergeUnion = True
Exit Function
End If
If Not bDimmed2 Then
lOutput = List1
MergeUnion = True
Exit Function
End If
lU1 = UBound(List1)
lU2 = UBound(List2)
ReDim lOutput(lU1 + lU2 + 1)
lValue1 = List1(l1)
lValue2 = List2(l2)
Do
If l1 > lU1 Then
For i = l2 To lU2
lValue2 = List2(i)
If lValue2 <> lValue1 Then
lOutput(lCnt) = lValue2
lCnt = lCnt + 1
End If
Next i
Exit Do
ElseIf l2 > lU2 Then
For i = l1 To lU1
lValue1 = List1(i)
If lValue1 <> lValue2 Then
lOutput(lCnt) = lValue1
lCnt = lCnt + 1
End If
Next i
Exit Do
Else
If lValue1 = lValue2 Then
lOutput(lCnt) = lValue1
l1 = l1 + 1
l2 = l2 + 1
If l1 <= lU1 Then lValue1 = List1(l1)
If l2 <= lU2 Then lValue2 = List2(l2)
ElseIf lValue1 > lValue2 Then
lOutput(lCnt) = lValue2
l2 = l2 + 1
If l2 <= lU2 Then lValue2 = List2(l2)
Else
lOutput(lCnt) = lValue1
l1 = l1 + 1
If l1 <= lU1 Then lValue1 = List1(l1)
End If
lCnt = lCnt + 1
End If
Loop Until l1 > lU1 And l2 > lU2
If lCnt = 0 Then
Erase lOutput
Else
ReDim Preserve lOutput(lCnt - 1)
MergeUnion = True
End If
End Function