Results 1 to 14 of 14

Thread: Fastest, most efficient way to create one unique list of strings from two other lists

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Aug 2011
    Location
    Palm Coast, FL
    Posts
    416

    Fastest, most efficient way to create one unique list of strings from two other lists

    I have two lists of names from which I need to make one master list. The one master list should contain all names in both of the two lists but only once (each name in master list is unique). What is the fastest, most efficient way (in terms of memory utilization) to filter out the duplicates? I recognize that these two goals may be mutually exclusive.

    I'm currently using a collection to do this - I add the name from each list into the master collection object using the name as the key. The collection prevents a duplicate entry from being added because an error is raised when a duplicate key is attempted to be added.

    This works OK but I've got two concerns: 1) The memory consumption of using a collection over a simpler data structure like an array when talking about hundreds of thousands of items is not insignificant. I've encountered out of memory errors in this routine at times. 2) The performance of adding items to a collection (with keys) and using error handling to filter out duplicates may not be the fastest way to achieve this.

    Note: Each of two lists can contain hundreds of thousands of names - some of which may contain Unicode characters.

    Any brainstorming ideas of alternate approaches are appreciated.

  2. #2
    VB-aholic & Lovin' It LaVolpe's Avatar
    Join Date
    Oct 2007
    Location
    Beside Waldo
    Posts
    19,541

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Hundreds of thousands of strings per list? Maybe using a recordset may be more memory efficient, using the string as a primary key to prevent duplicates? Not sure what kind of speed hits you'll encounter.

    If the lists weren't so large, I'd suggest binary sorting to quickly identify duplicates. But that does require keeping your lists sorted as you create/append to them and resizing arrays as needed. Binary sorting can rule out a duplicate in less than 20 iterations among a "collection" of 1 million entries.

    You should get plenty of suggestions and ideas...

    Edited. You might want to describe how you are getting those strings? Do they exist in a file or somewhere else? If so, there are ways to organize a file or creating/using indexes to quickly access the data from file vs. keeping it all in memory.
    Last edited by LaVolpe; May 26th, 2018 at 10:54 AM.
    Insomnia is just a byproduct of, "It can't be done"

    Classics Enthusiast? Here's my 1969 Mustang Mach I Fastback. Her sister '67 Coupe has been adopted

    Newbie? Novice? Bored? Spend a few minutes browsing the FAQ section of the forum.
    Read the HitchHiker's Guide to Getting Help on the Forums.
    Here is the list of TAGs you can use to format your posts
    Here are VB6 Help Files online


    {Alpha Image Control} {Memory Leak FAQ} {Unicode Open/Save Dialog} {Resource Image Viewer/Extractor}
    {VB and DPI Tutorial} {Manifest Creator} {UserControl Button Template} {stdPicture Render Usage}

  3. #3
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    5,120

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Quote Originally Posted by AAraya View Post
    2) The performance of adding items to a collection (with keys) and using error handling to filter out duplicates may not be the fastest way to achieve this.
    The performance penalty of using err handling comes from VB6 populating Error object on failure. I've been using a tweaked VBA.Collection interface that *does not* raise errors on all its methods but returns the HRESULT as Long. This way failure cases on searching (Item function) or adding (Add sub) are not penalized and the performance is orders of magnitude better.

    Whether or not you are going to be using built-in Collection, still no other data structure can beat a good old hashtable in your use-case performancewise IMO.

    cheers,
    </wqw>

  4. #4
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Well perhaps something like this hacked version of my HashTub class might work.

    This one doesn't store separate Key and Item values, but just uses Item for both. Access is only by index but it does offer an Exists by Item value.

    Here is how I created test items to add:

    Code:
        With New HashTubString
            Population = 1000000
            Prt "Population" & vbTab & Format$(Population, "#,##0") & " items"
            Prt "Overflow chunk" & vbTab & Format$(.Chunk, "#,##0") & " items"
            For I = 1 To Population
                'Create item values with some but not too much uniqueness:
                .Add MonthName((I Mod 12) + 1) & CStr(I Mod Population \ 4)
            Next
    Duplicates get eliminated. "Add" is the time it took to add items. "Count" is how many unique items got stored. "Enum" is the time it took to iterate over the list of items checking for adjacent duplicates (which shouldn't be there of course, but I wanted something within the loop).

    Name:  sshot.png
Views: 514
Size:  2.6 KB
    Attached Files Attached Files

  5. #5
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Quote Originally Posted by dilettante View Post
    Well perhaps something like this hacked version of my HashTub class might work.

    This one doesn't store separate Key and Item values, but just uses Item for both. Access is only by index but it does offer an Exists by Item value.

    Here is how I created test items to add:

    Code:
        With New HashTubString
            Population = 1000000
            Prt "Population" & vbTab & Format$(Population, "#,##0") & " items"
            Prt "Overflow chunk" & vbTab & Format$(.Chunk, "#,##0") & " items"
            For I = 1 To Population
                'Create item values with some but not too much uniqueness:
                .Add MonthName((I Mod 12) + 1) & CStr(I Mod Population \ 4)
            Next
    Duplicates get eliminated. "Add" is the time it took to add items. "Count" is how many unique items got stored. "Enum" is the time it took to iterate over the list of items checking for adjacent duplicates (which shouldn't be there of course, but I wanted something within the loop).

    Name:  sshot.png
Views: 514
Size:  2.6 KB
    It doesn't have Remove method.
    Is there any implement similar to .NET SortedDictionary in VB6?

  6. #6
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    No, it has no remove method. No, it doesn't create a sorted list. No, it doesn't help you lose pounds fast. No, it doesn't taste great on ice cream.

    None of those were listed as requirements, but if you want to implement those features have at it.

  7. #7
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,418

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    This kind of reminds of a FULL OUTER JOIN in SQL

    http://www.vbforums.com/showthread.p...n-alternatives
    Last edited by Zvoni; Tomorrow at 31:69 PM.
    ----------------------------------------------------------------------------------------

    One System to rule them all, One Code to find them,
    One IDE to bring them all, and to the Framework bind them,
    in the Land of Redmond, where the Windows lie
    ---------------------------------------------------------------------------------
    People call me crazy because i'm jumping out of perfectly fine airplanes.
    ---------------------------------------------------------------------------------
    Code is like a joke: If you have to explain it, it's bad

  8. #8

  9. #9
    PowerPoster Zvoni's Avatar
    Join Date
    Sep 2012
    Location
    To the moon and then left
    Posts
    4,418

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    I just might have had a monumentally stupid idea
    Please bear with me.
    Condition: Both Arrays are unique in itself (meaning: no duplicates in itself)
    Algorythm as follows: (AIRCODE!!)
    Code:
    Dim arrMaster() As String
    Dim sTemp As String
    Dim i As Long
    
    sTemp=Join(MyArray1,"@:@") 'Or use any unique Delimiter, that doesn't appear in any of both Arrays
    
    For i=LBound(MyArray2) To UBound(MyArray2)
    If Not InStr(1, sTemp, MyArray(i), vbBinaryCompare) Then   'or use vbTextCompare for case-insensitive
    sTemp=sTemp & "@:@" & MyArray2(i)
    End If
    Next
    arrMaster=Split(sTemp, "@:@")
    'send arrMaster to QuickSort
    Issues:
    No ideas about performance (as i said: maybe a monumentally stupid Idea)
    No idea about sTemp exceeding the 2GB-Limit

    Now you can start laughing and ridiculing me...
    Last edited by Zvoni; Tomorrow at 31:69 PM.
    ----------------------------------------------------------------------------------------

    One System to rule them all, One Code to find them,
    One IDE to bring them all, and to the Framework bind them,
    in the Land of Redmond, where the Windows lie
    ---------------------------------------------------------------------------------
    People call me crazy because i'm jumping out of perfectly fine airplanes.
    ---------------------------------------------------------------------------------
    Code is like a joke: If you have to explain it, it's bad

  10. #10

    Thread Starter
    Hyperactive Member
    Join Date
    Aug 2011
    Location
    Palm Coast, FL
    Posts
    416

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    A couple of you mentioned sorting of the list... Well, I didn't mention it in the original post since I didn't think it relevant to my original question but yes, the master list does need to be sorted. I was simply going to take the master list compiled by this routine and sort it. But perhaps it is relevant to deciding the best data structures and algorithms to handle this entire task.

    So, the complete task is:

    Given two collections of file names
    Create one master list of all file names with no duplicates (case-insensitive, i.e. "file.txt" = "FILE.TXT")
    The master list should be sorted using the same logic as Explorer (using StrCmpLogicalW)
    The solution should be as fast and as memory efficient as possible and scalable to hundreds of thousands of files - maybe even approaching 1 million files, if possible..
    Last edited by AAraya; May 27th, 2018 at 07:19 PM.

  11. #11

    Thread Starter
    Hyperactive Member
    Join Date
    Aug 2011
    Location
    Palm Coast, FL
    Posts
    416

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    I'm sorry but I don't read Russian. Can you explain what this post thread is about? If you're directing me to a code sample, is it the Add routine in your code that you are showing me?

    Code:
    Private Sub Add(Items() As HashItem, Value As String)
        Dim I   As Long
        Dim N   As Long
        I = CalcHash(Value)
        If Len(Items(I).Value) Then
            If StrComp(Items(I).Value, Value, vbTextCompare) = 0 Then Exit Sub
            Do While Items(I).Next
                I = Items(I).Next
                If StrComp(Items(I).Value, Value, vbTextCompare) = 0 Then Exit Sub
            Loop
            N = UBound(Items) + 1
            ReDim Preserve Items(N)
            Items(N).Value = Value
            Items(I).Next = N
        Else: Items(I).Value = Value
        End If
    End Sub

  12. #12

    Thread Starter
    Hyperactive Member
    Join Date
    Aug 2011
    Location
    Palm Coast, FL
    Posts
    416

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Quote Originally Posted by wqweto View Post
    The performance penalty of using err handling comes from VB6 populating Error object on failure. I've been using a tweaked VBA.Collection interface that *does not* raise errors on all its methods but returns the HRESULT as Long. This way failure cases on searching (Item function) or adding (Add sub) are not penalized and the performance is orders of magnitude better.

    Whether or not you are going to be using built-in Collection, still no other data structure can beat a good old hashtable in your use-case performancewise IMO.
    </wqw>
    This was helpful info, thank you. I was already able to improve performance using some of the insights you provided here. Thank you!

  13. #13
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    Quote Originally Posted by AAraya View Post
    So, the complete task is:

    Given two collections of file names
    Create one master list of all file names with no duplicates (case-insensitive, i.e. "file.txt" = "FILE.TXT")
    The master list should be sorted using the same logic as Explorer (using StrCmpLogicalW)
    The solution should be as fast and as memory efficient as possible and scalable to hundreds of thousands of files - maybe even approaching 1 million files, if possible..
    The RC5s cSortedDictionary can accomplish all that already at "Adding-Time" in one go (being always sorted at any time):
    Code:
    'construction
    Set D = New_c.SortedDictionary
        D.SetStringCompareFlags cmpLogical
    
    'and later usage 
        If Not D.Exists(strNewCandidate) Then D.Add strNewCandidate
    If you pass D along as a Parameter into your other routines (which formerly filled your two "Source-Collections"),
    you don't need to populate these Extra-Containers at all - just the magenta-colored line above would be enough.

    Be advised, that StrComp-Logical is an expensive Comparison (in a Sort-Algo, or a sorting collection).
    Normal Case-Insensitive Comparisons/Sorts are about factor 3 faster, as the following example shows:
    Code:
    Option Explicit
    
    Private Sub Form_Click()
        AutoRedraw = True: Cls
        Dim D As cSortedDictionary, i As Long
    
        Const Count As Long = 500000
     
        ReDim SArr(0 To Count - 1) As String
        For i = 0 To Count - 1
          SArr(i) = MonthName((i Mod 12) + 1) & CStr(i Mod Count \ 4)
        Next
     
        New_c.Timing True
          Set D = New_c.SortedDictionary(TextCompare) 
          For i = 0 To Count - 1
            If Not D.Exists(SArr(i)) Then D.Add SArr(i)
          Next
        Print "case-insensitive normal:", D.Count; New_c.Timing
        
        Set D = Nothing 'clear the Dictionary
        
        New_c.Timing True
          Set D = New_c.SortedDictionary
              D.SetStringCompareFlags cmpLogical
          For i = 0 To Count - 1
            If Not D.Exists(SArr(i)) Then D.Add SArr(i)
          Next
        Print "case-insensitive logical:", D.Count; New_c.Timing
    End Sub
    HTH

    Olaf

  14. #14
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: Fastest, most efficient way to create one unique list of strings from two other l

    For a long time I have these 2 routines for merging sorted string arrays.
    Maybe they can be of any use to you.
    Code:
    Private Function MergeIntersection(List1() As Long, List2() As Long, lOutput() As Long) As Boolean
      Dim l1 As Long, l2 As Long
      Dim lU1 As Long, lU2 As Long
      Dim lValue1 As Long, lValue2 As Long
      Dim lCnt As Long
    
      ' If 1 of the arrays is empty then the output is also empty
      If Not IsDimmed(List1) Then
        Erase lOutput
        Exit Function
      End If
    
      If Not IsDimmed(List2) Then
        Erase lOutput
        Exit Function
      End If
    
      lU1 = UBound(List1)
      lU2 = UBound(List2)
    
      ReDim lOutput(lU1)
    
      Do
        lValue1 = List1(l1)
        lValue2 = List2(l2)
        If lValue1 = lValue2 Then
          lOutput(lCnt) = lValue1
          lCnt = lCnt + 1
          l1 = l1 + 1
          l2 = l2 + 1
        ElseIf lValue1 > lValue2 Then
          l2 = l2 + 1
        Else
          l1 = l1 + 1
        End If
      Loop Until l1 > lU1 Or l2 > lU2
    
      If lCnt = 0 Then
        Erase lOutput
      Else
        ReDim Preserve lOutput(lCnt - 1)
        MergeIntersection = True
      End If
    End Function
    
    Private Function MergeUnion(List1() As Long, List2() As Long, lOutput() As Long) As Boolean
      Dim l1 As Long, l2 As Long
      Dim lU1 As Long, lU2 As Long
      Dim lValue1 As Long, lValue2 As Long
      Dim lCnt As Long, i As Long
      Dim bDimmed1 As Boolean, bDimmed2 As Boolean
      
      bDimmed1 = IsDimmed(List1)
      bDimmed2 = IsDimmed(List2)
      
      ' If both arrays are empty then the result is also empty
      If Not bDimmed1 And Not bDimmed2 Then
        Erase lOutput
        Exit Function
      End If
    
      If Not bDimmed1 Then
        lOutput = List2
        MergeUnion = True
        Exit Function
      End If
      
      If Not bDimmed2 Then
        lOutput = List1
        MergeUnion = True
        Exit Function
      End If
    
      lU1 = UBound(List1)
      lU2 = UBound(List2)
    
      ReDim lOutput(lU1 + lU2 + 1)
    
      lValue1 = List1(l1)
      lValue2 = List2(l2)
    
      Do
        If l1 > lU1 Then
          For i = l2 To lU2
            lValue2 = List2(i)
            If lValue2 <> lValue1 Then
              lOutput(lCnt) = lValue2
              lCnt = lCnt + 1
            End If
          Next i
          Exit Do
          
        ElseIf l2 > lU2 Then
          For i = l1 To lU1
            lValue1 = List1(i)
            If lValue1 <> lValue2 Then
              lOutput(lCnt) = lValue1
              lCnt = lCnt + 1
            End If
          Next i
          Exit Do
        
        Else
          If lValue1 = lValue2 Then
            lOutput(lCnt) = lValue1
            l1 = l1 + 1
            l2 = l2 + 1
            If l1 <= lU1 Then lValue1 = List1(l1)
            If l2 <= lU2 Then lValue2 = List2(l2)
          ElseIf lValue1 > lValue2 Then
            lOutput(lCnt) = lValue2
            l2 = l2 + 1
            If l2 <= lU2 Then lValue2 = List2(l2)
          Else
            lOutput(lCnt) = lValue1
            l1 = l1 + 1
            If l1 <= lU1 Then lValue1 = List1(l1)
          End If
          lCnt = lCnt + 1
        End If
      Loop Until l1 > lU1 And l2 > lU2
    
      If lCnt = 0 Then
        Erase lOutput
      Else
        ReDim Preserve lOutput(lCnt - 1)
        MergeUnion = True
      End If
    End Function

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width