how do i filter an array using another array?
Printable View
how do i filter an array using another array?
Can you give an example ? or maybe explain more ?
given an array A, remove all duplicates of array element B from A
If your data comes from a database, then something like this would be much easier to do on the SQL side...
Removing items in an array is never easy. If you convert the array into a collection, then it will be much easier. Is that an option for you ? or does it HAVE to be an array ?
By the way, how many items in the array ? thousands, millions ?
The brute force ways are relatively simple--loop over A and check if each element is in B. If not, add it to the end of another data structure--ReDim Preserve would work, for instance. This is hideously inefficient, but for smallish arrays it just doesn't matter.
More efficient methods could use a hash function. Hash B into its own boolean array HashB. Iterate through A, hashing each A(a). Check if HashB(Hash(A(a))) is true; if so, and you didn't have any collisions while making HashB, mark a down in a separate boolean array the same length as A. If you did have collisions, first test if A(a) = B(b) for each b that went into Hash(A(a))--you'd need to keep a list somehow. Finally, after you've finished marking which ones to delete, create a whole new array by iterating through your boolean marks. You should keep a count of how many deleted items there are so you can figure out the length of the final array in one operation.
Sounds like both your methods are pretty much the same, actually your seocnd method has more unneeded work.
But as you said for a resonable array size, for say arrays "a" and "b", loop through "a" and "b" at same time and check for any NON duplicates send these non(or not non) duplicates to a new array(tmpArr), then if you like erase the old array and add these temp array elements dups back to the old array.
If it was a big array and going to eat up your system i would write it in C++ (but i always say that :) ) you can call the c++ from VB if you like, still much faster.
coolcurrent4u has only given us a poor description of what he wants. The datatype is really quite important as are the quantities (A & B).
More info would be nice, but just assuming you have fixed lengh string arrays and you want to remove from array a all items present in array b you could use a Func like this:
Working example:Code:Private Function FilterArray(a() As String, b() As String) As String()
Dim i As Long
For i = LBound(b) To UBound(b)
a = Filter(a, b(i), False)
Next
End Function
EDIT: Now i remember this will also exclude partial matches, it will work just if those arrays contain fixed lengh strings, else: if A contains an item "DOG" and B contains an item "DO" then item A "contains" item B, and "DOG" will be removed from array A, and i don't think you want this to happen.Code:Dim a() As String, b() As String, i As Long
ReDim a(6) 'Will have 7 items (a,b,c,d,e,f,g)
ReDim b(3) 'Will have 4 items (a,b,c,d)
'Fill Arrays..
For i = 0 To 6
a(i) = Chr(97 + i)
Next
For i = 0 To 3
b(i) = Chr(97 + i)
Next
'Call the Func
FilterArray a(), b()
'Show Results: (e,f,g)
For i = LBound(a) To UBound(a)
Debug.Print a(i)
Next
The array elements are strings, and am processing millions of it. Specifically, am using all words from english dictionary :eek: as arrayA, and arrayB is words i specify
arrayA = this, is, one, array, with, strings, elements
arrayB = this, is, one
Remove all elements of arrayB from arrayA, so arrayC should be
arrayC =array, with, strings, elemenet
thanks to all that contributed, i have finally come up with a solution.
please see the attached. if you can make it faster or optimize it, please let me know
I've not looked at your solution.
I think the easiest way to implement this would be with a database.
If you wanted to use some sort of custom system then here are my thoughts.
The big list of words is fixed so there a few things you can do to optimise it for searching. You can order it alphabetically and by word length, i.e. all the 6 letter words are grouped together in alphabetical order. You can then use a binary search to locate (or not) the word in the respective subset. This has the potential to locate (or not) any word by only comparing a few strings (A subset of 1'000'000 would require a max of 20 comparisons). You could also break down each subset further by indexing by say first letter but it would not be a great saving as binary searching is so effective on it's own.
Finally the array you build with the results does not have to be a string array, it could be a array of long that stores the indexes from the original big list.
memory wise...
OED2 defines over 600'000 words many of these are very uncommon so you could decide just to use a subset but if you go with the big list... If the average word length was 7 and you stored each one as a separate string it would take something like 14Mb ram (similar to a 24bpp bitmap 2200x2200). You could instead use ANSI strings and rather than have each word as a separate string you could divide the words up by their length and store each group in a single string, reducing the footprint to 4Mb or so. You would then use MidB$ to perform the binary search.
if you say i should use a database, do you mean i have to store each search in a database before actually performing the search?
i have modified the two functions to produce one
vb Code:
'-------------- ' Filter Keep Words '-------------- Function FilterArray(ByVal Source As String, ByVal Search As String, Optional _ ByVal Keep As Boolean = True) As String Dim i As Long, j As Long, k As Long Dim SourceArray() As String Dim SearchArray() As String Dim strTemp() As String Dim sTemp As String Dim iSearchLower As Long Dim iSearchUpper As Long Dim iSourceLower As Long Dim iSourceUpper As Long 'TEST IF WE HAVE VALUES If LenB(Source) <> 0 And LenB(Search) <> 0 Then SourceArray = Split(Source, " ") SearchArray = Split(Search, " ") Else FilterArray = Source Exit Function End If iSearchLower = LBound(SearchArray) iSearchUpper = UBound(SearchArray) iSourceLower = LBound(SourceArray) iSourceUpper = UBound(SourceArray) '// Check if to keep or remove If Keep = True Then For i = iSearchLower To iSearchUpper For j = iSourceLower To iSourceUpper If InStrB(SourceArray(j), SearchArray(i)) <> 0 Then k = k + 1 ReDim Preserve strTemp(1 To k) strTemp(k) = SourceArray(j) DoEvents End If Next j Next i FilterArray = Join(strTemp, " ") Else For i = iSearchLower To iSearchUpper DoEvents Source = Join(Filter(Split(Source, " "), SearchArray(i), False, _ vbTextCompare), " ") Next i FilterArray = Source End If End Function
I said a database would probably be the easiest to implement.
I went on to describe techniques you could use if you did not use a database.
Do you know what a binary search is?
i do not have much information on binary search, perhaps you could help me know more, and how to implement it.
To save you the trouble of typing 'binary search' in a text box, does this help? Give the whole thread a quick read.
The line "ReDim Preserve strTemp(1 To k)" from post #13 is not a good idea with arrays your size because VB uses very stupid array resizing techniques with the Preserve keyword. You should instead try
They're not even remotely similar. The first method I mentioned does incremental memory reallocations while the second does a single one. The first reads each array object in its entirety to test for equality while the second only requires a hashing function which can read a small fraction of the object if well crafted. The second is wildly more efficient, asymptotically, though more complicated; the first is naive. I'm not sure how you could think they're similar, and it's a little insulting that you said the second does unnecessary work apparently without understanding it.Code:ReDim strTemp(1 To iSourceUpper - iSourceLower + 1)
For i = iSearchLower To iSearchUpper
For j = iSourceLower To iSourceUpper
If InStrB(SourceArray(j), SearchArray(i)) <> 0 Then
k = k + 1
strTemp(k) = SourceArray(j)
DoEvents
End If
Next j
Next i
FilterArray = Join(strTemp, " ")
FilterArray = Trim$(FilterArray)
But as it turns out the OP's "array" is not an array at all, but a space-separated string of words. My above methods would need modification to work in this case, though the "Keep = True" case of post #13 is basically my first method.
ok, i have modified the function even more with less loop, please tell me what you think
vb Code:
Function FilterArray(ByVal Source As String, ByVal Search As String, Optional _ ByVal Keep As Boolean = True) As String Dim i As Long Dim SearchArray() As String Dim iSearchLower As Long Dim iSearchUpper As Long If LenB(Source) <> 0 And LenB(Search) <> 0 Then SearchArray = Split(Search, " ") Else FilterArray = Source Exit Function End If iSearchLower = LBound(SearchArray) iSearchUpper = UBound(SearchArray) For i = iSearchLower To iSearchUpper DoEvents Source = Join(Filter(Split(Source, " "), SearchArray(i), Keep, _ vbTextCompare), " ") Next i FilterArray = Source End Function
I don't know what Jem is going to say but I suggest either work with a large string or work with a string array, don't split the large string for every search. A binary search would be much much quicker, although a well implemented hash function/hash table system has potential to be quicker still.
Your method certainly isn't the most efficient I can think of, but if you're happy with the time it takes, then great. The Keep=False case is particularly slow since you're Splitting potentially huge amounts of text and Joining them repeatedly which requires lots of data copying. The Keep=True case is probably quite quick since your first Filter will remove all elements but one, excepting duplicates. For instance, 'FilterArray("cow dog donkey", "dog")' returns "dog" and 'FilterArray("cow dog donkey", "dog donkey")' returns "". This is probably not the behavior you actually wanted, since FilterArray becomes a sort of boolean InStr, so I assume you've only been using Keep=False.
Are you unhappy with your current speed? You could avoid repeated Split and Joins by operating on an intermediate array instead. That should be faster.
Edit: In the Keep=True case, I agree with Milk--keeping a sorted string array of Source and applying binary search is far, far more efficient. In the Keep=False case, this doesn't work well.
i found a class of a hash table, but i can't or haven't found a binary search implementation in vb yet. Besides, there are many search algorithms out there like radix, balanced tree, binary tree search, interpolation etc, some of which is better that binary searching.
Before you suggested binary, i believe you have an implementation or have been using binary search.
Can you help me with some implementation?
How can i apply a hash table to my problem?
I would forget about hash functions / hash tables, it's more advanced than a binary search which you appear to be struggling with, and it won't be particularly easy to implement effectively in VB6
I have already linked to a thread with a couple of VB6 binary search implementations you could just copy a paste, as well as a Wikipedia article which discusses the algorithm in some detail. I can't really be more explicit.
ok now lets assume i want to use a binary search and filter elements
i have to sort the arrayA and arrayB
do the search removing every element of arrayB from arrayA
i definitely need a temporary array to copy to or do some sort of redim preserve
that been said, i think this will take more time and memory, and going to implement it anyway and see the difference. Am concerned about large arrayA and or large arrayB. Any more suggestions will be appreciated
No you just need to sort array A, this just needs to be done once during the lifetime of the program. You would pass the binary search function, array A and a single string from array B. You would do this for every string in Array B. You actually do not have to redim preserve anything if you use the binary search function to populate a boolean (or bitset byte) array from which you build your final array, C.
Sure:
This should be far more efficient than what you had, but it still copies huge chunks of Source repeatedly, which is inefficient when Source is very large.Code:Function FilterArray(ByVal Source As String, ByVal Search As String, Optional _
ByVal Keep As Boolean = True) As String
Dim i As Long
Dim SearchArray() As String
Dim IntermediateArray() As String
Dim iSearchLower As Long
Dim iSearchUpper As Long
If LenB(Source) <> 0 And LenB(Search) <> 0 Then
SearchArray = Split(Search, " ")
Else
FilterArray = Source
Exit Function
End If
iSearchLower = LBound(SearchArray)
iSearchUpper = UBound(SearchArray)
IntermediateArray = Split(Source, " ")
For i = iSearchLower To iSearchUpper
DoEvents
IntermediateArray = Filter(IntermediateArray, SearchArray(i), Keep, _
vbTextCompare)
Next i
FilterArray = Join(IntermediateArray, " ")
End Function
I agree with Milk that my hash table idea is too advanced (at least in VB6), unless this is incredibly important. The various other search methods are probably also more advanced than you're looking for, consider VB6's almost complete lack of a general purpose class library. I can only imagine the horror that would be a Red/Black tree implementation. (Off topic: for those interested, here's an amazing implementation of Red/Black trees in Haskell, a functional language. It's so brief and cute, it's remarkable. Also, the linked paper was written by a professor of mine's student or friend, I can't quite remember which.)
To elaborate slightly on Milk's solution, you would do something like this pseudovbcode:
Code:'Assumes A is a string array instead of a single string
' A is sorted and probably globally available, for fast binary search.
'Similarly B is a string array.
Dim NotInB(LBound(A) To UBound(A)) As Boolean
For b = LBound(B) to UBound(B)
a = BinarySearch(ByRef A, B(b)) 'Returns index at which B(b) occurs in A, or LBound(A)-1 if not found
If a >= LBound(A) Then
NotInB(a) = True
NumFound = NumFound + 1
End If
Next b
ReDim NewA(LBound(A) To UBound(A) - NumFound) As String
For a = LBound(A) to UBound(A)
If NotInB(a) Then
NumSkipped = NumSkipped + 1
End If
NewA(a - NumSkipped) = A(a)
Next a
'return join(NewA, " ")