Re: What is the most fastest way to see if a string already exists in an array
There was a discussion a while back about fast searching methods using strings...I wrote this bit of code back then, which stores every string in one big string with a specific code dividing them (a unique code not in any of the strings)...the code it uses is "¬" so if you wanted to search for the word dog, you would search the string for "¬dog¬" and if it returns a result it's in there...if you *just* searched for "dog" you might get "doghouse", "doggone", etc and this method ensures exact match :-)
You might find this fast enough for your needs, but be aware it uses up quite a bit of memory if you're gonna have 25000 entries...if each word is 8 or less characters, the total memory taken up would be anything up to 200k with 25k for overheads (all the ¬)...at 60 chars each that'd be 1.5MB
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
Re: What is the most fastest way to see if a string already exists in an array
Another way that I have done this is placing the strings in a Hidden Listbox, you can use the SendMessageAPI to determine if the string is in the list... Very quick....
VB Code:
Public Const LB_FINDSTRING = &H18F
Public Const LB_FINDSTRINGEXACT = &H1A2
Public Declare Function SendMessageString Lib "user32" Alias "SendMessageA" (ByVal hwnd as long, ByVal wMsg as Long, ByVal wParam As Long, Byref lParam as String) as Long
Re: What is the most fastest way to see if a string already exists in an array
Yeah, a sorted array being used to search probably would be the fastest method available, however it requires that you keep the array sorted which in itself would probably take lots of processor power to do...a simple bubble sort would sort the array though. If your list of strings is going to be dynamic (always changing) my code is probably faster (if you need to remove a word, eg "dog", you would just do replace(string,"¬dog¬","¬")) but if your strings are mostly static (never changing) then using szlamany's suggestion would probably be better
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
Re: What is the most fastest way to see if a string already exists in an array
You don't need to sort the array of strings - sort a pointer array of integers instead.
Use a QUICK SORT or SHELL SORT - they are fastest (which one can be debated).
If you need to INSERT new entries then it gets much more complex - you can try to amend the pointer-array of sort entries - but that's potentially buggy...
Coming at this from a DB approach you really should load all new entries into a staging table and then INSERT into the production table from that - with something like:
Code:
INSERT INTO yIds
Select yId From TempTable TT
Left Join yIds YI on YI.yId=TT.yId
Where YI.yId is null
or...
Code:
Delete From TempTable Where yId in (Select yId From yIds)
Insert into yIds Select yId From TempTable
and probably several other ways to go about it.
Give us more background - where are these new Ids coming from? Why 50 a second? How quickly do the new ids need to get into the DB?
*** Read the sticky in the DB forum about how to get your question answered quickly!! ***
Please remember to rate posts! Rate any post you find helpful - even in old threads! Use the link to the left - "Rate this Post".
Re: What is the most fastest way to see if a string already exists in an array
I was considering adding that you could use an array of numbers (rather than moving the text around, the numbers would be less processing) but felt that it was more memory wasted...I don't think that speed is an issue when sorting the data, although it is an issue when searching for an entry. The issue here for me is that having to sort 25,000 entries is going to take a while whatever way you do it...with my instr code you don't need to sort anything, which is why it's better suited to this if the array will be dynamic.
You know, I really should stress-test my prog...give it a large text document and have it split the file into words and add them all into the word list or something...see how long it takes to do them all :-)
Edit: Did a test...130965 words (700k file, "Isaac Asimov - Prelude to Foundation.txt") went through the program in around 140 minutes...that's about 15 words a second, so it's a bit slow...but I did run it in IDE without any speedup optimization or anything so it could be improved on, I am sure :-)
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
Re: What is the most fastest way to see if a string already exists in an array
Heres a little more info
The app is somewhat of a webpage crawler, grabbing information from websites (IDs) and storing it in a database.
It works using about 200 winsock threads working at once. It retrieves on average, 50 IDs a second. These IDs need to be unique, so i insert the value into a database (and if it fails, i know it exists).
I cannot just save all the ids to a file at once and then add them later, as work needs to be done on the ids as soon as its determined it is unique.
Right now the database is 25k IDs big. It will only get bigger
I was thinking of a sort but the problem is I need to add the IDs too.. (not just retrieve). Perhaps I could have a sort on the 25k IDs already in the database, check for duplicate on those first, and then use a dictionary object or something to see if there is a duplicate present on new IDs obtained on the fly?
Re: What is the most fastest way to see if a string already exists in an array
So starting with a sorted array of 25K id's is no problem - right?
Doing a binary search on those 25K items is quick and most efficient - right?
Then you need to handle the "new ids" - determining if they have come in recently.
You can use the string/INSTR method - see if that's sufficient. It won't have 25K entries - only the number just processed as new.
Or if this proves to be a problem then you are loading them into a link-list array in sorted position - this is standard practice, just needs lots of testing.
*** Read the sticky in the DB forum about how to get your question answered quickly!! ***
Please remember to rate posts! Rate any post you find helpful - even in old threads! Use the link to the left - "Rate this Post".
Re: What is the most fastest way to see if a string already exists in an array
A binary search is basically what the non-idiotic Price is Right contestants do on that beat the clock game.. Say you have 1024 possibilities:
You guess 512 and Bob says "HIGHER!"
You guess 768 (512+256) and Bob says "HIGHER"
You guess 896 (512+256+128) and Bob says "LOWER"
You guess 832 (512+256+128-64) and Bob says "HIGHER"
You guess 864 (512+256+128-64+32) and Bob says "HIGHER"
You guess 880 (512+256+128-64+32+16) and Bob says "LOWER"
You guess 872 (512+256+128-64+32+16-8) and Bob says "LOWER"
You guess 868 (512+256+128-64+32+16-8-4) and Bob says "HIGHER"
You guess 870 (512+256+128-64+32+16-8-4+2) and Bob says "LOWER!!"
You guess 869 (512+256+128-64+32+16-8-4+2-1) Then the music starts playing and you get to pay taxes on something you didn't really want/need..
BTW, 1024=2^10 obviously and I picked a worst case scenario where 10 iterations were needed since we weren't lucky enough to find it outright part way through
Re: What is the most fastest way to see if a string already exists in an array
this works
Dim arr() As String
Dim i As Long
ReDim arr(100000)
For i = 0 To UBound(arr)
arr(i) = Format$(i, "000000")
Next
MsgBox BinarySearch(arr, "001000") ' = -1
Re: What is the most fastest way to see if a string already exists in an array
Originally Posted by VaxoP
thanks guys
binary search was 156x faster lol
Now I feel i need to reiterate the facts...Binary search *IS* a lot faster than any other method but it REQUIRES that the data be sorted...if you have dynamic data (always changing) which I think you do considering you are looking for duplicates and adding the data if it isn't a duplicate, then this option WILL NOT WORK as you need to sort the list each time. Now, you COULD use a static list that's pre-sorted and have a dynamic list that's sorted each time you add a word (I think this was mentioned before...not sure)...this would result in a very flexible go-between that would allow you to use this method with any size database but you may want to merge the dynamic and static list. If you plan to merge the two lists however, seeing as you have the lists split, you could probably add each word one at a time rather than just chucking them in and sorting the list after...you just find the position where the word should be then move everything after it to the next position and add the word (or work out what position all the words in dynamic list should go into the static list and do a clever merge after making space)
The bottom line here is a basic binary sort is not going to work with a list of 25,000 words if you are adding words to it...but it can be made to work :-)
I love helping noobs with their VB problems (probably because, as an amateur programmer, I am only slightly better at VB than them :-)) but if you SERIOUSLY want to get help for free from a community such as VBForums, you have to first have a grounding (basic knowledge) in VB6, otherwise you're way too much work to help...You've got to give a little if you want to get help from us, in other words!
And we DON'T do your homework. If your tutor doesn't teach you enough to help you make the project without his or her help, FIND A BETTER TUTOR or try reading books on programming! We are happy to help with minor things regarding the project, but you have to understand the rest of it if you want our help to be useful.
Re: What is the most fastest way to see if a string already exists in an array
Ok - here's a little reality on how SQL and other ISAM-based index systems work.
First they all use BINARY searches to look through the INDEX for an entry. There is basically nothing faster.
And new entries are added to an INDEX all the time...
But the INDEX is not re-sorted after each INSERT.
Most of the "buckets" of data in an index have free space - you can select the percentage of free space. So if 50% of a bucket is empty, new entries are put into that free space - after finding the proper spot in the sorted order.
When a bucket fills a split occurs - pointers are added to the spot where the split occured indicating where the two split buckets of index leaves are now residing.
At some point in time the engine determines that an index would be better served by rebuilding it - and the sort occurs...
The reason I mention all this is just to point out how far you can truly go with this type of concept.
Re-sorting the list is not required - simply getting around the sort issue is.
If you maintain a pointer array - all you have to do is open a spot in the pointer array to fit the new entry in. To clarify, the pointer array simply is an array of INTEGER values - matching the size of the "string array" of actual entries. It starts off it's life with Pointer(1)=1, Pointer(2)=2, Pointer(3)=3, Pointer(x)=x - with x being the size of the array.
When you have to insert a new entry into the array, it goes into the "last" spot of the "string array" - let's say StringValue(25001)="some new value".
If that new value belongs in the 5000th spot in regard to it's sorted position, then you need to move each Pointer(x) up one spot - basically Pointer(25001)=Pointer(25000), Pointer(25000)=Pointer(24999) - done with a simple loop.
You do this all the way to Pointer(5001)=Pointer(5000).
This frees up Pointer(5000) - which you now load with the number 25001 (since that is where you put the new value).
Back in our mainframe days we would have used a "memory move" operation to actually shift this data around, so looping was not required.
Although I still feel, as stated in a prior post, that putting the new entries into the main 25000 entry array is simply not needed. You can maintain a separate array of new entries - or a string of new entries and do an INSTR on that string. You could consider re-loading the main array after so main strings get put into the "second-tier-search-array".
*** Read the sticky in the DB forum about how to get your question answered quickly!! ***
Please remember to rate posts! Rate any post you find helpful - even in old threads! Use the link to the left - "Rate this Post".
Re: What is the most fastest way to see if a string already exists in an array
Originally Posted by VaxoP
this works
VB Code:
Dim arr() As String
Dim i As Long
ReDim arr(100000)
For i = 0 To UBound(arr)
arr(i) = Format$(i, "000000")
Next
MsgBox BinarySearch(arr, "001000") ' = -1
Ok. Now if youll use fixed length strings like that, maybe you could consider using VB native filter. This works even if the array is not sorted. Its slower than BinarySearch, but not so much..
VB Code:
If UBound(Filter(arr, "095550", True)) = -1 Then
MsgBox "Item Not found"
Else
MsgBox "Item found"
End If
Filter has even more functionality than this, it returns..
- an array of matches (using True as 3rd parameter)
- an array of no matches (using False as 3rd parameter)