What is the most fastest way to see if a string already exists in an array
Anyone know?
Currently I have 25,000 records in a mysql database.
What I am doing right now to see if something exists is the statement..
rs.Open "INSERT INTO yIDs (yID) VALUES ('" & yID & "')", Mysql_Connection
if the string doesnt exist, it is added to the database. if it does, i can tell as an error is raised (since yID is a primary key)
but i need something faster. I need to do about 50 queries a second with different strings, and this method takes a lot of CPU.
Is there a faster way anyone can think of? The ids are ~60 characters long. I need to be able to check if it exists in an array, if not add it.
Would building a string of all the ids on form_load, then using instr be faster?
What about arrays? Dictionary/collection objects? Anything else?
Re: What is the most fastest way to see if a string already exists in an array
why don't you use select query for searching record in table ?
i think it will be faster that what you are doing that by insrt query.
1 Attachment(s)
Re: What is the most fastest way to see if a string already exists in an array
There was a discussion a while back about fast searching methods using strings...I wrote this bit of code back then, which stores every string in one big string with a specific code dividing them (a unique code not in any of the strings)...the code it uses is "¬" so if you wanted to search for the word dog, you would search the string for "¬dog¬" and if it returns a result it's in there...if you *just* searched for "dog" you might get "doghouse", "doggone", etc and this method ensures exact match :-)
You might find this fast enough for your needs, but be aware it uses up quite a bit of memory if you're gonna have 25000 entries...if each word is 8 or less characters, the total memory taken up would be anything up to 200k with 25k for overheads (all the ¬)...at 60 chars each that'd be 1.5MB
Re: What is the most fastest way to see if a string already exists in an array
The fastest way to find a string in an array is to store the strings in the array in sorted order.
Then use a simple binary search algorithm - I've written then dozens of times - I'm sure you can search the forum for one here.
Basically it's a high/low search - so if you have 32768 items in an array you can find if one exists in less then 16 iterations...
Nothing gets faster then that.
You certainly do not want to use a string with INSTR - imagine what that does in memory to find your string!
Re: What is the most fastest way to see if a string already exists in an array
Another way that I have done this is placing the strings in a Hidden Listbox, you can use the SendMessageAPI to determine if the string is in the list... Very quick....
VB Code:
Public Const LB_FINDSTRING = &H18F
Public Const LB_FINDSTRINGEXACT = &H1A2
Public Declare Function SendMessageString Lib "user32" Alias "SendMessageA" (ByVal hwnd as long, ByVal wMsg as Long, ByVal wParam As Long, Byref lParam as String) as Long
Dim ListIndex as long
ListIndex = SendMessageString(Me.List1.hWnd, LB_FINDSTRINGEXACT, -1, ByVal (Cstr(MyString))
If ListIndex = -1 Then
' String NOT in the List '
End If
Re: What is the most fastest way to see if a string already exists in an array
Yeah, a sorted array being used to search probably would be the fastest method available, however it requires that you keep the array sorted which in itself would probably take lots of processor power to do...a simple bubble sort would sort the array though. If your list of strings is going to be dynamic (always changing) my code is probably faster (if you need to remove a word, eg "dog", you would just do replace(string,"¬dog¬","¬")) but if your strings are mostly static (never changing) then using szlamany's suggestion would probably be better
Re: What is the most fastest way to see if a string already exists in an array
You don't need to sort the array of strings - sort a pointer array of integers instead.
Use a QUICK SORT or SHELL SORT - they are fastest (which one can be debated).
If you need to INSERT new entries then it gets much more complex - you can try to amend the pointer-array of sort entries - but that's potentially buggy...
Coming at this from a DB approach you really should load all new entries into a staging table and then INSERT into the production table from that - with something like:
Code:
INSERT INTO yIds
Select yId From TempTable TT
Left Join yIds YI on YI.yId=TT.yId
Where YI.yId is null
or...
Code:
Delete From TempTable Where yId in (Select yId From yIds)
Insert into yIds Select yId From TempTable
and probably several other ways to go about it.
Give us more background - where are these new Ids coming from? Why 50 a second? How quickly do the new ids need to get into the DB?
Re: What is the most fastest way to see if a string already exists in an array
I was considering adding that you could use an array of numbers (rather than moving the text around, the numbers would be less processing) but felt that it was more memory wasted...I don't think that speed is an issue when sorting the data, although it is an issue when searching for an entry. The issue here for me is that having to sort 25,000 entries is going to take a while whatever way you do it...with my instr code you don't need to sort anything, which is why it's better suited to this if the array will be dynamic.
You know, I really should stress-test my prog...give it a large text document and have it split the file into words and add them all into the word list or something...see how long it takes to do them all :-)
Edit: Did a test...130965 words (700k file, "Isaac Asimov - Prelude to Foundation.txt") went through the program in around 140 minutes...that's about 15 words a second, so it's a bit slow...but I did run it in IDE without any speedup optimization or anything so it could be improved on, I am sure :-)
Re: What is the most fastest way to see if a string already exists in an array
Heres a little more info
The app is somewhat of a webpage crawler, grabbing information from websites (IDs) and storing it in a database.
It works using about 200 winsock threads working at once. It retrieves on average, 50 IDs a second. These IDs need to be unique, so i insert the value into a database (and if it fails, i know it exists).
I cannot just save all the ids to a file at once and then add them later, as work needs to be done on the ids as soon as its determined it is unique.
Right now the database is 25k IDs big. It will only get bigger
I was thinking of a sort but the problem is I need to add the IDs too.. (not just retrieve). Perhaps I could have a sort on the 25k IDs already in the database, check for duplicate on those first, and then use a dictionary object or something to see if there is a duplicate present on new IDs obtained on the fly?
Re: What is the most fastest way to see if a string already exists in an array
webpage crawler? what kind of id's are u collecting?
Re: What is the most fastest way to see if a string already exists in an array
So starting with a sorted array of 25K id's is no problem - right?
Doing a binary search on those 25K items is quick and most efficient - right?
Then you need to handle the "new ids" - determining if they have come in recently.
You can use the string/INSTR method - see if that's sufficient. It won't have 25K entries - only the number just processed as new.
Or if this proves to be a problem then you are loading them into a link-list array in sorted position - this is standard practice, just needs lots of testing.
Re: What is the most fastest way to see if a string already exists in an array
how can i do a binary search?
Re: What is the most fastest way to see if a string already exists in an array
A binary search is basically what the non-idiotic Price is Right contestants do on that beat the clock game.. Say you have 1024 possibilities:
You guess 512 and Bob says "HIGHER!"
You guess 768 (512+256) and Bob says "HIGHER"
You guess 896 (512+256+128) and Bob says "LOWER"
You guess 832 (512+256+128-64) and Bob says "HIGHER"
You guess 864 (512+256+128-64+32) and Bob says "HIGHER"
You guess 880 (512+256+128-64+32+16) and Bob says "LOWER"
You guess 872 (512+256+128-64+32+16-8) and Bob says "LOWER"
You guess 868 (512+256+128-64+32+16-8-4) and Bob says "HIGHER"
You guess 870 (512+256+128-64+32+16-8-4+2) and Bob says "LOWER!!"
You guess 869 (512+256+128-64+32+16-8-4+2-1) Then the music starts playing and you get to pay taxes on something you didn't really want/need..:)
BTW, 1024=2^10 obviously and I picked a worst case scenario where 10 iterations were needed since we weren't lucky enough to find it outright part way through
Re: What is the most fastest way to see if a string already exists in an array
Quote:
Originally Posted by VaxoP
how can i do a binary search?
Google is your friend. Here's the #1 result from a search of "VB6 binary search". You can also search for "VB6 binary sort" if you need more info.
http://www.devx.com/vb2themax/Tip/18913
Re: What is the most fastest way to see if a string already exists in an array
thanks guys :)
binary search was 156x faster lol
Re: What is the most fastest way to see if a string already exists in an array
Quote:
Originally Posted by Cavar
Strange, testing that, I did this..
VB Code:
Private arr() As String
Private Sub Form_Load()
Dim i As Long
ReDim arr(100000)
For i = 0 To UBound(arr)
arr(i) = i
Next
MsgBox BinarySearch(arr, "1000", 100000) ' = -1
End Sub
Result = -1 (not found) :sick:
Re: What is the most fastest way to see if a string already exists in an array
this works
Dim arr() As String
Dim i As Long
ReDim arr(100000)
For i = 0 To UBound(arr)
arr(i) = Format$(i, "000000")
Next
MsgBox BinarySearch(arr, "001000") ' = -1
Re: What is the most fastest way to see if a string already exists in an array
Quote:
Originally Posted by VaxoP
thanks guys :)
binary search was 156x faster lol
Now I feel i need to reiterate the facts...Binary search *IS* a lot faster than any other method but it REQUIRES that the data be sorted...if you have dynamic data (always changing) which I think you do considering you are looking for duplicates and adding the data if it isn't a duplicate, then this option WILL NOT WORK as you need to sort the list each time. Now, you COULD use a static list that's pre-sorted and have a dynamic list that's sorted each time you add a word (I think this was mentioned before...not sure)...this would result in a very flexible go-between that would allow you to use this method with any size database but you may want to merge the dynamic and static list. If you plan to merge the two lists however, seeing as you have the lists split, you could probably add each word one at a time rather than just chucking them in and sorting the list after...you just find the position where the word should be then move everything after it to the next position and add the word (or work out what position all the words in dynamic list should go into the static list and do a clever merge after making space)
The bottom line here is a basic binary sort is not going to work with a list of 25,000 words if you are adding words to it...but it can be made to work :-)
Re: What is the most fastest way to see if a string already exists in an array
Ok - here's a little reality on how SQL and other ISAM-based index systems work.
First they all use BINARY searches to look through the INDEX for an entry. There is basically nothing faster.
And new entries are added to an INDEX all the time...
But the INDEX is not re-sorted after each INSERT.
Most of the "buckets" of data in an index have free space - you can select the percentage of free space. So if 50% of a bucket is empty, new entries are put into that free space - after finding the proper spot in the sorted order.
When a bucket fills a split occurs - pointers are added to the spot where the split occured indicating where the two split buckets of index leaves are now residing.
At some point in time the engine determines that an index would be better served by rebuilding it - and the sort occurs...
The reason I mention all this is just to point out how far you can truly go with this type of concept.
Re-sorting the list is not required - simply getting around the sort issue is.
If you maintain a pointer array - all you have to do is open a spot in the pointer array to fit the new entry in. To clarify, the pointer array simply is an array of INTEGER values - matching the size of the "string array" of actual entries. It starts off it's life with Pointer(1)=1, Pointer(2)=2, Pointer(3)=3, Pointer(x)=x - with x being the size of the array.
When you have to insert a new entry into the array, it goes into the "last" spot of the "string array" - let's say StringValue(25001)="some new value".
If that new value belongs in the 5000th spot in regard to it's sorted position, then you need to move each Pointer(x) up one spot - basically Pointer(25001)=Pointer(25000), Pointer(25000)=Pointer(24999) - done with a simple loop.
You do this all the way to Pointer(5001)=Pointer(5000).
This frees up Pointer(5000) - which you now load with the number 25001 (since that is where you put the new value).
Back in our mainframe days we would have used a "memory move" operation to actually shift this data around, so looping was not required.
Although I still feel, as stated in a prior post, that putting the new entries into the main 25000 entry array is simply not needed. You can maintain a separate array of new entries - or a string of new entries and do an INSTR on that string. You could consider re-loading the main array after so main strings get put into the "second-tier-search-array".
Re: What is the most fastest way to see if a string already exists in an array
Quote:
Originally Posted by VaxoP
this works
VB Code:
Dim arr() As String
Dim i As Long
ReDim arr(100000)
For i = 0 To UBound(arr)
arr(i) = Format$(i, "000000")
Next
MsgBox BinarySearch(arr, "001000") ' = -1
Ok. Now if youll use fixed length strings like that, maybe you could consider using VB native filter. This works even if the array is not sorted. Its slower than BinarySearch, but not so much..
VB Code:
If UBound(Filter(arr, "095550", True)) = -1 Then
MsgBox "Item Not found"
Else
MsgBox "Item found"
End If
Filter has even more functionality than this, it returns..
- an array of matches (using True as 3rd parameter)
- an array of no matches (using False as 3rd parameter)