1. ## Interpolation Search

To search for strings in a sorted array I use a "BinarySearch" algorithm; I found, in an old manual, the indication of the greater efficiency of the "Interpolation Search" algorithm compared to the "BS"; the transcribed example is for "array long"; does anyone know if it's possible to convert this, to search for a string in an array of strings?

List() is the sorted array to search.

Code:
```' ************************************************
' Locate the item using interpolation search.
' ************************************************
Public Function InterpSearch(target As Long) As Long
Dim min As Long
Dim max As Long
Dim middle As Long

NumSearches = 0

min = 1
max = NumItems
Do While min <= max
' Prevent division by zero.
If List(min) = List(max) Then
' This must be the item (if it's in the list).
If List(min) = target Then
InterpSearch = min
Else
InterpSearch = 0
End If
Exit Function
End If

' Compute the dividing point.
middle = min + (target - List(min)) * _
((max - min) / (List(max) - List(min)))

' Make sure we stay in bounds.
If middle < min Or middle > max Then
' It's not in the list.
InterpSearch = 0
Exit Function
End If

NumSearches = NumSearches + 1
If target = List(middle) Then       ' We found it.
InterpSearch = middle
Exit Function
ElseIf target < List(middle) Then   ' Search the left half.
max = middle - 1
Else                                ' Search the right half.
min = middle + 1
End If
Loop

' If we got to this point, the item is not in the list.
InterpSearch = 0
End Function```

2. ## Re: Interpolation Search

Interpolation search is pretty easy to implement, it involves estimating (interpolating) the point at which the value you are looking for will be. I've never used it, but if I were to do so I would use binary search until I had one point before and one point after the target value. I would then divide the difference between the lower and upper value by the number of values and work out how many times the result fits into our target value and it should point to the location (or a close approximation). But of course, you have code that does that job...you're asking if it can be used in strings, which are essentially binary values underneath. You need to find a way to convert your string to a binary value and then compare it to the array.

Let's start with the obvious...let's assume it's alphanumeric and they're all words in the array, beginning with A-Z. You look at the value of the first character in your string to be searched for, and assign it a value 1-26 based on the character (you can just do 0-255 if you have random binary). Divide the number of array elements by 26 and multiply the result by the value you got...you should be pointed at the upper boundary of where the words with that letter are. if you do the same but divide by 26 then multiply by value - or + 1 (depending on how far off your first result is) you should have an upper and lower bound now. It isn't an exact science as you're interpolating based on the letters being equal.

Let's say you now have a lower bound that is below your target, and an upper bound that is above...you now look at the value of the second character and work with those two bounds and do the same thing.

If you don't have a valid lower/upper bound, you can set the "number of array elements" to the lower bound (if both lower and upper bound are above your target...obviously you set the starting point to the upper bound if both are below it) and redo the test.

Just to quickly point out, you don't always use 0-255 or 1-26...you base lower and upper bound on the value of the character in the string at the first and last pointer in the array.

Think of interpolation search as an intelligent binary search, using logic to choose where to look next.

However, if you're willing to use a little memory, storing them in a comma-delimited string and using instr() would probably be a hell of a lot easier, it's usually what most people do...it's also pretty fast, though probably not if you plan to change the contents of the list often...it's good if you are comparing against a static list :-)

3. ## Re: Interpolation Search

Yes, no and maybe, potentially. It depends...

Yes: If your "strings" are LONG numbers, than you just need to convert the List(strings) into Temp(longs), with matching index values. If you find the solution in the List(longs), then pull your answer from List(strings). As opposed to the possibly slower "conversion" of a Long to a String. (Conversion of a number to a string adds a space which has to be removed. Str(43) would become " 43", as a string. As opposed to your List(strings), which is already in your desired format.)

No: If your strings are ASCII, this will not help much. You would have to MATH the string into a large number that will surely be larger than anything VB6 can handle as a non-string number format. (A binary search is faster because it views strings as individual numbers for comparison, not "relative ASCII patterns". Where "Cat" could = "cat", "caT", "CAT", "cAT". In binary, those are all different "numeric values". That is why binary search is faster, in some aspects, but not others.)

"Cat" as an actual number, not a set of numbers, is a whole other ball-game.
As a set, "Cat" = 67, 97, 116 (Asc("C"), Asc("a"), Asc("t"))
As a single number, inverse-order, that is..
"Cat"...
"C" = 67 x 1 = 67
"a" = 97 x 256 = 24832
"t" = 116 x 65536 = 7602176
Total = 7627075

As opposed to "cAT"
"c" = 99 x 1 = 99
"A" = 65 x 256 = 16640
"T" = 84 x 65536 = 5505024
Total = 5521763

"Cat" != "cAT"
"Cat" > "cAT" *(Unless you reverse the order, then it would be "Cat" < "cAT", because "c" < "C")

That is just a simple 3 letter word. You notice that the multiple (power of 256^x) grows a bit quick, to make the string as a true "single number 7627075", as opposed to a "set of numbers 67, 97, 116".

Maybe: On a side note, this is what actually helps make databases fast, with some indexes/glossary/dicts. They take the first three string values and turn them into a single number. Forced as "lower-case" and also "raw", to aid in "fast text searches".

NOTE: Order in VB6 would not matter, because there is no fast way to convert that number into a string. This is due to an inability to do any "bit-shifting", at all. However, if you only have 4-character strings, there is a cool trick using "copy-memory" and RGB(string), which will actually convert the string into an RGBA "long" value, from the four individual bytes of the string. Though VB6 doesn't use the "A" in RGBA, in the RGB() function... it is still there and changes the "Long" value that it results in.

4. ## Re: Interpolation Search

If you want to get a bit more involved for a speedier result, the use of a static value of 1 per character (if you are searching through English words) is inefficient as certain letters (Q, J, Z, etc) are less likely and vowels (A, E, I, O, U, and sometimes Y so you might want to include that) are more prevalent. If you assign values to each letter based on this you should get more accurate pointers.

If your string is static, then pre-generating the starting pointer for AA to ZZ into an array would be a good idea...with this, you can instantly cut down your search parameters by a factor of 26*26 (676) which would mean going from 1m words down to a subsection of only 1480 words (I'm sure you'll agree that sounds worth it). Pre-generating AAA to ZZZ would bring that number down to an even more manageable 57 (on average...probably more likely to be over 100 but still far lower than 1m). This would give a starting pointer AND ending pointer (as you use the starting pointer for the next value and deduct 1) that you can then drill down with using whatever search method you want

Edit: The pre-generating can also be done with binary values 0-255 rather than just AA-ZZ, but if you go with 2 byte the array will need to be 65536 values (acceptable, and it reduces search radius by a factor of 65536) and if you go with 3 byte the array will need to be 16.7m values...it's a lot of memory to waste unless you don't need it for anything else.

5. ## Re: Interpolation Search

It could be adapted to strings, but I'm not sure it would work very well. It's not much different from a binary search, except that, when finding the mid-point of the "half" it's working on, it uses linear interpolation rather than just going to the median of that "half".

I say that it wouldn't work very well on strings is because, to do the linear interpolation, you'd need to convert your strings to numbers somehow (maybe average their ANSI values, or some other way), and I seriously doubt that'd make much sense for strings. So, rather than increasing efficiency, it's very likely to actually decrease it. As opposed to an actual array of numbers, strings tend to not have any rhyme-or-reason for taking on various values.

Even with numbers, if the frequency distribution for the array being sorted is fairly flat, an interpolating search is going to be slower than a binary search because of the increased overhead of finding the interpolated center on every "half" cut.

Personally, unless you know your array has "clumps of values", I'd stick with a binary search. Or, if you truly want improved speed, create some kind of hash-bin index for your data, or possibly a btree with more than just two (binary) values in each branch. For most data, the hash-bin index is found to be the fastest.

6. ## Re: Interpolation Search

Originally Posted by Elroy
It could be adapted to strings, but I'm not sure it would work very well. It's not much different from a binary search, except that, when finding the mid-point of the "half" it's working on, it uses linear interpolation rather than just going to the median of that "half".

I say that it wouldn't work very well on strings is because, to do the linear interpolation, you'd need to convert your strings to numbers somehow (maybe average their ANSI values, or some other way), and I seriously doubt that'd make much sense for strings. So, rather than increasing efficiency, it's very likely to actually decrease it. As opposed to an actual array of numbers, strings tend to not have any rhyme-or-reason for taking on various values.
I disagree...as I said above, a string is nothing more than numbers underneath...if it was text, ABC is lower than BCD because there is an inherent value for the characters. All interpolation is doing is calculating an estimation of where the required character would be assuming a flat distribution of values, which gets it far closer than binary search would manage. It is likely to perform far better on larger amounts of data, as it drastically reduces the scope of the search each time rather than halving it each time (1m values would become 500k, 250k, 125k, 62.5k, 31.25k etc, if using binary...with interpolation it could go from 1m to anything between 200k-800k, but quickly drops the scope after that...at the same point that binary has 31.25k, I'd say interpolation could be in a 1000 or lower range as it is intelligently looking).

I'd be inclined to say that if it was trying to work with binary strings (0-255) then it could intelligently get the scope down to ~1000 in an average of 4-5 checks (check 1 looks at the value of the first char and divides the data into 256 blocks then goes to that char's pointer. If it is higher (let's say it was looking for char value 64, and found 70), it moves down by how many higher it is (back 6 blocks in this example), keeps repeating that until it finds the ballpark area it is looking for. Getting the scope down even further by using a binary search at THIS point using the range already calculated, it shouldn't take more than 10-15 reads of a string.

I considered working on trying to come up with an example, but am too busy with my own app to do it :-)

7. ## Re: Interpolation Search

Originally Posted by SmUX2k
... a string is nothing more than numbers underneath...
I don't disagree, but "getting" that number is possibly going to be expensive (CPU-wise), depending on how you do it. If you're willing to work with only the first 4 characters of a string, I suppose you could use StrPtr() and GetMem8, and quickly copy those 4 characters into a Currency type. And then use the Currency types for your comparisons.

That does assume that all your VB6 strings will be at least 3 characters (knowing they're UCS-2 unicode, and always terminated with two 0x00 bytes). It also assumes a straight-up bitwise comparison of your strings will be ok (which often isn't what's wanted with strings).

8. ## Re: Interpolation Search

Elroy, have you ever looked at Radix sort? I know it isn't related to this topic, but how it *works* is closely related to this. With a Radix sort the data is sorted by the least significant number (the digit), and then sorted by the next most significant number (the tens) for each of the digits and then the next most significant (the 100s) etc.

We've already established that the data is sorted, so when we're seeking through the arrays to find the golden value we don't NEED an all-encompassing one number per string, we just need to look at the least significant character (in numbers it's the last, in strings it's the first). Let's say we have 26,000 entries, and the entries are all A-Z (uppercase alphabet). Let's say we were looking for "GIANT"...by my logic, we would divide 26,000 by 26 (can see why I chose that number for the example) and multiply it by the "value" of the first character in GIANT...G = 7. So, we jump to the 7000th entry in the array and look at the least significant byte. Let's say it's E...so we add 2000 to the jump and go to the 9000th entry and SHOULD be closer to G. Let's say it's H...starting to get a little complicated now as E is at 7000 and H is at 9000 so we need to divide 2000 by 3 (666) and multiply by 2 to get to where G should logically be between E and H. Around 8332 SHOULD be G. We're in the ballpark area now for where we need to be looking, and it shouldn't take too many educated guesses to jump around between 7000-9000 to find where the start and end of G is...and while we're doing this we could be comparing these pointers to the required target.

My point here is you don't NEED to work with any more than 1 character in a string at a time...just 1 character can get your scope down considerably, and each further character brings it down exponentially. You could have (in theory) 500m entries and it would take less than a couple of hundred bounces around to get the scope down below a range of ~500 or less. I would even dare to suggest that it would only take about 50 bounces to achieve that level of accuracy as long as all 500m strings are in order.

As I said above, I would write an example but I'm busy with my app...and I am actually getting somewhere extremely slowly (further ahead than previously, which is good!). Perhaps if I feel up to a break from my app tomorrow I will code an example :-)

9. ## Re: Interpolation Search

Interpolation-Search can "blow up in your face" timing-wise (compared to "straight halfing the interval"),
when your "KeyData" is not equally distributed.

E.g. when you have 100000 Keys, which are prefixed with "Key_" (Key_1 to Key_100000).
Or when you have an ISO-Date-String (YYYY-MM-DD hh:nn:ss) as Key, with 100000 entries (all in the same year and month).

With a plain binary search, you are guaranteed to end the search -
after max 17 steps (for 100000 entries) no matter how your Keys are distributed -

With Interpolation-search you might need many, many more steps -
when your Keys are relatively constant in the first few chars (like e.g. Iso-Date-Strings).

Olaf

10. ## Re: Interpolation Search

Originally Posted by Schmidt
Interpolation-Search can "blow up in your face" timing-wise (compared to "straight halfing the interval"),
when your "KeyData" is not equally distributed.

E.g. when you have 100000 Keys, which are prefixed with "Key_" (Key_1 to Key_100000).
You've piqued my interest, see below

Originally Posted by Schmidt
Or when you have an ISO-Date-String (YYYY-MM-DD hh:nn:ss) as Key, with 100000 entries (all in the same year and month).
You mean like...I dunno...stock tick data?

Originally Posted by Schmidt
With a plain binary search, you are guaranteed to end the search -
after max 17 steps (for 100000 entries) no matter how your Keys are distributed -

With Interpolation-search you might need many, many more steps -
when your Keys are relatively constant in the first few chars (like e.g. Iso-Date-Strings).
You've piqued my interest here, mostly because my iteration of the algorithm wouldn't fall foul of this problem. The comparison is single char and moves along the entry, so a start point of 2022-02-01 10:30:00 and end point of 2022-02-01 10:30:01 wouldn't cause any issues because the algorithm would look at the first char and see it's the same, and keep moving on a char until it found a difference. You MIGHT call each of these checks a "step", in which case it will fail with a high score, but I would also argue then that a comparison check of entry against target should be scored based on the number of characters in it :-)

The reason why this comparison is important is because of what importance is assigned to the difference in the byte and how if the bytes were A and Z there would potentially be a reduction in the range of results by 1 in 26 (or more as it depends on the character set being used in the data). Whereas the binary search algorithm halves it in one iteration, interpolation has a good chance of reducing it multitudes further with some slightly more complicated steps. Sure, binary is reliable in its results, every iteration halves the target, but this is why I said use the interpolation to get the dataset down to a much smaller value then use binary search to finish the job...it is probably more reliable in the details.

I'm interested though (mostly because of stock data, though I don't really search strings like that...too memory intensive :-) ) and may look into my iteration later as I had lots of progress yesterday with my app (redoing the processing code, and it indexes split points for a 2.5GB file in ~4 seconds then loads each split point of data, each day, in a total of 2 seconds...this is from ram drive though)...I have a basic idea of what it needs to do.

11. ## Re: Interpolation Search

Truth be told, just thinking about this ... I could see where "finding the first search pivot point" using linear interpolation with a (maybe 10%) bias toward the center could be useful. And then binary search (i.e., halving to find subsequent pivot points) from that point forward.

Done that way, if your distribution is fairly flat, you'd approximately go straight to the item you're searching on the first try, and then binary search only a small "half". But, that 10% bias toward the center would be important so you didn't wind up searching the larger half after the first pivot point.

Done that way, you'd get a bit of the benefit of both approaches, and it wouldn't ever "blow up in your face", as Olaf says.

But even with that, you're only going to reduce your "looks" (i.e., comparisons) by only one at the most. And most of the "looks" happen when you're zeroing in, rather than when you're getting started.

Yeah, I'll stick with the idea that binary searches are pretty good. And if I need something more sophisticated, I'll start looking into btree (multiple nodes in each branch) or hash table lookups.

12. ## Re: Interpolation Search

Ohhh, here's something interesting. I just did this in Gemini:

My Question: What mechanism for indexing does MS Access use?

MS Access utilizes a type of indexing known as B-Tree indexing for its tables. This method structures data in a hierarchical tree-like format, allowing for efficient searching and sorting operations. Here's a breakdown of how it works:

Data Organization: Each level of the B-Tree holds data entries. These entries consist of a search key (the indexed field value) and a pointer to the actual data record's location.

Searching: When searching for a specific value, the B-Tree is traversed from the root level downwards. At each level, the search key is compared to the keys in the entries. If a match is found, the pointer directs access to the desired record. If not, the search continues down the appropriate branch based on the comparison result.

Benefits: B-Tree indexing offers several advantages:

Fast Searching: Compared to scanning the entire table, navigating the B-Tree structure significantly reduces the time to locate specific records.
Efficient Sorting: B-Trees can be traversed in a specific order to retrieve data in a sorted manner, enhancing the performance of sorting operations based on the indexed field(s).
I also asked about Oracle SQL servers, and it said they use B-Tree as well.

13. ## Re: Interpolation Search

Originally Posted by Elroy
Truth be told, just thinking about this ... I could see where "finding the first search pivot point" using linear interpolation with a (maybe 10%) bias toward the center could be useful. And then binary search (i.e., halving to find subsequent pivot points) from that point forward.

Done that way, if your distribution is fairly flat, you'd approximately go straight to the item you're searching on the first try, and then binary search only a small "half". But, that 10% bias toward the center would be important so you didn't wind up searching the larger half after the first pivot point.
Not sure if you "get" interpolation...you've just described a binary search with a bias...you mention halving the search range. If the data was capitalised alphabet only, you would divide the range down by a factor of 26, not 2, on your first check (if it actually hits its target...but it would take binary search 5 tries to reach a slightly better result (a factor of 32) so you have another 4 tries to narrow down the location before it becomes less efficient. Imagine how much it would divide the range down if it was binary (easy one, that...a factor of 256, in theory) :-)

Originally Posted by Elroy
Done that way, you'd get a bit of the benefit of both approaches, and it wouldn't ever "blow up in your face", as Olaf says.

But even with that, you're only going to reduce your "looks" (i.e., comparisons) by only one at the most. And most of the "looks" happen when you're zeroing in, rather than when you're getting started.
The zeroing in is the real problem, and I suspect interpolation isn't going to cut it there...I would rather use binary search past a certain point...but no, it should reduce the number of looks by a decent percentage. With interpolation you don't have a guarantee on the minimum number of looks needed, while with binary you are pretty definitely going to always halve the range with each check so it has a more reliable limit...I suspect that with interpolation the number of looks will be less more often. Possibly even fewer if it's a mix of the two methods.

Don't worry though, when I do decide to code it I will design a benchmarking system that will test every single target position and calculate how many looks (and, more importantly, how quickly) it takes each time for the two different search methods...and we can always add others if people want to see how their search algorithms stack up against them both :-)

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•