Let's face it: VB's Collection is buggy. At least on my machine under Windows 8, using VB6 Professional. You're welcome to check, if this is also the case for yourself, on your platform.
Here's how I can reproducibly enforce an error on my platform.
(1) Create a dynamic 0-based string array with n elements (0...n-1).
(2) Create a new collection.
(3) Fill the string array with random 96 bit sequences by concatenating 6 Unicode characters of: ChrW(Int(Rnd() * 65535))
(4) Add into the collection all n elements and give as key the random 96 bit sequences stored in the string array.
(5) Retrieve all n elements from the collection by key as per the string array.
Here's how to reproduce the error:
(1) Let the number of items be n=13366.
(2) Run the shown code in the IDE. The final MsgBox will pop up and say "Done".
(3) Compile into an exe.
(4) Run the compilation. The final MsgBox will pop up and say "Done".
So, all is well. Or isn't it?
(1) Let the number of items now be n=13367 (one more than before).
(2) Run the shown code in the IDE. The final MsgBox will pop up and say "Done".
(3) Compile into an exe.
(4) Run the compilation. Runtime error 5 will be raised.
Obviously, the 13367th item (i=13366) causes a problem.
To find out, what element triggers the RTE, uncomment the 3 commented lines at position 1 in the Main routine. They will produce a log file on your desktop, first writing the key written last, then logging what is attempted to be retrieved, and after successful retrieval reporting the success. My log file looks like this:
No success on item 4217 (the 4218th item), here a run time error 5 is raised. But in the compilation only, in the IDE all is well.
Here's the code to check if you can reproduce this error also on your platform.
In the LOGFILE constant, you need to replace ... by the user name to point your desktop directory. (Right-click on any item on your desktop and choose Properties to find the location). The code will then produce a text file named Log.txt on your desktop, which you can open with any editor.
Enjoy.
Code:
Option Explicit
Private Const ITEMS As Long = 13367
'Adding 13367+ items (i=0...13366+) as below produces an access error in the
'4218th element (i=4217), adding less items grants access to all of them.
'This is element i=13366: 0x0736 1196 B6D2 EA09 97A0 A928
'This is element i= 4217: 0x1192 BB0B 80D9 D8BA 0534 5374
Private Const LOGFILE As String = "C:\Users\...\Desktop\Log.txt"
Private oColl As Collection
Private i As Long, j As Long
Private sKey As String
Private asKeys() As String
Sub Main()
ReDim asKeys(0 To ITEMS - 1) As String
Set oColl = New Collection
'Generating ITEMS keys.
For i = 0 To ITEMS - 1
For j = 1 To 6
asKeys(i) = asKeys(i) & ChrW(Int(Rnd() * 65535))
Next j
Next i
'Adding ITEMS key into collection.
For i = 0 To ITEMS - 1
' If i = 13366 Then LogCause 'Log error cause item i = 13366.
oColl.Add i, asKeys(i)
Next i
'Retrieving all keys from collection.
For i = 0 To ITEMS - 1
' LogStart 'Log access attempt.
oColl.Item (asKeys(i)) 'Attempt to access.
' LogSuccess 'Log successful attempt.
Next i
Set oColl = Nothing
MsgBox "Done"
End Sub
Sub LogCause()
Open LOGFILE For Append As #1
Print #1, i, "Error cause adding this item: ";
For j = 1 To 6
Print #1, Right("000" & Hex$(AscW(Mid$(asKeys(i), j, 1))), 4); " ";
Next j
Print #1,
Close 1
End Sub
Sub LogStart()
Open LOGFILE For Append As #1
Print #1, i, "Attempt to access: ";
For j = 1 To 6
Print #1, Right("000" & Hex$(AscW(Mid$(asKeys(i), j, 1))), 4); " ";
Next j
Print #1, "- ";
Close 1
End Sub
Sub LogSuccess()
Open LOGFILE For Append As #1
Print #1, "Success"
Close 1
End Sub
I use collections quite a bit, and I've never had any problems with them.
I set up your code on my computer (Intel i7, Win7-64-bit) and it ran fine, both IDE and compiled.
I'm not really up for tweaking it and trying to reproduce a similar problem. But I do believe I know what happened. We can both agree that the Collection Key must be unique, right? Quote from MSDN:
key
Optional. A unique string expression that specifies a key string that can be used, instead of a positional index, to access a member of the collection.
Yes, I believe that you attempted to add a key that was a duplicate.
You might ask how there's a difference between the IDE and an executable. Well, let's look at a piece of your code...
Code:
...ChrW(Int(Rnd() * 65535))
Alright, Rnd() is suppose to return the same sequence of numbers, and I believe that it does. And that's true whether it's compiled or not. However, what it returns is an IEEE single. Therefore, when the multiplication is done, it's going to typecast your 65535 to an IEEE single, and then do the multiplication. This is where differences can come in. I'm not entirely certain how the IDE does IEEE math, but I suspect it uses the Floating-Point-Coprocessor that's built into all contemporary CPUs. However, a compiled VB6 program may take certain steps that cause small differences. Most specifically, there's a default "Safe Pentium FDIV Check" that's built into VB6. This was an old floating point bug that was in some old Pentium CPU chips. You can turn this off under "Advanced Optimizations". There are also a couple of other settings which may have slight changes in the way floating point math is done.
Ok, so how does this affect what you're doing? Well, when Rnd() * 65535 results in a number that's very close to an integer, there may be differences between the IDE and compiled. For instance, if the IDE returns 356.9999! and compiled returns 357.0001!, the Int() function will return different values. Therefore, there can be differences.
This would all be easy to check. When you get an error, report your key, and then attempt to use that key to see if the item already exists in the collection. I'll bet that you'll find that it does.
Also, you might say that you're producing a random twelve-byte sequence to serve as the keys, so how can there be a duplicate after only 13367 attempts. I'm also not going to work out the factorials, but what people don't realize is that it's much easier than people think to find duplicates when you're allowing any two to count. For instance, how many people do you think you need to have in a room to have a better than 50/50 chance that two will have the same birthday? The answer is 23 people. If you get 75 people in the room, there's better than a 99.9% chance that two will have the same birthday.
Therefore, I'm thinking the bug has nothing to do with Collections. In fact, technically, if what I'm saying turns out to be correct, there's no bug at all.
Regards,
Elroy
EDIT1: Also, just staring at that Int(Rnd() * 65535) line, you're doing two typecasts in there. One to typecast 65535 from a Long to a Single. And then, after the multiplication is done, Int() is expecting a Double, so the product will then be typecast from a Single to a Double. To make things more precise, you might want to specify 65535 as a Double (65535#). And then, to be a bit more explicit, you may want to force-typecast Rnd() as a Double (CDbl(Rnd())). And also, since: 0<= Rnd() < 1, i.e., Rnd() never returns ONE, but can return ZERO, the constant literal should be 65536, not 65535. So, putting all that together, things would be better if you had: Int(CDbl(Rnd()) * 65536#)
EDIT2: I don't think this is a problem, but it also concerns me that we're using the entire bit-range of Unicode. VB6 uses the UCS-2 subset of UTF-16 Unicode. This is a subset that has only-and-all two byte characters. However, in this UCS-2 subset, the "official" range of characters is covered in hex-codes from &h0001 to &hD7FF. &h0000 is specifically reserved for an end-of-string indicator, and anything from &hD800 to &hFFFF is outside of the UCS-2 range, and is possibly used for characters that are longer than two bytes. I seriously doubt that Collection keys examine the actual Unicode characters (preferring to just treat them as bits), but I'm not absolutely certain about this.
Last edited by Elroy; Aug 25th, 2016 at 09:14 AM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
I can't see how the hardware or the OS makes any difference here and it is a bit shocking to think people even suspect those as factors at all. Collections are so widely used in VB6 programs and VB6 programs have been in broad and heavy use for so many years any such "flaw" would be well known by now.
If anything the failing test described so chaotically above probably stems from misunderstanding and misuse. Collection key values are only defined for case-insensitive printable text strings anyway. Trying to use random garbage as keys could fail in any number of ways, including far more key duplication than you might think due to the way raw keys are processed for case-insensitivity before hashing.
I've never used collections in a way that case insensitivity is a problem, but that clearly shows that they're looking at characters and not just bits, when it comes to the keys.
Therefore, anytime you produce a ChrW$(&h0), there's going to be a potential problem. Assuming you're spanning the entire &h0 to &hFFFF range and producing six of these per loop, this will happen approximately once in every 10922 iterations. That's not terribly far off from when you say it fails.
Also, all the uppercase/lowercase conversion it's going to be doing within the entire Unicode range, who knows how many problems this is producing.
To produce random keys worth anything at all, you need to adjust your loop to look something like the following:
Code:
For j = 1 To 6
asKeys(i) = asKeys(i) & ChrW(Int(CDbl(Rnd()) * CDbl(&HD7FF)) + 1)
' where Int(CDbl(Rnd()) * CDbl(&HD7FF)) + 1 will return &h0001 to &hD7FF (Valid two-byte VB6 Unicode range)
Next j
asKeys(i) = UCase$(asKeys(i))
Also, the CPU you actually have will have nothing to do with whether or not the VB6 compiler accommodates for the Pentium FDIV bug. Let's not forget that VB6 is compiling programs for potential distribution (possibly onto Pentiums).
I'm with dilettante regarding any discussions of OSs or CPUs. I don't see how this could have anything to do with anything.
Regarding the actual error, can you provide us with a piece of code that produces the error for you? As stated above, I copied and pasted your code in post #2 and got no error, IDE nor compiled. Are you saying that that exact piece of code raises an error on you computer?
Regards,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
I doubt that control characters get stripped but there is no documentation saying how they get treated. For all we know the first step might be to convert to lowercase in a way that expands any ligatures and then converts to ANSI before hashing.
Possible use of ANSI might be one reason why Collection objects are not persistable, since moving a persisted Collection between locales could wreak havoc.
If you want some sort of hash table that can accept arbitrary binary keys either convert your keys to hex or Base32 strings or else look at one of the many other hash table implementations available.
Also, dilettante, you're right. I didn't think about control characters. Therefore, it'd be better if we added &h20 to get things off of them. The ChrW() would become:
Code:
ChrW(Int(CDbl(Rnd()) * CDbl(&HD7E0)) + &h20)
I've never messed with it that much in VB6, but here's a link to a BASE64 encode/decode routine. I'm sure there are others, and I'd certainly test well before using any of these, but this would solve the problem of ucase/lcase and non-allowed characters. Using this, you could go back to the &h0 to &HFFFF range.
EDIT1: Again, Donar, you haven't shown us a problem with Collections. What you have illustrated is that Collection keys are truly treated as strings of Unicode characters (and not just as bits). Therefore, we must be somewhat careful when constructing these strings to be used as keys.
Last edited by Elroy; Aug 25th, 2016 at 09:55 AM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Ahhh, yep, that's right. These Collection keys actually are rather restrictive when you start getting into it. Just thinking about it, a very simple, although not quite as efficient, way to do it would be to use BASE16, which is essentially the Hex$() function. BASE32, as does BASE64, splits-and-spans bytes. Whereas BASE16 works quite nicely with bytes, with each BASE16 character making up a nibble.
Just because it was easy to do, and that I neglected to realize that BASE64 still uses upper & lower case, I knocked out a BASE16 procedure. Also, I did a quick search for a BASE32 procedure and didn't find one (and didn't want to write one and have to deal with slicing and dicing bytes):
Code:
Option Explicit
Public Function Base16Encode(s As String) As String
' This is just a "quick" base 16 procedure.
' Regarding high-low memory, it reverses the nibbles in the characters,
' but that doesn't matter so long as the decoder knows this.
Dim i As Long
For i = 1 To Len(s)
Base16Encode = Base16Encode & Right$("0000" & Hex$(AscW(Mid$(s, i, 1))), 4)
Next i
End Function
Public Function Base16Decode(s As String) As String
' Designed to be used only with Base16Encode.
Dim i As Long
For i = 1 To Len(s) Step 4
Base16Decode = Base16Decode & ChrW$(Val("&h" & Mid$(s, i, 4)))
Next i
End Function
Private Sub Form_Load()
' quick test
Dim s As String
s = "asdf" & ChrW(&HFEDC)
MsgBox s
s = Base16Encode(s)
MsgBox s
s = Base16Decode(s)
MsgBox s
End Sub
Donar, use those as wrappers (especially Base16Encode) when putting keys into a collection, and/or using a key to fetch an item. If you wish to loop through the collection and see the keys, you'll then use Base16Decode.
All of this will, at least, allow you to know more clearly when you've hit a duplicate key, especially if you look at the BASE16 encoded values when inspecting for duplicates.
Regards,
Elroy
Last edited by Elroy; Aug 25th, 2016 at 11:15 AM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Well I'm not sure we have any clue what the goal was (if any). Perhaps it was about 96 bit binary keys, or perhaps that was just an experiment.
I'm not a big fan of hex for that since it might not make good use of the hashing algorithm used and the longer keys it produces might have an impact on hash collisions too. If there is a real need for a hashed-access data structure accepting 96 bit binary keys there are lots of samples that could be modified for that.
But if there is a Collection flaw we've all missed for 18 years it would be nice to know about it.
@dilettante: "hashing algorithm?" You think collection keys are hashed?
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Hmmm, well, LaVolpe figured out how to retrieve the keys. So, if they are hashed (which I doubt), it's certainly not a lossy hash.
Code:
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDest As Any, pSource As Any, ByVal ByteLen As Long)
Public Function CollectionItemKey(idx As Long, Coll As Collection) As String
' Original by LaVolpe.
Dim i As Long
Dim ptr As Long
Dim sKey As String
'
If Coll Is Nothing Then
Err.Raise 91
Exit Function
End If
'
If idx < 1 Or idx > Coll.Count Then
Err.Raise 9
Exit Function
End If
'
If idx <= Coll.Count / 2 Then
CopyMemory ptr, ByVal ObjPtr(Coll) + 24, 4
For i = 2 To idx
CopyMemory ptr, ByVal ptr + 24, 4
Next i
Else
CopyMemory ptr, ByVal ObjPtr(Coll) + 28, 4
For i = Coll.Count - 1 To idx Step -1
CopyMemory ptr, ByVal ptr + 20, 4
Next i
End If
'
i = StrPtr(sKey)
CopyMemory ByVal VarPtr(sKey), ByVal ptr + 16, 4
CollectionItemKey = sKey
CopyMemory ByVal VarPtr(sKey), i, 4
End Function
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
define lossy hash? if there's a hash collision, you just check the next entry for an exact hit.
fyi, Lavolpe didn't come up with how to collection retrieve keys... I have no idea who the first person to dissect the internals was.
Code:
Private Type VBCOLLECTION
pVTable As Long
Unk0 As Long
Unk1 As Long
Unk2 As Long
Count As Long
Unk3 As Long
First As Long
Last As Long
End Type
Private Type VBCOLLECTIONITEM
Data As Variant
Key As String
Prev As Long
Next As Long
End Type
an easy way to test if it's a hash, is to check rerieval time via key, vs how many items are in the collection.
edit:
Code:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' Collection Helpers '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Public Function Keys(Col As Collection) As Variant()
If (Col.Count = 0) Then Exit Function
Dim ColInternal As VBCOLLECTION
CopyMemory ColInternal, ByVal ObjPtr(Col), LenB(ColInternal)
ReDim RetKeys(1 To ColInternal.Count) As Variant
Dim ColItem() As VBCOLLECTIONITEM
With New CFixed: Call .Init(VarPtrArray(ColItem), ColInternal.First)
Dim ItemIndex As Long
For ItemIndex = 1 To ColInternal.Count
If (StrPtr(ColItem(0).Key)) Then
RetKeys(ItemIndex) = ColItem(0).Key
Else
RetKeys(ItemIndex) = ItemIndex
End If
.Ref = ColItem(0).Next
Next
End With
Keys = RetKeys
End Function
Last edited by DEXWERX; Aug 25th, 2016 at 02:42 PM.
A way of getting a number from an object so it can be stored in a Hashtable.
If the number isn't allowed to infinitely grow, but the string from which it's derived is, then it will, by definition be a lossy hashcode.
I see a hash code as a form of compression. It's taking a large chunk of data and making it into something that's smaller and more manageable, and probably numeric. There would be several ways to go about doing this, some lossy and some lossless.
I was first introduced to the Lossy and Lossless terms with respect to sound, image, and video compression. For instance, with respect to images, bitmaps are lossless and jpegs are lossy. In other words, once you convert a bitmap to a jpeg, there's no way to the the absolute same bitmap back from that jpeg. However, if you turn a jpeg into a bitmap, and then back into a jpeg again, if the exact same encoding/decoding is used, you can get back to the same jpeg because a bitmap is lossless.
However, DEXWERX, I suspect you may know all of that, and it's strange that you asked.
Therefore, if Collection keys are using a hash code (which I seriously doubt), then it's a lossless hashcode because it's possible to recover the precise string that went into making it.
Regarding credit to LaVolpe, all I know is that the code I have has his name attached to it as the author. Maybe he'll come back around and affirm-or-deny his authorship.
Regards,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
ah... The piece your missing is how a hashtable is implemented using a hash function. You can implement a hashtable just fine with a very lossy hash function. The confusion comes from what you are calling a hash, and what others are calling a hash.
a Hash Code, a Hash Function, or a Hash Table which is what we are talking about when we say the VB Collection class is a Hash (table)... https://en.wikipedia.org/wiki/Hash_table
I Do remember someone from Microsoft in some thread saying the VB Collection is a hashtable, and that the Dictionary class was created to address some of it's shortcomings, but really just kind of did it half way... We were going to get a new Hashtable in VB7, but obviously that never happened. *shrugs* I'm also not sure if any of the VB7 Hashtable code made it's way into the Scripting Dictionary.
I personally just use my own functions to fill in some of functionality missing in Collection, instead of using a Dictionary.
Last edited by DEXWERX; Aug 25th, 2016 at 03:36 PM.
@DEXWERX: I suppose, if your look-up routines are smart enough. However, if we're always going back to double-check the actual key (in its string form), progressing until we find an exact match, then there's no problem with using BASE16, which is what started this discussion in the first place. DEXWERX, this entire thread is about something that's "supposedly" broken in Collections.
I've yet to see a piece of code that breaks them on my computer. And I've certainly yet to see a piece of code that would conclusively show a problem with them that's not related to duplicate keys (while still considering that these keys are case insensitive, and possibly not liking things out of a valid Unicode-character range). If hashcodes are to be used for sorting/matching purposes, there's sure no evidence of it in the structures you posted. There is a doubly-linked list (to make insertions and deletions easy, although binary searches more difficult), along with the data and the key-string, and that's about it.
Therefore, all this discussion of hash-codes is off-point, and we just need to provide Donar with a way to produce keys that can clearly be "seen" as a duplicate of another, or not; and that make this obvious by excluding lower-case characters, and any other non-allowed Unicode characters.
Sighs,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Therefore, all this discussion of hash-codes is off-point
You say (or doubt) keys are not hashed so don't upset if someone corrects you.
When you access a Visual Basic collection using a key, a hashing algorithm is used to determine the location of the associated record http://epaperpress.com/vbhash/
Originally Posted by Elroy
Regarding credit to LaVolpe, all I know is that the code I have has his name attached to it as the author. Maybe he'll come back around and affirm-or-deny his authorship.
@Elroy - Misunderstanding all around it seems. Not sure what you mean by "I suppose if your look-up routines are smart enough."
Are you talking in the proverbial sense? in which case - if you look at the Wikipedia link, there are a large number of strategies dealing with hash collisions.
Or are you talking in the personal sense? in which case I use the collection class, specifically so I don't have write any look-up routines?
As for this thread - we're having the most interesting part of the conversation, which is hopefully helpful to the OP, who seems to be a beginning programmer.
The most relevant answers to this thread were posted by Dil, if you want to use a collection class with binary keys - you have to convert the keys to a unique case insensitive unicode string (Hex/Base32 being the easiest to implement). <-- are you calling Hex and Base32 a hash? (They're not, just trying to clarify)
It's great that the OP is coming up with ideas and ways for testing his understanding of the Collection class, but the test is flawed.
Random Chars is not a Unicode string, despite being stored in a String variable. The Collection class wasn't made to handle malformed Unicode.
The Collection class expects printable Unicode character strings, and probably calls API functions to coalesce equivalent printable characters (including case) before hashing.
Maybe he can take the new understanding and create tests for it.
NOTE: What the OP did essentially was Fuzz the collection input. A great way to test for vulnerabilities.
Last edited by DEXWERX; Aug 25th, 2016 at 04:54 PM.
there are a large number of strategies dealing with hash collisions
Yes, precisely. Dex, it just seemed that this thread was going off-the-rails. After a few posts, I assumed you were talking about a hash table that basically had an array of a structure that looked something like the following:
Code:
Type HashItem
HashCode As Long
OrigKey As String
DataStuff As Variant
End Type
And then, we would keep this structure sorted based on HashCode, even when new entries were entered, and/or old ones deleted. And then, to search, we could use binary search (because it's sorted), but then we'd have to back up to make sure we were at the beginning of any duplicate HashCode values, and then traverse forward to see if we could find a matching OrigKey (or get a different HashCode).
That's all certainly one way to use a Hash Code Table. If used in that way, and our algorithms to make a HashCode from OrigKey honored our case-insensitive and possibly no-control-characters requirements, then yes, it could work with Collections.
And that's what I mean by "look-up routines being smart enough".
However, I've seen no evidence that that's the way it's working at all.
And, oh gosh, no. I never meant to imply that BASE32 (or any other base) is a hash code. In some ways, it's the opposite. But basically, it's just different.
Originally Posted by DEXWERX
The Collection class wasn't made to handle malformed Unicode.
Now that's the point I've been attempting to make all along. And, to some degree, what dilettante and I were discussing is how to take an ill-formed Collection key and turn it into a correctly-formed Collection key. He recommended BASE32, and I placed a link for BASE64. dilettante correctly pointed out that BASE64 doesn't solve the problem because it still uses both Upper-and-Lower.
Then, as an easy solution, I recommended BASE16, and also provided some functions to do that. dilettante had some strange objection to BASE16, saying they may not play well with the Collections hash table.
Originally Posted by dilettante
I'm not a big fan of hex for that since it might not make good use of the hashing algorithm used and the longer keys it produces might have an impact on hash collisions too.
This comment was quite puzzling to me, and it's what launched us into a discussion of hash codes.
Sorry to summarize the thread, but it just seemed necessary.
Okay, now, if we all would truly like to do something constructive, we could nail-down what Unicode (UCS-2) characters can safely be used in a Collection key without fear of a collision with another key that's not actually binary identical (such as "a" vs "A"). Also, if we could identify any Unicode (UCS-2) characters that may cause problems in Collection keys (such as &h0000, anything between &h0001 and &h0019, or possibly anything above &hD7FF), that would be useful.
Clearly, the OP is generating a great many keys in these potentially problem areas (or, to use your words, "Fuzz the collection input"). IMHO, Collections are working quite fine. It's just that their keys are truly viewed as strings-of-characters (with some qualifications on what's valid and what's a duplication), rather than just bits. Yes, to nail this down would be useful.
Regards,
Elroy
Last edited by Elroy; Aug 25th, 2016 at 05:31 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Okay, now, if we all would truly like to do something constructive, we could nail-down what Unicode (UCS-2) characters can safely be used in a Collection key without fear of a collision with another key that's not actually binary identical (such as "a" vs "A")
No we can't, it is possible that two different strings generating same hash, it all depend you hashing algorithming how good it is.
Looks you don't understand what hashing mean.
Would someone please show me how hash codes have anything to do with VB6 built-in Collections?
Last edited by Elroy; Aug 25th, 2016 at 06:13 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
I can also reproduce the OP's problem exactly. Copy+pasting his code into a blank project, I also get an inexplicable Error 5 when attempting oColl.Item(asKeys(i)) where i = 4217. As the OP suggested, reducing the number of items added to the collection by 1 (e.g. ITEMS = 13366) makes the error go away (e.g. accessing index 4217 works *just fine*).
So setting aside all discussions of how the Collections class is actually implemented, there's still a good question here, specifically:
1) Why would changing the number of items in a collection cause RTE 5 to appear/disappear on certain .Item() calls?
If this was a key collision problem, wouldn't RTE 457 ("This key is already associated with an element of this collection") be thrown at Collection.Add()? Instead, why is an error thrown during Collection.Item(), and why is it a generic "Invalid procedure call or argument"?
ADDENDUM:
To make this a bit more reproducible, we can use hard-coded Randomize() seeds. Adding "Randomize 4" to the start of the OP's function causes the error to appear at a new index: 2007. The specific key that causes the error is the six-character string (codepoints): 55214, 4548, 49534, 34907, 57762, 15417.
Not all random seeds cause the problem to occur. Checking incremental Randomize seeds, I get the next problematic run with "Randomize 14". That problematic index is 1305, and the problematic string is: 1442, 4569, 58224, 31734, 47744, 22157.
At a glance, I don't see a "magical" Unicode value that might explain the problem, or even a "magical" Unicode range. If VB attempts to perform some kind of Unicode normalization on Key strings (highly doubtful), I'd still expect any problems to occur at the .Add() stage, not the .Item() stage.
Would someone please show me how hash codes have anything to do with VB6 built-in Collections?
Because as far as anyone knows, VB's collection is implemented as a hash-table (or technically, a hybrid hash-table / doubly linked list).
Similarly, for this specific question, changing the number of entries in the collection causes the error to appear/disappear, so a collision explanation (hypothetically) makes a lot of sense.
I suspect that certain Unicode characters somehow cause the hash to be corrupted. The error 5 exception is probably an HRESULT bubbled up from some VB6 runtime or OLE function used to look up an item by key.
I've tried different random number sequences and the problem of course moves. But I can't find a pattern for which characters fail except that they are all > &H00FF. However other characters > &H00FF do not seem to cause a problem.
But changing the keygen character generation to:
AsKeys(I) = AsKeys(I) & Chr$(Int(Rnd() * 256))
... ensures we only use characters from the current ANSI range. This didn't fail for any of the generated keys when I tried it.
Tanner_H, Thank you. At least now I'm getting the error: (Randomize 14, as is clearly seen)
As can clearly be seen, it happens for me on the Add method, and when i=4837. Now, I am getting RTE 457, the same as everyone else.
Now, you say you're getting it at i=1305, and in the Item method.
In an attempt to see this, I reduced the total items to 4000. However, this time, the code ran to completeion ("Done" message).
I'm not going to explore why mine stopped at i=4837. Hopefully, my next post will thoroughly explain that. It's still puzzling to me why I can't reproduce the error on the Item method.
Regards,
Elroy
EDIT1: Just to check a compiled version, I altered the code like this:
Code:
For i = 0 To ITEMS - 1
' If i = 13366 Then LogCause 'Log error cause item i = 13366.
On Error Resume Next
oColl.Add i, asKeys(i)
If Err Then MsgBox i: Stop
Next i
... and I got a message box reporting 4837. Therefore, my IDE executed p-code and compiled are working exactly the same.
EDIT2: Not sure we can call a doubly-linked loop a hash table, but I truly don't want to argue that point.
Last edited by Elroy; Aug 25th, 2016 at 10:02 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
No, it is believed to be a doubly-linked list and a hash table, i.e. two data structures used together internally. We know there is a hash table involved otherwise performance would be far worse than it is for retrieval by key.
Alright, I'm just going to post my particular progress in figuring this out.
Here's how I modified the code, just to consolidate it and also add my own debug functions:
Code:
Option Explicit
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDest As Any, pSource As Any, ByVal ByteLen As Long)
Private Declare Function OpenClipboard Lib "user32.dll" (ByVal hWnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32.dll" () As Long
Private Declare Function CloseClipboard Lib "user32.dll" () As Long
Private Declare Function GlobalAlloc Lib "kernel32.dll" (ByVal wFlags As Long, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32.dll" (ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32.dll" (ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32.dll" Alias "lstrcpyW" (ByVal lpString1 As Long, ByVal lpString2 As Long) As Long
Private Declare Function SetClipboardData Lib "user32.dll" (ByVal wFormat As Long, ByVal hMem As Long) As Long
'
Private Sub Form_Load()
Dim i As Long, j As Long
Dim sKey As String
Dim asKeys() As String
Dim oColl As Collection
Randomize 14
ReDim asKeys(0 To 4837) As String
Set oColl = New Collection
'Generating ITEMS keys.
For i = 0 To 4837
For j = 1 To 6
asKeys(i) = asKeys(i) & ChrW(Int(Rnd() * 65535))
Next j
Next i
'Adding ITEMS key into collection.
For i = 0 To 4836 ' <----------------------- it will crash on 4837.
oColl.Add i, asKeys(i)
Next i
' NOW DELETE KEYS UNTIL THE ERROR GOES AWAY <--------- FINDING THE DUPLICATE!!!!!!!!!!!
On Error Resume Next
Do
oColl.Add 4837, asKeys(4837)
If Err = 0 Then
Debug.Print HexKey(asKeys(4837))
Debug.Print HexKey(sKey)
Debug.Print "The StrComp: " & StrComp(asKeys(4837), sKey, vbTextCompare)
Stop
End If
sKey = CollectionItemKey(1, oColl)
oColl.Remove 1
Err.Clear
Loop
MsgBox "Done"
End Sub
Function HexKey(sKey As String) As String
Dim j As Long
For j = 1 To 6
HexKey = HexKey & Right("000" & Hex$(AscW(Mid$(sKey, j, 1))), 4) & " "
Next j
End Function
Public Function CollectionItemKey(idx As Long, Coll As Collection) As String
' Original by LaVolpe.
Dim i As Long
Dim ptr As Long
Dim sKey As String
'
If Coll Is Nothing Then
Err.Raise 91
Exit Function
End If
'
If idx < 1 Or idx > Coll.Count Then
Err.Raise 9
Exit Function
End If
'
If idx <= Coll.Count / 2 Then ' Go from beginning or end.
CopyMemory ptr, ByVal ObjPtr(Coll) + &H18, 4 ' VbCollection.First
For i = 2 To idx
CopyMemory ptr, ByVal ptr + &H18, 4 ' VbCollectionItem.Next
Next i
Else ' End is quicker.
CopyMemory ptr, ByVal ObjPtr(Coll) + &H1C, 4 ' VbCollection.Last
For i = Coll.Count - 1 To idx Step -1
CopyMemory ptr, ByVal ptr + &H14, 4 ' VbCollectionItem.Prev
Next i
End If
'
i = StrPtr(sKey)
CopyMemory ByVal VarPtr(sKey), ByVal ptr + 16, 4
CollectionItemKey = sKey
CopyMemory ByVal VarPtr(sKey), i, 4
End Function
Public Property Let UniClipboard(sUniText As String)
' Puts a VB string in the clipboard without converting it to ASCII.
Dim iStrPtr As Long
Dim iLen As Long
Dim iLock As Long
Const GMEM_MOVEABLE As Long = &H2
Const GMEM_ZEROINIT As Long = &H40
Const CF_UNICODETEXT As Long = &HD
'
OpenClipboard 0&
EmptyClipboard
iLen = LenB(sUniText) + 2&
iStrPtr = GlobalAlloc(GMEM_MOVEABLE Or GMEM_ZEROINIT, iLen)
iLock = GlobalLock(iStrPtr)
lstrcpy iLock, StrPtr(sUniText)
GlobalUnlock iStrPtr
SetClipboardData CF_UNICODETEXT, iStrPtr
CloseClipboard
End Property
The code does stop and reports the following in my Immediate window:
Just FYI, a StrComp=0 indicates it sees the two strings as equal.
The first thing I noticed was that D9A2 is outside of the valid Unicode range, and it's also the first character of the key. Therefore, while still in "Stop" mode, I tried this in the Immediate window:
Voila, it still reported equality between the strings. Notice that this was the value I deleted (and not the one I was trying to add).
Now, I did more checking and found that the D9A2 2770 4DE5 1AF0 A28B 2F86 key was created in the key creation loop at i=3730.
It's still a puzzle to me why adding this specific key didn't cause a problem, and that it waited until adding where i=4837 caused the problem.
I must admit that it's still quite curious. The first thing I'm going to do is restrict the Unicode values to values between &h0020 and &hD7FF (the valid Unicode range).
Last edited by Elroy; Aug 25th, 2016 at 10:39 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Okay, continuing on, I altered the way the keys were generated to restrict to valid Unicode. I provided some code above but it wasn't quite correct because I neglected to consider the sign bit of Longs. Here's the code I'm now using to generate valid Unicode characters:
Code:
'Generating ITEMS keys.
For i = 0 To ITEMS - 1
For j = 1 To 6
asKeys(i) = asKeys(i) & ChrW(Int(CDbl(Rnd()) * (CDbl(&H67E0) + CDbl(&H7000))) + &H20)
Next j
Next i
And follows is all the code I ran. Notice that I didn't bother with Randomize. I just set ITEMS = 100000, and let it go. It managed to crash on oColl.Add when i=7104 (actually the 7105's attempt to add). To sort out what happened, I used a similar approach as post #29. Here's the code:
Code:
Option Explicit
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDest As Any, pSource As Any, ByVal ByteLen As Long)
Private Declare Function OpenClipboard Lib "user32.dll" (ByVal hWnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32.dll" () As Long
Private Declare Function CloseClipboard Lib "user32.dll" () As Long
Private Declare Function GlobalAlloc Lib "kernel32.dll" (ByVal wFlags As Long, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32.dll" (ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32.dll" (ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32.dll" Alias "lstrcpyW" (ByVal lpString1 As Long, ByVal lpString2 As Long) As Long
Private Declare Function SetClipboardData Lib "user32.dll" (ByVal wFormat As Long, ByVal hMem As Long) As Long
Private Const ITEMS As Long = 100000
'Adding 13367+ items (i=0...13366+) as below produces an access error in the
'4218th element (i=4217), adding less items grants access to all of them.
'This is element i=13366: 0x0736 1196 B6D2 EA09 97A0 A928
'This is element i= 4217: 0x1192 BB0B 80D9 D8BA 0534 5374
Private Const LOGFILE As String = "C:\Users\...\Desktop\Log.txt"
Private oColl As Collection
Private i As Long, j As Long
Private sKey As String
Private asKeys() As String
Private Sub Form_Load()
Main
End Sub
Sub Main()
ReDim asKeys(0 To ITEMS - 1) As String
Set oColl = New Collection
'Generating ITEMS keys.
For i = 0 To ITEMS - 1
For j = 1 To 6
asKeys(i) = asKeys(i) & ChrW(Int(CDbl(Rnd()) * (CDbl(&H67E0) + CDbl(&H7000))) + &H20)
Next j
Next i
'Adding ITEMS key into collection.
For i = 0 To 7103 ' <------------------- crashes on 7104
oColl.Add i, asKeys(i)
Next i
' NOW DELETE KEYS UNTIL THE ERROR GOES AWAY <--------- FINDING THE DUPLICATE!!!!!!!!!!!
On Error Resume Next
Do
oColl.Add 7104, asKeys(7104)
If Err = 0 Then
Debug.Print HexKey(asKeys(7104))
Debug.Print HexKey(sKey)
Debug.Print "The StrComp: " & StrComp(asKeys(7104), sKey, vbTextCompare)
Stop
End If
sKey = CollectionItemKey(1, oColl)
oColl.Remove 1
Err.Clear
Loop
MsgBox "Done"
End Sub
Function HexKey(sKey As String) As String
Dim j As Long
For j = 1 To 6
HexKey = HexKey & Right("000" & Hex$(AscW(Mid$(sKey, j, 1))), 4) & " "
Next j
End Function
Public Function CollectionItemKey(idx As Long, Coll As Collection) As String
' Original by LaVolpe.
Dim i As Long
Dim ptr As Long
Dim sKey As String
'
If Coll Is Nothing Then
Err.Raise 91
Exit Function
End If
'
If idx < 1 Or idx > Coll.Count Then
Err.Raise 9
Exit Function
End If
'
If idx <= Coll.Count / 2 Then ' Go from beginning or end.
CopyMemory ptr, ByVal ObjPtr(Coll) + &H18, 4 ' VbCollection.First
For i = 2 To idx
CopyMemory ptr, ByVal ptr + &H18, 4 ' VbCollectionItem.Next
Next i
Else ' End is quicker.
CopyMemory ptr, ByVal ObjPtr(Coll) + &H1C, 4 ' VbCollection.Last
For i = Coll.Count - 1 To idx Step -1
CopyMemory ptr, ByVal ptr + &H14, 4 ' VbCollectionItem.Prev
Next i
End If
'
i = StrPtr(sKey)
CopyMemory ByVal VarPtr(sKey), ByVal ptr + 16, 4
CollectionItemKey = sKey
CopyMemory ByVal VarPtr(sKey), i, 4
End Function
And here's what printed in the Immediate window when it crashed:
As can clearly be seen, it managed to produce a duplicate code.
Therefore, with this limited testing, it seems, if we restrict our Unicode to valid Unicode characters, things start to work correctly.
Next, a timing test to see if there's actually a hash table.
Regards,
Elroy
EDIT1: Also, I was a bit lucky to get an exact duplicate because I neglected to take upper/lower into account.
Last edited by Elroy; Aug 25th, 2016 at 11:23 PM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Hmmm, presence/absence of hash (or maybe sorted table for binary searches) is inconclusive. Here's how I tested:
Code:
Option Explicit
Private Declare Function QueryPerformanceCounter Lib "kernel32" (lpPerformanceCount As Currency) As Long
Private Sub Form_Load()
Dim c As New Collection
Dim i As Long, j As Long
Dim s As String
Dim nStart As Currency
Dim nEnd As Currency
Dim v As Variant
Dim sFstKey As String
Dim sMidKey As String
Dim sEndKey As String
' Create a HUGE collection
'
c.Add "aaa", "aaa"
Const max As Long = 500000
For i = 1 To max
s = ""
For j = 1 To 60
s = s & ChrW(Int(CDbl(Rnd()) * (CDbl(&H67E0) + CDbl(&H7000))) + &H20)
Next j
On Error Resume Next ' We can still hit duplicates because of Upper/Lower case complexities.
c.Add s, s
On Error GoTo 0
' EXTREMELY unlikely that we'll be on a duplicate.
If i = 2 Then sFstKey = s ' Save second key.
If i = max \ 2 Then sMidKey = s ' Save middle key.
If i = (max - 1) Then sEndKey = s ' Save next-to-last key.
Next i
MsgBox c.Count
' Get second key (and time it).
QueryPerformanceCounter nStart
v = c.Item(sFstKey)
QueryPerformanceCounter nEnd
Debug.Print nEnd - nStart
' Get middle key (and time it).
QueryPerformanceCounter nStart
v = c.Item(sMidKey)
QueryPerformanceCounter nEnd
Debug.Print nEnd - nStart
' Get next-to-last key (and time it).
QueryPerformanceCounter nStart
v = c.Item(sEndKey)
QueryPerformanceCounter nEnd
Debug.Print nEnd - nStart
End Sub
The results in the Immediate window were (in cycles):
Code:
0.0019
0.0042
0.0035
I played around with different key max value, as well as the length of the keys. It seemed to have this pattern. Second key found quickest. Middle key found slowest. But the next-to-last key was also slow.
Let's not forget that all dupes are checked when adding, so this is probably a more pure binary (memory) comparison of the keys, if it's spinning through the doubly-linked loop.
Also, I'd think, if it is searching through the doubly-linked loop, it might peek at the end a bit before just starting from the beginning each time.
I must admit, with a 60 character key (120 bytes), I did expect a larger discrepancy than this.
Has someone else done this, possibly in a different way?
Elroy
EDIT1: Also, another thought is that, during idle-time, it works on sorting the list. If this is true, these results would make sense in light of a binary search. A true binary search can't find the second key any faster than the middle key. In fact, it'll find the middle key fastest. I don't know though, and I certainly didn't give it any time to do any idle-time sorting.
EDIT2: On further reflection, sorting doesn't make sense because that'd mess up the numeric index values.
EDIT3: Yet more reflection: It actually could do a binary search, even with just the doubly-linked loop. If it knew that bouncing through pointers was very fast, but that doing long string comparisons was slow, it could spin to the middle of the loop, compare, (say we're too far), spin back to 1/4th the loop, compare, etcetera. That would dramatically reduce the string comparisons, possibly providing the above results. If there is a hash table somewhere, it just seems like someone would have dug it out, and figured out how that structure in post #15 points at it.
Last edited by Elroy; Aug 26th, 2016 at 12:42 AM.
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Let's say you have a hashtable with an initial size of 8, and the backing store is an Array of UDTs.
Code:
Private Type HashTableEntry
Key As String
Value As Variant
End Type
Private HashTable(0 to 7) As HashTableEntry
' Here's My hash function.
Function DoHash(Key As String) As Long
DoHash = AscW(Key) Mod 8
End Function
Function AddEntry(Key, Value)
Dim Index&
For Index = DoHash(Key) To 7
If HashTable(Index).Key = vbNullString Then
HashTable(Index).Key = Key
HashTable(Index).Value = Value
Exit Function
End If
Next
' Err.Raise Out of Mem...
End Function
' Retrieve our Entry
Function GetEntry(Key As String) As Variant
Dim Index&
For Index = DoHash(Key) To 7
If HashTable(Index).Key = Key Then
GetEntry = HashTable(Index).Value
Exit Function
End If
Loop
If GetEntryIndex > 7 Then Err.Raise 5
End Function
Now extrapolate that idea out to table of 100 or 1000 or 10,000 entries. It starts to become clear that even this very crude hash design could be a good deal faster than iterating over the entire table, _as_ the table size increases. Knowing this, how does someone compare the performance of a hash implementation vs an Array or Linked List.
(hopefully this isn't bringing people back to CS101)
Last edited by DEXWERX; Aug 26th, 2016 at 10:22 AM.
I must admit that it's still quite curious. The first thing I'm going to do is restrict the Unicode values to values between &h0020 and &hD7FF (the valid Unicode range).
I'm not sure where the idea arose that those values mark a "valid" Unicode range. 0x0000 is a perfectly valid Unicode char ("null" codepoint) and Windows surrogate pairs use codepoints all the way through 0xDFFF (to define values above 0xFFFF). 0xE000 to 0xFFFF also contain private-use areas (still valid!) and a number of CJK and Arabic glyphs that can be represented in UTF-16 as-is.
"Invalid" Unicode is kind of a fuzzy term, anyway, just like "invalid" ASCII doesn't make much sense. "Normalized" strings are probably a better term, as it is possible for certain Unicode encodings to be used incorrectly and thus produce "invalid" strings. (This can also be checked via API, e.g. https://msdn.microsoft.com/en-us/lib...(v=vs.85).aspx).
It's also possible that this error is caused by faulty use of surrogate-pair markers, which have specific values and must be used in specific ways in UTF-16... but that still wouldn't explain why the problem only occurs with Collections beyond a certain size.
I'm with dilettante:
I suspect that certain Unicode characters somehow cause the hash to be corrupted. The error 5 exception is probably an HRESULT bubbled up from some VB6 runtime or OLE function used to look up an item by key.
Does anyone know what function the Collection object uses for key matching? Does it wrap a generic case-insensitive C function like _wcsicmp, or is it like StrComp + vbTextCompare, where it handles a lot more than just case-insensitivity?
If it's the latter, can we find a way to make StrComp + vbTextCompare throw the same error? Alternatively, is there some way to reverse-engineer which strings were hashed to the problematic table index? I ask because the problem only seems to occur when comparing two *different* keys that have been hashed to the same index. (Otherwise, we'd see the problem always occur with a "problematic" string, instead of only occurring once the Collection grows past a certain size.)
Okay, DEX, with that particular approach to a hash table, the only way I see that it can work is that the "Value" is a collection index number. It gives you the ability to "spin" to a particular spot in the doubly-linked loop, and then to start sequentially searching from there.
And hey, if it gets us all on the same page, I've got no problems with CS101 reviews. I think we do it often in other threads.
I just don't see how this hash-table approach is going to help with an unsorted list. If (or, rather, when) a key hashes to the same hash-code, creating duplicate hash-codes, things get complicated. With your scheme, here's what'll happen:
We'll get a hash-code from our key.
We'll get an index value from the hash table.
We'll use that index value to spin to an entry in our doubly linked list.
We'll check the list's key value with the key value we want.
It won't match, but then, we'll have no idea where the key value is that we want. We won't even know if it's in front of us or behind us.
I'm sorry, but I guess I need day-two of CS101 to see how this would help.
I'm also still worried that these discussions of hash tables are a bit off-point. Have we resolved the issue that Collections should only have keys with Unicode characters in the valid UCS-2 range? And that would be anything from &h20 to &hD7FF. And, even at that, the rather complex issue of upper/lower case isn't considered (as is well illustrated in dilettante's post #5). And beyond that, are we convinced that Collections work quite well? It's a shame that Collections don't have a "UseOnlyBinaryValueOfKeyToCheckDupes" property.
If we're convinced that Collections work well with these UCS-2 and upper/lower issues considered, then let's carry on the CS101 chapter of hash tables, and how they help with these VB6 Collections.
Please, Everyone Have a Wonderful Friday,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
Elroy, I'm not sure if this is helpful, but hash tables based on linked lists are arguably the most common implementation of a hash-table. Wiki covers this in-depth in its section on chaining:
In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed.
@Elroy - If you're questioning how my implementation "could" work - you haven't comprehended how it does work. It's the tiniest and simplest hashtable implementation I could possibly imagine.
Originally Posted by Elroy
I just don't see how this hash-table approach is going to help with an unsorted list. If (or, rather, when) a key hashes to the same hash-code, creating duplicate hash-codes, things get complicated. With your scheme, here's what'll happen:
We'll get a hash-code from our key. Yes
We'll get an index value from the hash table. <--- No, We get and Index Into the Array from the Hash Function
We'll use that index value to spin to an entry in our doubly linked list. No... we have an index into an Array, IE a direct memory location. No spinning involved. This is where all the "performance" comes from. Translating a Hash into a direct memory location without searching.
We'll check the list's key value with the key value we want. Yes
It won't match, but then, we'll have no idea where the key value is that we want. We won't even know if it's in front of us or behind us. <-- We will because it's in front, it's always in front in this implementation, It's in a following slot in front of the already populated one if theres a collision.
Good questions, I guess I forget that things implied by the code still aren't obvious to the uninitiated. Tanner's linked directly to the relevant Wikipedia section on chaining. Our hash function finds the bucket, and if there's a collision, the retrieval checks whatever else is in the bucket. For simplification - The implementation I posted just treats the whole table as a single bucket, and uses the hash function to get close. The point I was trying to illustrate was the importance of using the Hash Function to get to a spot in memory that is close (or exactly where our Entry lives.)
FYI: The only reason the Collection class maintains a linked list, is to iterate over the entries in order, and to find entries by index. This is exactly why the Collection class is so slow at retrieving entries by index. A Collection doesn't have a choice on how to search by index, except to traverse the entire list forward or backwards.
Last edited by DEXWERX; Aug 26th, 2016 at 10:54 AM.
Have we resolved the issue that Collections should only have keys with Unicode characters in the valid UCS-2 range? And that would be anything from &h20 to &hD7FF.
As far as I can tell they are only safe for characters in the current ANSI codepage. Some others may work but then still others can cause the Error 5 problem.
And it isn't merely case-insensitivity: ligatures are treated as equal to their component letters. Some codepages have more and different ligatures than others, for example the German ß glyph.
Okay, that's the last point I don't understand. Not to get defensive, but I have written binary-search routines, doubly-linked loop routines, and hash table routines before. Admittedly, it's been a while because databases just obviate the need to keep those things familiar anymore.
But let me return to your quote. So, you're not talking about the array-of-structures that makes up the doubly-linked list. You can't be. Jumping into that array at an arbitrary point wouldn't make any sense whatsoever. To traverse the double-linked-list as an array would just give you an un-sorted and un-indexed list, not much good for of anything, other than just dumping the data.
So, just to summarize. We presently know about two structures having to do with Collections (one a header and the other an array (the doubly-linked loop)). You (DEX) provided them to us in post #15. And using a bit of ObjPtr magic, we can get to both of those.
So, if I understand correctly, you're proposing that there are two additional structures (one a hash table (small-ish array) and the other a complete sorted array of collection entries). You provided a proposed structure for the hash table:
Originally Posted by DEXWERX
Code:
Private Type HashTableEntry
Key As String ' <---- actually a hash code?
Value As Variant ' <---- where this points into the array structure shown immediately below?
End Type
Private HashTable(0 to 7) As HashTableEntry
And it's now being proposed that there's another structure (that's always maintained as sorted), that may look something like this?
Code:
Private Type SortedEntries
HashCode As String
FullKey As String
CollectionValue As Variant
CollectionIndex As Long
End Type
If this is true (or some approximation), and I'm not saying it's not, there would seem to have to be some pointer to it within the Collection's header structure. If this is true, it's curious that nobody has worked it out. But I will certainly agree that this could speed key searches up. Just having the full sorted table would allow binary searches.
Regards,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.
But let me return to your quote. So, you're not talking about the array-of-structures that makes up the doubly-linked list. You can't be. Jumping into that array at an arbitrary point wouldn't make any sense whatsoever.
Oh but it does. It make complete sense, and is exactly the piece you seem to be missing. Jumping into an array of structures at an "arbitrary point" is exactly how a hash table works.
Behind the scenes of VB's Collection class is an Array of VBCOLLECTIONITEM(s).
Behind the scenes of VB's Collection class is an Array of VBCOLLECTIONITEM(s).
Yes, I think we agree that that much is clear.
So, are you proposing that VB keeps that array sorted (upon .Add)? And that the doubly-linked pointers are just there so the index order doesn't have to agree with the sort order? Actually, I suppose that's possibly an empirical question. So long as we don't do any deleting, the VBCOLLECTIONITEM items may be in contiguous memory, but that's yet another question. I suppose to call it an array, it would need to be, but I don't think we've proven that.
Even though I've called it an array, I guess I was thinking that new VBCOLLECTIONITEM(s) just grabbed another block of memory, and then patched up the pointers, possibly letting the garbage collector clean up things when items are deleted (.remove). Keeping that VBCOLLECTIONITEM(s) array sorted and contiguous wasn't something I contemplated.
Maybe I'll explore answers to those questions, or maybe someone else will.
Regards,
Elroy
Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.