|
-
Mar 29th, 2007, 12:26 PM
#1
[RESOLVED] Random numbers showing bias...help!
First off, I love autonumbers, and almost always use them. For this one project, however, I need my unique identifiers to have exactly six digits, so I wrote some very basic code to generate random numbers.
When converting the data over from the old version to the new one, I first call Randomize Timer. Then I assign new IDs as follows:
begin loop
lngID = Int(900000 * Rnd + 100000)
if lngID is unique, save new ID and exit loop
end loop
This range gives me a six digit number that will never start with 0. The problem is that my distribution is FUBAR. Running it now I get the following distribution based on the first digit of the ID:
Code:
Digit IDs
1 368
2 335
3 292
4 257
5 226
6 177
7 152
8 91
9 34
Lowest ID: 100709
Highest ID: 994612
Running it a second time, I get:
Code:
Digit IDs
1 340
2 333
3 317
4 265
5 225
6 182
7 145
8 94
9 31
Lowest ID: 100619
Highest ID: 982853
Clearly the range is working properly, and it successfully gives me the thousands of unique IDs I need, but the distribution is pissing me off. Not only does it offend my perfectionist nature, but it's messing with the intent of these numbers in the first place.
This system will eventually be managing tens of thousands of files on disk. Too many to store in a single data folder with any kind of decent access speed, so I spread them out over 10 subfolders, each (theoretically) holding 10% of the files.
I can get around the problem by using the last digit instead of the first to define which subfolder a given file belongs in; the last digit spread is uniform. But what is up with this biased distribution on the first digit? I certainly didn't code anything in there to skew the results like this.
Any ideas?
Could it be that the Rnd function is not liking the large numbers?
Could it be that I'm doing two tables at once? (Both showing the same bias.)
Could it be something inherant to the random number generator that only becomes apparent when filtering out the duplicates?
If nobody has any ideas, I'm open to any manual algorithms for generating random numbers.
-
Mar 29th, 2007, 12:31 PM
#2
Re: Random numbers showing bias...help!
Put Randomize in the top of the sub. This will reset the app's seed to Windows Time and should kick most of the noticable bias.
-
Mar 29th, 2007, 12:36 PM
#3
Re: Random numbers showing bias...help!
 Originally Posted by timeshifter
Put Randomize in the top of the sub. This will reset the app's seed to Windows Time and should kick most of the noticable bias.
This problem is in the conversion utility, which is a run once kinda deal. (Though I've certainly been running it a lot, in the end it will only run once.)
The same code exists in the actual application, but I'm assuming if I can fix the problem in the converter, I can apply the same fix to the application.
In both the converter and the application, I call Randomize Timer on startup, and then never again. For the converter that translates into calling Randomize Timer each time the conversion is run.
But seriously, even if I weren't seeding the PRNG with the system time, this distribution is way messed up.
-
Mar 29th, 2007, 12:40 PM
#4
Re: Random numbers showing bias...help!
Drop the Timer off of that statement. It's unnecessary.
Also remember that randomality is a tricky beast, since the probability of me throwing a die twice, on the first set getting a 1 and a 6, on the second set getting two 3's, is identical. It is very difficult to determine if something is biased as it just may be how the random number generator picks the numbers.
-
Mar 29th, 2007, 12:49 PM
#5
Re: Random numbers showing bias...help!
 Originally Posted by timeshifter
Drop the Timer off of that statement. It's unnecessary.
Also remember that randomality is a tricky beast, since the probability of me throwing a die twice, on the first set getting a 1 and a 6, on the second set getting two 3's, is identical. It is very difficult to determine if something is biased as it just may be how the random number generator picks the numbers.
Technically there are two ways to get a 1 and 6 and only one way to get two 3s, so you're twice as likely to get the 1 and 6. But that's neither here nor there.
Running this multiple times I consistently get 10 times as many 1s as 9s. That isn't random, even if I squint.
Anyone have any mathematical PRNG algorithms laying around? I lost my old C books which had several, and my google-fu is weak today.
-
Mar 29th, 2007, 01:20 PM
#6
Re: Random numbers showing bias...help!
can you show the code that you're using to decide whether it's unique or not.
VB's PRNG isn't brill, but that's mainly because of it's periodicity not it's distribution which should be pretty uniform
my test:
Code:
1 530
2 530
3 577
4 582
5 548
6 574
7 557
8 540
9 562
code i used
Code:
' Requires reference to Microsoft Scripting Runtime
Private Sub Form_Load()
Dim lCount(1 To 9) As Long
Dim dict As New Dictionary
Dim lNum As Long
Dim v As Variant
Randomize
Do Until dict.Count = 5000
lNum = Int(900000 * Rnd + 100000)
If Not dict.Exists(lNum) Then dict.Add lNum, lNum
Loop
For Each v In dict
lCount(Asc(CStr(v)) - 48) = lCount(Asc(CStr(v)) - 48) + 1
Next v
For lNum = 1 To 9
Debug.Print lNum, lCount(lNum)
Next lNum
End Sub
-
Mar 29th, 2007, 02:45 PM
#7
Re: Random numbers showing bias...help!
That's some awesomely compact code you got there, bushmobile. And it's encouraging. I'll see if I can pull out my code to a module, substituting an array for the database. If I can reproduce it that way, I'll post it here in a bit.
As for your specific question, the actual code to generate the numbers is this:
Code:
Do While True
lngID = Int(900000 * Rnd) + 100000
rst.Seek "=", lngID
If rst.NoMatch Then
rst.Bookmark = bk
rst.Edit
rst!TitleID = lngID
rst.Update
Exit Do
End If
Loop
This is using DAO inside an MS-Access 2002 module.
I can provide more of the code if you like.
-
Mar 29th, 2007, 03:05 PM
#8
Re: Random numbers showing bias...help!
Hmmm. I created a test function using arrays instead of a database, and I get a normal distribution instead of the biased one.
My head hurts.
-
Mar 29th, 2007, 03:37 PM
#9
Re: Random numbers showing bias...help!
Ellis, this code is rather simplistic (or is it)? It required two seconds to generate 5000 unique 6-digit random integers on my machine:
Code:
Dim Unique() As Long, ItemCount As Integer
Private Sub Form_Load()
ReDim Unique(5000)
Randomize
While ItemCount < 5000
ItemCount = ItemCount + 1
Unique(ItemCount) = Int(Rnd * 900000) + 100000
For I = 1 To ItemCount - 1
If Unique(I) = Unique(ItemCount) Then
ItemCount = ItemCount - 1
Exit For
End If
Next
Wend
For I = 1 To ItemCount
List1.AddItem Str(Unique(I))
Next
End Sub
I threw all the numbers into a textbox to eyeball them.
-
Mar 29th, 2007, 03:40 PM
#10
Re: Random numbers showing bias...help!
Okay, I can reproduce it, but it only happens when using the table. So here's a stripped-down access database that contains:
- An empty table with two fields
- A module holding the function that creates the IDs
- A query that shows the results
- A macro to run the code and open the query (don't worry, it doesn't auto-execute on startup)
If anyone has any ideas, I would be grateful. Also, if any mods would like to kick this over to the database forum, that might be best.
Last edited by Ellis Dee; Sep 21st, 2007 at 09:46 AM.
-
Mar 29th, 2007, 04:42 PM
#11
Re: Random numbers showing bias...help!
 Originally Posted by Code Doc
Ellis, this code is rather simplistic (or is it)? It required two seconds to generate 5000 unique 6-digit random integers on my machine:
snip
I threw all the numbers into a textbox to eyeball them.
this'd be a faster way:
Code:
Private Sub Command2_Click()
Dim bNum() As Byte, lNums() As Long
Dim N As Long, lRnd As Long
ReDim bNum(100000 To 999999)
ReDim lNums(5000)
Randomize
Do While N < 5000
lRnd = Int(Rnd * 900000) + 100000
If Not bNum(lRnd) = 1 Then
bNum(lRnd) = 1
lNums(N) = lRnd
N = N + 1
End If
Loop
Erase bNum
End Sub
-
Mar 29th, 2007, 04:53 PM
#12
-
Mar 29th, 2007, 05:07 PM
#13
Re: Random numbers showing bias...help!
@Code Doc: it's just the amount of looping involved - in your code the line If Unique(I) = Unique(ItemCount) Then is executed (a minimum of ) ~12.5 million times 
@Ellis Dee: I'm afraid I don't know enough about DAO (or VBA) to help you - perhaps generate a list of unique random numbers and then populate your table.
Anyhoo - just PM a mod if you want the thread moved, Rob's on a the mo.
-
Mar 30th, 2007, 02:06 AM
#14
Re: Random numbers showing bias...help!
 Originally Posted by bushmobile
@Ellis Dee: I'm afraid I don't know enough about DAO (or VBA) to help you - perhaps generate a list of unique random numbers and then populate your table.
Anyhoo - just PM a mod if you want the thread moved, Rob's on a the mo.
bushmobile, thanks for trying. Were you able to reproduce the results I describe? (If, that is, you downloaded the file.) I'm wondering if it's a version issue, but if somebody with a newer Access version can reproduce the behavior, then version issues get ruled out.
I must admit, this problem has me completely baffled.
I already PM'd Hack.
ETA: The problem isn't just with this conversion, though. When the system gets put into production, many more records will get added, and those new records will almost surely have the same issue. Of course, it'll only become apparent gradually, so that's a downer.
My current idea is to reverse the digits, since the last digit has a proper distribution. That's far less than ideal, however, since then I'd have to somehow filter out the leading zeros.
Ah well. A new day, freshly rested, maybe I'll figure it out before the board wakes up proper. If so I'll post the resolution.
Last edited by Ellis Dee; Mar 30th, 2007 at 02:15 AM.
-
Mar 30th, 2007, 03:11 AM
#15
Re: Random numbers showing bias...help!
 Originally Posted by Ellis Dee
Ah well. A new day, freshly rested, maybe I'll figure it out before the board wakes up proper. If so I'll post the resolution.
Ha, this is almost always the way to solve a problem: get some sleep. Fair warning: this post is a long description of exactly what the problem was, how it was resolved, and the odd side effects from the flawed methodology.
So my archaic habits came back to bite me on this one. bushmobile deserves some props for intuitively grasping the source of the problem immediately. The very first thing he posted to the thread was this: "can you show the code that you're using to decide whether it's unique or not." And that's exactly where the problem was.
I started programming professionally back in the late 80s, when dBase IV was all the rage. (Ugh.) One of the first things you learn is to never modify indexed values when that index is active. Once I moved from xBase over to VB/Access in the mid 90s, I pretty much abandoned the xBase style of doing things in favor of SQL. VB/Access can perfectly mirror the xBase style with table-type recordsets. The thing is, the Seek method -- which is only available to table-type recordsets -- is about a zillion times faster than any SQL statement, so I've gotten back into that old-school approach.
My loop was based on an indexed field that I was modifying while its index was active. (It has to be open in order to use Seek to determine if it was unique or not.) The solution is to switch back and forth between the indexes inside the loop.
The faulty behavior is still puzzling; if the index is kicking the current record around, how is the loop populating every record? I added a secondary table to check what was going on, plus a counter field to the first table. Apparently the loop was generating around 5200 random IDs -- usually with no more than a literal handful of duplicates (ten!) -- to fill the 1932 records. Every single ID in the main table was generated at least twice, some as many as seven times!
For the curious, here's the resolution database, which includes macros to show the problem and the fix. In the end, the basic lesson is one I've known for 20 years: Don't modify an indexed value, because the resulting behavior is undefined. (I'm such an idiot.)
Last edited by Ellis Dee; Sep 21st, 2007 at 09:46 AM.
-
Mar 30th, 2007, 10:05 AM
#16
Re: [RESOLVED] Random numbers showing bias...help!
Bushmobile said, "@Code Doc: it's just the amount of looping involved - in your code the line If Unique(I) = Unique(ItemCount) Then is executed (a minimum of ) ~12.5 million times."
----------------------
Agreed. The only shortcomings I see in your code is that the lookup byte vector is huge (100000 to 999999) and thus somewhat of a memory hog. However, the improved execution time is noteworthy indeed.
I believe both of our codes would start to break down when the number of unique random numbers required (subset n) gets closer and closer in size to the range (population N). Then millions of random numbers have to be generated until a unique number is found, even though your lookup vector approach is vastly superior when ruling them out. At worst case, when n = N, the process is indeed brute force on the last few unique items.
When n starts to approach N, I use a completely different approach:
Pseudocode
1) Build a starting array of integers from from 1 to N (the population).
2) Scramble the entire array using N swaps.
3) Select the subset (n) from the array by using the first n elements.
Now you are talking about 3 continuous loops and no comparisons required. I'm not even sure that N random swaps are required in step 2 to ensure randomness within the population array, but I usually do that out of habit.
Last edited by Code Doc; Mar 30th, 2007 at 10:07 AM.
Reason: typo
Doctor Ed
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|