Results 1 to 16 of 16

Thread: [RESOLVED] Random numbers showing bias...help!

  1. #1

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Resolved [RESOLVED] Random numbers showing bias...help!

    First off, I love autonumbers, and almost always use them. For this one project, however, I need my unique identifiers to have exactly six digits, so I wrote some very basic code to generate random numbers.

    When converting the data over from the old version to the new one, I first call Randomize Timer. Then I assign new IDs as follows:

    begin loop
    lngID = Int(900000 * Rnd + 100000)
    if lngID is unique, save new ID and exit loop
    end loop

    This range gives me a six digit number that will never start with 0. The problem is that my distribution is FUBAR. Running it now I get the following distribution based on the first digit of the ID:
    Code:
    Digit	IDs
    1	368
    2	335
    3	292
    4	257
    5	226
    6	177
    7	152
    8	91
    9	34
    
    Lowest ID: 100709
    Highest ID: 994612
    Running it a second time, I get:
    Code:
    Digit	IDs
    1	340
    2	333
    3	317
    4	265
    5	225
    6	182
    7	145
    8	94
    9	31
    
    Lowest ID: 100619
    Highest ID: 982853
    Clearly the range is working properly, and it successfully gives me the thousands of unique IDs I need, but the distribution is pissing me off. Not only does it offend my perfectionist nature, but it's messing with the intent of these numbers in the first place.

    This system will eventually be managing tens of thousands of files on disk. Too many to store in a single data folder with any kind of decent access speed, so I spread them out over 10 subfolders, each (theoretically) holding 10% of the files.

    I can get around the problem by using the last digit instead of the first to define which subfolder a given file belongs in; the last digit spread is uniform. But what is up with this biased distribution on the first digit? I certainly didn't code anything in there to skew the results like this.

    Any ideas?

    Could it be that the Rnd function is not liking the large numbers?
    Could it be that I'm doing two tables at once? (Both showing the same bias.)
    Could it be something inherant to the random number generator that only becomes apparent when filtering out the duplicates?

    If nobody has any ideas, I'm open to any manual algorithms for generating random numbers.

  2. #2
    Banned timeshifter's Avatar
    Join Date
    Mar 2004
    Location
    at my desk
    Posts
    2,465

    Re: Random numbers showing bias...help!

    Put Randomize in the top of the sub. This will reset the app's seed to Windows Time and should kick most of the noticable bias.

  3. #3

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Quote Originally Posted by timeshifter
    Put Randomize in the top of the sub. This will reset the app's seed to Windows Time and should kick most of the noticable bias.
    This problem is in the conversion utility, which is a run once kinda deal. (Though I've certainly been running it a lot, in the end it will only run once.)

    The same code exists in the actual application, but I'm assuming if I can fix the problem in the converter, I can apply the same fix to the application.

    In both the converter and the application, I call Randomize Timer on startup, and then never again. For the converter that translates into calling Randomize Timer each time the conversion is run.

    But seriously, even if I weren't seeding the PRNG with the system time, this distribution is way messed up.

  4. #4
    Banned timeshifter's Avatar
    Join Date
    Mar 2004
    Location
    at my desk
    Posts
    2,465

    Re: Random numbers showing bias...help!

    Drop the Timer off of that statement. It's unnecessary.

    Also remember that randomality is a tricky beast, since the probability of me throwing a die twice, on the first set getting a 1 and a 6, on the second set getting two 3's, is identical. It is very difficult to determine if something is biased as it just may be how the random number generator picks the numbers.

  5. #5

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Quote Originally Posted by timeshifter
    Drop the Timer off of that statement. It's unnecessary.

    Also remember that randomality is a tricky beast, since the probability of me throwing a die twice, on the first set getting a 1 and a 6, on the second set getting two 3's, is identical. It is very difficult to determine if something is biased as it just may be how the random number generator picks the numbers.
    Technically there are two ways to get a 1 and 6 and only one way to get two 3s, so you're twice as likely to get the 1 and 6. But that's neither here nor there.

    Running this multiple times I consistently get 10 times as many 1s as 9s. That isn't random, even if I squint.

    Anyone have any mathematical PRNG algorithms laying around? I lost my old C books which had several, and my google-fu is weak today.

  6. #6
    Oi, fat-rag! bushmobile's Avatar
    Join Date
    Mar 2004
    Location
    on the poop deck
    Posts
    5,592

    Re: Random numbers showing bias...help!

    can you show the code that you're using to decide whether it's unique or not.

    VB's PRNG isn't brill, but that's mainly because of it's periodicity not it's distribution which should be pretty uniform

    my test:
    Code:
     1             530 
     2             530 
     3             577 
     4             582 
     5             548 
     6             574 
     7             557 
     8             540 
     9             562
    code i used
    Code:
    ' Requires reference to Microsoft Scripting Runtime
    Private Sub Form_Load()
        Dim lCount(1 To 9) As Long
        Dim dict As New Dictionary
        Dim lNum As Long
        Dim v As Variant
        
        Randomize
    
        Do Until dict.Count = 5000
            lNum = Int(900000 * Rnd + 100000)
            If Not dict.Exists(lNum) Then dict.Add lNum, lNum
        Loop
        
        For Each v In dict
            lCount(Asc(CStr(v)) - 48) = lCount(Asc(CStr(v)) - 48) + 1
        Next v
        
        For lNum = 1 To 9
            Debug.Print lNum, lCount(lNum)
        Next lNum
    End Sub

  7. #7

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    That's some awesomely compact code you got there, bushmobile. And it's encouraging. I'll see if I can pull out my code to a module, substituting an array for the database. If I can reproduce it that way, I'll post it here in a bit.

    As for your specific question, the actual code to generate the numbers is this:
    Code:
            Do While True
                lngID = Int(900000 * Rnd) + 100000
                rst.Seek "=", lngID
                If rst.NoMatch Then
                    rst.Bookmark = bk
                    rst.Edit
                    rst!TitleID = lngID
                    rst.Update
                    Exit Do
                End If
            Loop
    This is using DAO inside an MS-Access 2002 module.

    I can provide more of the code if you like.

  8. #8

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Hmmm. I created a test function using arrays instead of a database, and I get a normal distribution instead of the biased one.

    My head hurts.

  9. #9
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Random numbers showing bias...help!

    Ellis, this code is rather simplistic (or is it)? It required two seconds to generate 5000 unique 6-digit random integers on my machine:
    Code:
    Dim Unique() As Long, ItemCount As Integer
    Private Sub Form_Load()
    ReDim Unique(5000)
    Randomize
    While ItemCount < 5000
        ItemCount = ItemCount + 1
        Unique(ItemCount) = Int(Rnd * 900000) + 100000
        For I = 1 To ItemCount - 1
            If Unique(I) = Unique(ItemCount) Then
                ItemCount = ItemCount - 1
                Exit For
            End If
        Next
    Wend
    For I = 1 To ItemCount
        List1.AddItem Str(Unique(I))
    Next
    End Sub
    I threw all the numbers into a textbox to eyeball them.
    Doctor Ed

  10. #10

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Okay, I can reproduce it, but it only happens when using the table. So here's a stripped-down access database that contains:

    - An empty table with two fields
    - A module holding the function that creates the IDs
    - A query that shows the results
    - A macro to run the code and open the query (don't worry, it doesn't auto-execute on startup)

    If anyone has any ideas, I would be grateful. Also, if any mods would like to kick this over to the database forum, that might be best.
    Last edited by Ellis Dee; Sep 21st, 2007 at 09:46 AM.

  11. #11
    Oi, fat-rag! bushmobile's Avatar
    Join Date
    Mar 2004
    Location
    on the poop deck
    Posts
    5,592

    Re: Random numbers showing bias...help!

    Quote Originally Posted by Code Doc
    Ellis, this code is rather simplistic (or is it)? It required two seconds to generate 5000 unique 6-digit random integers on my machine:
    snip
    I threw all the numbers into a textbox to eyeball them.
    this'd be a faster way:
    Code:
    Private Sub Command2_Click()
        Dim bNum() As Byte, lNums() As Long
        Dim N As Long, lRnd As Long
        
        ReDim bNum(100000 To 999999)
        ReDim lNums(5000)
        Randomize
        
        Do While N < 5000
            lRnd = Int(Rnd * 900000) + 100000
            If Not bNum(lRnd) = 1 Then
                bNum(lRnd) = 1
                lNums(N) = lRnd
                N = N + 1
            End If
        Loop
        
        Erase bNum
    End Sub

  12. #12
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: Random numbers showing bias...help!

    Bushmobile's code finishes in one eyeblink (I added the listbox filling to be fair). I can usually blink my eyes in 1/50 of a second. Therefore, his routine executes about 100 times faster than mine to do exactly the same thing.
    Doctor Ed

  13. #13
    Oi, fat-rag! bushmobile's Avatar
    Join Date
    Mar 2004
    Location
    on the poop deck
    Posts
    5,592

    Re: Random numbers showing bias...help!

    @Code Doc: it's just the amount of looping involved - in your code the line If Unique(I) = Unique(ItemCount) Then is executed (a minimum of ) ~12.5 million times

    @Ellis Dee: I'm afraid I don't know enough about DAO (or VBA) to help you - perhaps generate a list of unique random numbers and then populate your table.

    Anyhoo - just PM a mod if you want the thread moved, Rob's on a the mo.

  14. #14

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Quote Originally Posted by bushmobile
    @Ellis Dee: I'm afraid I don't know enough about DAO (or VBA) to help you - perhaps generate a list of unique random numbers and then populate your table.

    Anyhoo - just PM a mod if you want the thread moved, Rob's on a the mo.
    bushmobile, thanks for trying. Were you able to reproduce the results I describe? (If, that is, you downloaded the file.) I'm wondering if it's a version issue, but if somebody with a newer Access version can reproduce the behavior, then version issues get ruled out.

    I must admit, this problem has me completely baffled.

    I already PM'd Hack.

    ETA: The problem isn't just with this conversion, though. When the system gets put into production, many more records will get added, and those new records will almost surely have the same issue. Of course, it'll only become apparent gradually, so that's a downer.

    My current idea is to reverse the digits, since the last digit has a proper distribution. That's far less than ideal, however, since then I'd have to somehow filter out the leading zeros.

    Ah well. A new day, freshly rested, maybe I'll figure it out before the board wakes up proper. If so I'll post the resolution.
    Last edited by Ellis Dee; Mar 30th, 2007 at 02:15 AM.

  15. #15

    Thread Starter
    PowerPoster Ellis Dee's Avatar
    Join Date
    Mar 2007
    Location
    New England
    Posts
    3,530

    Re: Random numbers showing bias...help!

    Quote Originally Posted by Ellis Dee
    Ah well. A new day, freshly rested, maybe I'll figure it out before the board wakes up proper. If so I'll post the resolution.
    Ha, this is almost always the way to solve a problem: get some sleep. Fair warning: this post is a long description of exactly what the problem was, how it was resolved, and the odd side effects from the flawed methodology.

    So my archaic habits came back to bite me on this one. bushmobile deserves some props for intuitively grasping the source of the problem immediately. The very first thing he posted to the thread was this: "can you show the code that you're using to decide whether it's unique or not." And that's exactly where the problem was.

    I started programming professionally back in the late 80s, when dBase IV was all the rage. (Ugh.) One of the first things you learn is to never modify indexed values when that index is active. Once I moved from xBase over to VB/Access in the mid 90s, I pretty much abandoned the xBase style of doing things in favor of SQL. VB/Access can perfectly mirror the xBase style with table-type recordsets. The thing is, the Seek method -- which is only available to table-type recordsets -- is about a zillion times faster than any SQL statement, so I've gotten back into that old-school approach.

    My loop was based on an indexed field that I was modifying while its index was active. (It has to be open in order to use Seek to determine if it was unique or not.) The solution is to switch back and forth between the indexes inside the loop.

    The faulty behavior is still puzzling; if the index is kicking the current record around, how is the loop populating every record? I added a secondary table to check what was going on, plus a counter field to the first table. Apparently the loop was generating around 5200 random IDs -- usually with no more than a literal handful of duplicates (ten!) -- to fill the 1932 records. Every single ID in the main table was generated at least twice, some as many as seven times!

    For the curious, here's the resolution database, which includes macros to show the problem and the fix. In the end, the basic lesson is one I've known for 20 years: Don't modify an indexed value, because the resulting behavior is undefined. (I'm such an idiot.)
    Last edited by Ellis Dee; Sep 21st, 2007 at 09:46 AM.

  16. #16
    PowerPoster Code Doc's Avatar
    Join Date
    Mar 2007
    Location
    Omaha, Nebraska
    Posts
    2,354

    Re: [RESOLVED] Random numbers showing bias...help!

    Bushmobile said, "@Code Doc: it's just the amount of looping involved - in your code the line If Unique(I) = Unique(ItemCount) Then is executed (a minimum of ) ~12.5 million times."
    ----------------------
    Agreed. The only shortcomings I see in your code is that the lookup byte vector is huge (100000 to 999999) and thus somewhat of a memory hog. However, the improved execution time is noteworthy indeed.

    I believe both of our codes would start to break down when the number of unique random numbers required (subset n) gets closer and closer in size to the range (population N). Then millions of random numbers have to be generated until a unique number is found, even though your lookup vector approach is vastly superior when ruling them out. At worst case, when n = N, the process is indeed brute force on the last few unique items.

    When n starts to approach N, I use a completely different approach:
    Pseudocode
    1) Build a starting array of integers from from 1 to N (the population).
    2) Scramble the entire array using N swaps.
    3) Select the subset (n) from the array by using the first n elements.

    Now you are talking about 3 continuous loops and no comparisons required. I'm not even sure that N random swaps are required in step 2 to ensure randomness within the population array, but I usually do that out of habit.
    Last edited by Code Doc; Mar 30th, 2007 at 10:07 AM. Reason: typo
    Doctor Ed

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width