Results 1 to 7 of 7

Thread: [RESOLVED] Method to select statistically random rows from SQL server

  1. #1

    Thread Starter
    Lively Member
    Join Date
    Dec 2008
    Location
    Hobart, Tasmania
    Posts
    104

    Resolved [RESOLVED] Method to select statistically random rows from SQL server

    Hi people, long time no speak.

    I've been looking into a few methods to select a list of statistically random rows from a sql db from vb6 (and maybe vb2008 as well).

    One that seems the most interesting is on these forums, posted by szlamany Link here:
    http://www.vbforums.com/showthread.p...andom+rows+sql

    Essentially:
    A concept you can use for any database would be:

    1) Determine the number of rows in the table - SELECT COUNT(*) will do that
    2) Use RND() from the client side - with the number of row value - to determine which record you want
    3) Do a query something like this to get that one record


    Code:
    Select Top 1 *
    From (Select Top {TheRandomRowFoundinStep2} *
    From SomeTable Order by PriKey Desc) "XYZ"
    Another method I've seen, amongst many, is to use:
    Code:
    SELECT TOP 1 column FROM table
    ORDER BY NEWID()
    I havent yet really tried either method, but my question, I guess, is:
    Is using Rnd() or NEWID() actually statistically random, or psuedo random?
    I would prefer a method that is statistically random.
    I suspect that szlamany's method may be the better way for me, but the second option may be easier to implement (for me at least!)

    If neither of these methods are so, could anyone suggest a way to do it.
    Essentially I just want to select 100 random records as statistical samples.
    Thanks
    josh

  2. #2
    Wall Poster TysonLPrice's Avatar
    Join Date
    Sep 2002
    Location
    Columbus, Ohio
    Posts
    3,969

    Re: Method to select statistically random rows from SQL server

    This might be worth your time looking at:

    http://haacked.com/archive/2004/06/21/658.aspx

  3. #3
    Frenzied Member
    Join Date
    Jan 2010
    Location
    Connecticut
    Posts
    1,687

    Re: Method to select statistically random rows from SQL server

    RND() is psuedo random (formula based).
    VB6 Library

    If I helped you then please help me and rate my post!
    If you solved your problem, then please mark the post resolved

  4. #4
    A SQL Server fool GaryMazzone's Avatar
    Join Date
    Aug 2005
    Location
    Dover,NH
    Posts
    7,493

    Re: Method to select statistically random rows from SQL server

    I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)
    Sometimes the Programmer
    Sometimes the DBA

    Mazz1

  5. #5
    Wall Poster TysonLPrice's Avatar
    Join Date
    Sep 2002
    Location
    Columbus, Ohio
    Posts
    3,969

    Re: Method to select statistically random rows from SQL server

    Quote Originally Posted by GaryMazzone View Post
    I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)
    I tried that on a large table, couple million rows, and it chugged along for quite a while. I didn't time it but a couple of minutes at least.

  6. #6
    A SQL Server fool GaryMazzone's Avatar
    Join Date
    Aug 2005
    Location
    Dover,NH
    Posts
    7,493

    Re: Method to select statistically random rows from SQL server

    So have I it depends on what you want to happen. Is this a one time deal for testing... once in a while to get data for development? If you need to generate a newId for each row and order a couple of million row table it might take a couple of min to do.
    Sometimes the Programmer
    Sometimes the DBA

    Mazz1

  7. #7

    Thread Starter
    Lively Member
    Join Date
    Dec 2008
    Location
    Hobart, Tasmania
    Posts
    104

    Re: Method to select statistically random rows from SQL server

    Thanks for your advice.

    @TysonLPrice
    Thanks, I had seen that post while googling it, but had another read off it and clarified a few points.

    @MarMan
    Thanks, I suspected that was the case.

    @GaryMazzone
    Thanks, % is a good idea, but for the time being at least I just need a fixed number of rows.

    I've ended up using:
    Code:
    SELECT TOP 100 column FROM table ORDER BY NEWID()
    which APPEARS to be random, and from my limited understanding of it, will not produce duplicate rows.

    I guess I just don't have an understanding of how NEWID() works, so for now I think I'll just go with that method on face value, until I get around to running some some tests on it.

    The following I've quoted from:
    http://msdn.microsoft.com/en-us/library/Aa175776

    which was a link off the page TysonLPrice recommended earlier.

    The seeded random technique should never be used. It gives results that are dangerous and misleading, because it can give seemingly random results in one run, and completely non-random results in the next.

    Pure random is the best technique, provided that you can afford the overhead of row-at-a-time operations to set up the table for sampling, but if this overhead is too great, consider using NewID(). Bear in mind, though, that because NewID() is assigned by the operating system, you should always test its behavior on the target system–perhaps using the code included in the accompanying Download as a starting point.
    This conclusion was based on doing a Chi-squared test on each of the three methods discussed in the article, which also includes the procedures used.

    Not really conclusive though, given:
    because NewID() is assigned by the operating system, you should always test its behavior on the target system
    So yes, I guess i'm off to do some chi-sqared tests to satisfy myself...

    EDITED: sorry must have posted after two new messages I hadnt seen...

    Thanks again, TysonLPrice and GaryMazzone.
    I had read that it could run slowly.
    Essentially I'm using it to test the accuracy of a few different prediction methods, and want to get random sample rows to calculate the error in those predictions in order to acertain the most accurate method. As such, I probably will only need to call it occasionally to recheck each method to assure myself of their relative accuracys, and therefore wont be a regularly used proceedure.
    Furthermore, while the main table has 500,000 records, the table I'm using for this proceedure is only approx 40,000, so it appears to complete immediately for samples of 100 or so rows.


    Thanks again for all your input.
    Last edited by joshAU; Sep 29th, 2010 at 08:15 AM. Reason: missed 2 earlier posts while typing reply.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width