[RESOLVED] Method to select statistically random rows from SQL server

**joshAU** · Sep 29th, 2010, 12:29 AM

Hi people, long time no speak.

I've been looking into a few methods to select a list of statistically random rows from a sql db from vb6 (and maybe vb2008 as well).

One that seems the most interesting is on these forums, posted by szlamany Link here:
http://www.vbforums.com/showthread.p...andom+rows+sql

Essentially:

A concept you can use for any database would be:

1) Determine the number of rows in the table - SELECT COUNT(*) will do that
2) Use RND() from the client side - with the number of row value - to determine which record you want
3) Do a query something like this to get that one record

Code:
Select Top 1 *
From (Select Top {TheRandomRowFoundinStep2} *
From SomeTable Order by PriKey Desc) "XYZ"

Another method I've seen, amongst many, is to use:

Code:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

I havent yet really tried either method, but my question, I guess, is:
Is using Rnd() or NEWID() actually statistically random, or psuedo random?
I would prefer a method that is statistically random.
I suspect that szlamany's method may be the better way for me, but the second option may be easier to implement (for me at least!)

If neither of these methods are so, could anyone suggest a way to do it.
Essentially I just want to select 100 random records as statistical samples.
Thanks
josh

**TysonLPrice** · Sep 29th, 2010, 04:20 AM

This might be worth your time looking at:

http://haacked.com/archive/2004/06/21/658.aspx

**MarMan** · Sep 29th, 2010, 07:11 AM

RND() is psuedo random (formula based).

**GaryMazzone** · Sep 29th, 2010, 07:26 AM

I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)

**TysonLPrice** · Sep 29th, 2010, 08:00 AM

Originally Posted by GaryMazzone

I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)

I tried that on a large table, couple million rows, and it chugged along for quite a while. I didn't time it but a couple of minutes at least.

**GaryMazzone** · Sep 29th, 2010, 08:02 AM

So have I it depends on what you want to happen. Is this a one time deal for testing... once in a while to get data for development? If you need to generate a newId for each row and order a couple of million row table it might take a couple of min to do.

**joshAU** · Sep 29th, 2010, 08:06 AM

Thanks for your advice.

@TysonLPrice
Thanks, I had seen that post while googling it, but had another read off it and clarified a few points.

@MarMan
Thanks, I suspected that was the case.

@GaryMazzone
Thanks, % is a good idea, but for the time being at least I just need a fixed number of rows.

I've ended up using:

Code:

SELECT TOP 100 column FROM table ORDER BY NEWID()

which APPEARS to be random, and from my limited understanding of it, will not produce duplicate rows.

I guess I just don't have an understanding of how NEWID() works, so for now I think I'll just go with that method on face value, until I get around to running some some tests on it.

The following I've quoted from:
http://msdn.microsoft.com/en-us/library/Aa175776

which was a link off the page TysonLPrice recommended earlier.

The seeded random technique should never be used. It gives results that are dangerous and misleading, because it can give seemingly random results in one run, and completely non-random results in the next.

Pure random is the best technique, provided that you can afford the overhead of row-at-a-time operations to set up the table for sampling, but if this overhead is too great, consider using NewID(). Bear in mind, though, that because NewID() is assigned by the operating system, you should always test its behavior on the target system–perhaps using the code included in the accompanying Download as a starting point.

This conclusion was based on doing a Chi-squared test on each of the three methods discussed in the article, which also includes the procedures used.

Not really conclusive though, given:

because NewID() is assigned by the operating system, you should always test its behavior on the target system

So yes, I guess i'm off to do some chi-sqared tests to satisfy myself...

EDITED: sorry must have posted after two new messages I hadnt seen...

Thanks again, TysonLPrice and GaryMazzone.
I had read that it could run slowly.
Essentially I'm using it to test the accuracy of a few different prediction methods, and want to get random sample rows to calculate the error in those predictions in order to acertain the most accurate method. As such, I probably will only need to call it occasionally to recheck each method to assure myself of their relative accuracys, and therefore wont be a regularly used proceedure.
Furthermore, while the main table has 500,000 records, the table I'm using for this proceedure is only approx 40,000, so it appears to complete immediately for samples of 100 or so rows.

Thanks again for all your input.

Thread: [RESOLVED] Method to select statistically random rows from SQL server

Thread Tools

Display

[RESOLVED] Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Re: Method to select statistically random rows from SQL server

Posting Permissions