|
-
Sep 29th, 2010, 12:29 AM
#1
Thread Starter
Lively Member
[RESOLVED] Method to select statistically random rows from SQL server
Hi people, long time no speak.
I've been looking into a few methods to select a list of statistically random rows from a sql db from vb6 (and maybe vb2008 as well).
One that seems the most interesting is on these forums, posted by szlamany Link here:
http://www.vbforums.com/showthread.p...andom+rows+sql
Essentially:
A concept you can use for any database would be:
1) Determine the number of rows in the table - SELECT COUNT(*) will do that
2) Use RND() from the client side - with the number of row value - to determine which record you want
3) Do a query something like this to get that one record
Code:
Select Top 1 *
From (Select Top {TheRandomRowFoundinStep2} *
From SomeTable Order by PriKey Desc) "XYZ"
Another method I've seen, amongst many, is to use:
Code:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
I havent yet really tried either method, but my question, I guess, is:
Is using Rnd() or NEWID() actually statistically random, or psuedo random?
I would prefer a method that is statistically random.
I suspect that szlamany's method may be the better way for me, but the second option may be easier to implement (for me at least!)
If neither of these methods are so, could anyone suggest a way to do it.
Essentially I just want to select 100 random records as statistical samples.
Thanks
josh
-
Sep 29th, 2010, 04:20 AM
#2
Re: Method to select statistically random rows from SQL server
-
Sep 29th, 2010, 07:11 AM
#3
Re: Method to select statistically random rows from SQL server
RND() is psuedo random (formula based).
VB6 Library
If I helped you then please help me and rate my post!
If you solved your problem, then please mark the post resolved
-
Sep 29th, 2010, 07:26 AM
#4
Re: Method to select statistically random rows from SQL server
I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)
Sometimes the Programmer
Sometimes the DBA
Mazz1
-
Sep 29th, 2010, 08:00 AM
#5
Re: Method to select statistically random rows from SQL server
 Originally Posted by GaryMazzone
I would use the NEWID() method along with a count and figure out what perecntage of records you want (use a TOP X)
I tried that on a large table, couple million rows, and it chugged along for quite a while. I didn't time it but a couple of minutes at least.
-
Sep 29th, 2010, 08:02 AM
#6
Re: Method to select statistically random rows from SQL server
So have I it depends on what you want to happen. Is this a one time deal for testing... once in a while to get data for development? If you need to generate a newId for each row and order a couple of million row table it might take a couple of min to do.
Sometimes the Programmer
Sometimes the DBA
Mazz1
-
Sep 29th, 2010, 08:06 AM
#7
Thread Starter
Lively Member
Re: Method to select statistically random rows from SQL server
Thanks for your advice.
@TysonLPrice
Thanks, I had seen that post while googling it, but had another read off it and clarified a few points.
@MarMan
Thanks, I suspected that was the case.
@GaryMazzone
Thanks, % is a good idea, but for the time being at least I just need a fixed number of rows.
I've ended up using:
Code:
SELECT TOP 100 column FROM table ORDER BY NEWID()
which APPEARS to be random, and from my limited understanding of it, will not produce duplicate rows.
I guess I just don't have an understanding of how NEWID() works, so for now I think I'll just go with that method on face value, until I get around to running some some tests on it.
The following I've quoted from:
http://msdn.microsoft.com/en-us/library/Aa175776
which was a link off the page TysonLPrice recommended earlier.
The seeded random technique should never be used. It gives results that are dangerous and misleading, because it can give seemingly random results in one run, and completely non-random results in the next.
Pure random is the best technique, provided that you can afford the overhead of row-at-a-time operations to set up the table for sampling, but if this overhead is too great, consider using NewID(). Bear in mind, though, that because NewID() is assigned by the operating system, you should always test its behavior on the target system–perhaps using the code included in the accompanying Download as a starting point.
This conclusion was based on doing a Chi-squared test on each of the three methods discussed in the article, which also includes the procedures used.
Not really conclusive though, given:
because NewID() is assigned by the operating system, you should always test its behavior on the target system
So yes, I guess i'm off to do some chi-sqared tests to satisfy myself...
EDITED: sorry must have posted after two new messages I hadnt seen...
Thanks again, TysonLPrice and GaryMazzone.
I had read that it could run slowly.
Essentially I'm using it to test the accuracy of a few different prediction methods, and want to get random sample rows to calculate the error in those predictions in order to acertain the most accurate method. As such, I probably will only need to call it occasionally to recheck each method to assure myself of their relative accuracys, and therefore wont be a regularly used proceedure.
Furthermore, while the main table has 500,000 records, the table I'm using for this proceedure is only approx 40,000, so it appears to complete immediately for samples of 100 or so rows.
Thanks again for all your input.
Last edited by joshAU; Sep 29th, 2010 at 08:15 AM.
Reason: missed 2 earlier posts while typing reply.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|