从 SQL Server 表中随机选择 n 行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/848872/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-01 01:59:55  来源:igfitidea点击:

Select n random rows from SQL Server table

sqlsql-serverrandom

提问by John M Gant

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.

我有一个包含大约 50,000 行的 SQL Server 表。我想随机选择大约 5,000 行。我想到了一种复杂的方法,创建一个带有“随机数”列的临时表,将我的表复制到该表中,循环遍历临时表并使用 更新每一行RAND(),然后从该表中选择随机数列 < 0.1. 我正在寻找一种更简单的方法来做到这一点,如果可能的话,在一个语句中。

This articlesuggest using the NEWID()function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.

本文建议使用该NEWID()功能。这看起来很有希望,但我看不出如何可靠地选择一定比例的行。

Anybody ever do this before? Any ideas?

以前有人这样做过吗?有任何想法吗?

回答by Ralph Shillington

select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

回应有关大表的“纯垃圾”评论:您可以这样做以提高性能。

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

这样做的成本将是值的键扫描加上连接成本,这在具有小百分比选择的大表上应该是合理的。

回答by Patrick Taylor

Depending on your needs, TABLESAMPLEwill get you nearly as random and better performance. this is available on MS SQL server 2005 and later.

根据您的需要,TABLESAMPLE将为您提供几乎相同的随机性和更好的性能。这在 MS SQL Server 2005 及更高版本上可用。

TABLESAMPLEwill return data from random pages instead of random rows and therefore deos not even retrieve data that it will not return.

TABLESAMPLE将从随机页面而不是随机行返回数据,因此 deos 甚至不会检索它不会返回的数据。

On a very large table I tested

在我测试的一张非常大的桌子上

select top 1 percent * from [tablename] order by newid()

took more than 20 minutes.

花了20多分钟。

select * from [tablename] tablesample(1 percent)

took 2 minutes.

用了 2 分钟。

Performance will also improve on smaller samples in TABLESAMPLEwhereas it will not with newid().

在较小的样本上性能也会提高,TABLESAMPLE而在newid().

Please keep in mind that this is not as random as the newid()method but will give you a decent sampling.

请记住,这不像newid()方法那样随机,但会给你一个不错的抽样。

See the MSDN page.

请参阅MSDN 页面

回答by Rob Boek

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

newid()/order by 可以工作,但对于大型结果集来说会非常昂贵,因为它必须为每一行生成一个 id,然后对它们进行排序。

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

从性能的角度来看,TABLESAMPLE() 是好的,但是您会得到成簇的结果(将返回页面上的所有行)。

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

对于性能更好的真随机样本,最好的方法是随机过滤掉行。我在 SQL Server Books Online 文章Limiting Results Sets by Using TABLESAMPLE 中找到了以下代码示例:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

如果您真的想要单个行的随机样本,请修改您的查询以随机过滤掉行,而不是使用 TABLESAMPLE。例如,以下查询使用 NEWID 函数返回 Sales.SalesOrderDetail 表中大约百分之一的行:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

SalesOrderID 列包含在 CHECKSUM 表达式中,以便 NEWID() 每行计算一次以实现基于每行的采样。表达式 CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) 计算结果为 0 和 1 之间的随机浮点值。

When run against a table with 1,000,000 rows, here are my results:

对包含 1,000,000 行的表运行时,我的结果如下:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.

如果您可以避免使用 TABLESAMPLE,它将为您提供最佳性能。否则使用 newid()/filter 方法。如果您的结果集很大,则 newid()/order by 应该是最后的手段。

回答by Kyle McClellan

Selecting Rows Randomly from a Large Tableon MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.

MSDN 上的大表中随机选择行有一个简单、明确的解决方案,可以解决大规模的性能问题。

  SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

回答by RJardines

This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows.

此链接对具有 1、7 和 13 百万行的表的 Orderby(NEWID()) 和其他方法进行了有趣的比较。

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.

通常,当讨论组中提出有关如何选择随机行的问题时,会建​​议使用 NEWID 查询;它很简单,非常适合小桌子。

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

但是,当您将 NEWID 查询用于大型表时,它有一个很大的缺点。ORDER BY 子句导致表中的所有行都被复制到 tempdb 数据库中,并在其中对它们进行排序。这会导致两个问题:

  1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
  2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.
  1. 排序操作通常具有与之相关的高成本。排序可以使用大量磁盘 I/O,并且可以运行很长时间。
  2. 在最坏的情况下,tempdb 可能会耗尽空间。在最好的情况下,tempdb 会占用大量磁盘空间,如果没有手动收缩命令,这些空间永远不会被回收。

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

您需要的是一种随机选择行的方法,它不会使用 tempdb,并且不会随着表变大而变慢。这是关于如何做到这一点的新想法:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

此查询背后的基本思想是,我们希望为表中的每一行生成一个 0 到 99 之间的随机数,然后选择随机数小于指定百分比值的所有行。在本例中,我们希望随机选择大约 10% 的行;因此,我们选择随机数小于 10 的所有行。

Please read the full article in MSDN.

请阅读MSDN 中的完整文章。

回答by Oskar Austegard

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:

如果您(与 OP 不同)需要特定数量的记录(这使得 CHECKSUM 方法变得困难)并希望获得比 TABLESAMPLE 本身提供的更随机的样本,并且还想要比 CHECKSUM 更快的速度,您可以通过合并TABLESAMPLE 和 NEWID() 方法,如下所示:

DECLARE @sampleCount int = 50
SET STATISTICS TIME ON

SELECT TOP (@sampleCount) * 
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()

SET STATISTICS TIME OFF

In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. (Note that TABLESAMPLE will not accept a variable)

就我而言,这是随机性(我知道不是真的)和速度之间最直接的折衷。根据需要改变 TABLESAMPLE 百分比(或行) - 百分比越高,样本越随机,但预计速度会线性下降。(注意 TABLESAMPLE 不接受变量)

回答by Daniel Brückner

Just order the table by a random number and obtain the first 5,000 rows using TOP.

只需按随机数对表进行排序,然后使用 获取前 5,000 行TOP

SELECT TOP 5000 * FROM [Table] ORDER BY newid();

UPDATE

更新

Just tried it and a newid()call is sufficent - no need for all the casts and all the math.

刚刚试了一下,一个newid()电话就足够了——不需要所有的演员表和所有的数学。

回答by Nanki

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():

这是初始种子想法和校验和的组合,在我看来,它可以提供适当的随机结果,而无需使用 NEWID():

SELECT TOP [number] 
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())

回答by Jeff Ferland

In MySQL you can do this:

在 MySQL 中,您可以这样做:

SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;

回答by klyd

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

还没有完全看到答案中的这种变化。我有一个额外的约束,我需要给定一个初始种子,每次选择相同的行集。

For MS SQL:

对于 MS SQL:

Minimum example:

最小示例:

select top 10 percent *
from table_name
order by rand(checksum(*))

Normalized execution time: 1.00

标准化执行时间:1.00

NewId() example:

NewId() 示例:

select top 10 percent *
from table_name
order by newid()

Normalized execution time: 1.02

标准化执行时间:1.02

NewId()is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.

NewId()比 慢一点rand(checksum(*)),因此您可能不想将它用于大型记录集。

Selection with Initial Seed:

用初始种子选择:

declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */

If you need to select the same set given a seed, this seems to work.

如果您需要在给定种子的情况下选择相同的集合,这似乎可行。