如何在 SQL 中请求随机行?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/19412/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-31 23:08:16  来源:igfitidea点击:

How to request a random row in SQL?

sqlrandom

提问by sverrejoh

How can I request a random row (or as close to truly random as is possible) in pure SQL?

如何在纯 SQL 中请求随机行(或尽可能接近真正随机)?

回答by Yaakov Ellis

See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):

请参阅这篇文章:SQL to Select a random row from a database table。它通过在 MySQL、PostgreSQL、Microsoft SQL Server、IBM DB2 和 Oracle 中执行此操作的方法(以下内容是从该链接复制的):

Select a random row with MySQL:

使用 MySQL 随机选择一行:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

Select a random row with PostgreSQL:

使用 PostgreSQL 随机选择一行:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Select a random row with Microsoft SQL Server:

使用 Microsoft SQL Server 随机选择一行:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

Select a random row with IBM DB2

使用 IBM DB2 随机选择一行

SELECT column, RAND() as IDX 
FROM table 
ORDER BY IDX FETCH FIRST 1 ROWS ONLY

Select a random record with Oracle:

使用 Oracle 随机选择一条记录:

SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1

回答by Grey Panther

Solutions like Jeremies:

像杰里米斯这样的解决方案:

SELECT * FROM table ORDER BY RAND() LIMIT 1

work, but they need a sequential scan of all the table (because the random value associated with each row needs to be calculated - so that the smallest one can be determined), which can be quite slow for even medium sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:

工作,但他们需要对所有表进行顺序扫描(因为需要计算与每一行关联的随机值 - 以便可以确定最小的值),即使对于中等大小的表,这也可能非常慢。我的建议是使用某种索引数字列(许多表将它们作为主键),然后编写如下内容:

SELECT * FROM table WHERE num_value >= RAND() * 
    ( SELECT MAX (num_value ) FROM table ) 
ORDER BY num_value LIMIT 1

This works in logarithmic time, regardless of the table size, if num_valueis indexed. One caveat: this assumes that num_valueis equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).

无论表大小如何,如果num_value被索引,这都适用于对数时间。一个警告:这假设num_value在范围内均匀分布0..MAX(num_value)。如果您的数据集严重偏离此假设,您将得到偏斜的结果(某些行会比其他行出现得更频繁)。

回答by Matt Hamilton

I don't know how efficient this is, but I've used it before:

我不知道这有多有效,但我以前使用过它:

SELECT TOP 1 * FROM MyTable ORDER BY newid()

Because GUIDs are pretty random, the ordering means you get a random row.

因为 GUID 是非常随机的,所以排序意味着你会得到一个随机的行。

回答by Neel

ORDER BY NEWID()

takes 7.4 milliseconds

需要 7.4 milliseconds

WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)

takes 0.0065 milliseconds!

需要0.0065 milliseconds

I will definitely go with latter method.

我肯定会采用后一种方法。

回答by Jon Galloway

You didn't say which server you're using. In older versions of SQL Server, you can use this:

你没有说你使用的是哪个服务器。在旧版本的 SQL Server 中,您可以使用这个:

select top 1 * from mytable order by newid()

In SQL Server 2005 and up, you can use TABLESAMPLEto get a random sample that's repeatable:

在 SQL Server 2005 及更高版本中,您可以使用TABLESAMPLE来获取可重复的随机样本:

SELECT FirstName, LastName
FROM Contact 
TABLESAMPLE (1 ROWS) ;

回答by Rob Boek

For SQL Server

对于 SQL Server

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

newid()/order by 可以工作,但对于大型结果集来说会非常昂贵,因为它必须为每一行生成一个 id,然后对它们进行排序。

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

从性能的角度来看,TABLESAMPLE() 是好的,但是您会得到成簇的结果(将返回页面上的所有行)。

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

对于性能更好的真随机样本,最好的方法是随机过滤掉行。我在 SQL Server Books Online 文章Limiting Results Sets by Using TABLESAMPLE 中找到了以下代码示例:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

如果您真的想要单个行的随机样本,请修改您的查询以随机过滤掉行,而不是使用 TABLESAMPLE。例如,以下查询使用 NEWID 函数返回 Sales.SalesOrderDetail 表中大约百分之一的行:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

SalesOrderID 列包含在 CHECKSUM 表达式中,以便 NEWID() 每行计算一次以实现基于每行的采样。表达式 CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) 计算结果为 0 和 1 之间的随机浮点值。

When run against a table with 1,000,000 rows, here are my results:

对包含 1,000,000 行的表运行时,我的结果如下:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.

如果您可以避免使用 TABLESAMPLE,它将为您提供最佳性能。否则使用 newid()/filter 方法。如果您的结果集很大,则 newid()/order by 应该是最后的手段。

回答by ldrut

If possible, use stored statements to avoid the inefficiency of both indexes on RND() and creating a record number field.

如果可能,使用存储语句来避免 RND() 和创建记录号字段的两个索引的低效率。

PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET @n=FLOOR(RAND()*(SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING @n;

回答by Ishmaeel

Best way is putting a random value in a new column just for that purpose, and using something like this (pseude code + SQL):

最好的方法是为此目的在一个新列中放置一个随机值,并使用这样的东西(伪代码 + SQL):

randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")

This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.

这是 MediaWiki 代码采用的解决方案。当然,对于较小的值存在一些偏见,但他们发现当没有获取行时,将随机值包装为零就足够了。

newid() solution may require a full table scan so that each row can be assigned a new guid, which will be much less performant.

newid() 解决方案可能需要全表扫描,以便可以为每一行分配一个新的 guid,这会降低性能。

rand() solution may not work at all (i.e. with MSSQL) because the function will be evaluated just once, and everyrow will be assigned the same "random" number.

rand() 解决方案可能根本不起作用(即使用 MSSQL),因为该函数只会被评估一次,并且每一行都将被分配相同的“随机”数。

回答by Santiago Cepas

For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):

对于 SQL Server 2005 和 2008,如果我们想要单个行的随机样本(来自联机丛书):

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)

回答by forsberg

Insted of using RAND(), as it is not encouraged, you may simply get max ID (=Max):

由于不鼓励使用 RAND(),因此您可以简单地获得最大 ID (=Max):

SELECT MAX(ID) FROM TABLE;

get a random between 1..Max (=My_Generated_Random)

获得 1..Max (=My_Generated_Random) 之间的随机数

My_Generated_Random = rand_in_your_programming_lang_function(1..Max);

and then run this SQL:

然后运行此 SQL:

SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1

Note that it will check for any rows which Ids are EQUAL or HIGHER than chosen value. It's also possible to hunt for the row down in the table, and get an equal or lower ID than the My_Generated_Random, then modify the query like this:

请注意,它将检查 ID 等于或高于所选值的任何行。也可以在表中向下搜索行,并获得等于或低于 My_Generated_Random 的 ID,然后像这样修改查询:

SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1