MySQL: Simple Random Samples from a SQL Database

Note: the question and answers below are taken from Stack Overflow and are provided under the CC BY-SA 4.0 license. You are free to use and share them, but you must attribute them to the original authors (not me). Original question: http://stackoverflow.com/questions/249301/

Simple Random Samples from a SQL database

Tags: mysql, sql, postgresql, random, random-sample

Asked by ojrac

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, if you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.
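
For reference, a minimal SQL Server sketch of that variant; the table name is a placeholder, since the question doesn't give one:

SELECT TOP 10000 *           -- T-SQL has no LIMIT; TOP caps the row count
FROM my_table                -- placeholder table name
ORDER BY NEWID();            -- NEWID() gives every row its own random sort key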

EDIT: 5 YEARS LATER

I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:

  • Sample the rows to 2-5x my desired sample size, to cheaply ORDER BY RAND()
  • Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)
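
A rough sketch of the second tweak, assuming a hypothetical my_table (this is an illustration, not the poster's actual schema); triggers keep the column populated on writes:

-- add an indexed column that freezes one random value per row
ALTER TABLE my_table ADD COLUMN frozen_rand DOUBLE NOT NULL DEFAULT 0;
CREATE INDEX idx_frozen_rand ON my_table (frozen_rand);

-- backfill existing rows once
UPDATE my_table SET frozen_rand = RAND();

-- keep it fresh on inserts and updates
CREATE TRIGGER my_table_rand_ins BEFORE INSERT ON my_table
FOR EACH ROW SET NEW.frozen_rand = RAND();

CREATE TRIGGER my_table_rand_upd BEFORE UPDATE ON my_table
FOR EACH ROW SET NEW.frozen_rand = RAND();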

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the frozen_rand column:

SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

  SELECT *
    FROM table
   WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.
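
For concreteness, here is one way the window bounds could be chosen. This is only a sketch that ignores the wrap-around handling mentioned above; my_table and the roughly 10x oversampling factor are assumptions:

-- pick a random frozen_rand window expected to hold ~10,000 rows,
-- then let ORDER BY RAND() finish the job on that small remainder
SET @n     = (SELECT COUNT(*) FROM my_table);
SET @width = LEAST(1.0, 10000 / @n);     -- expected fraction of rows in the window
SET @lo    = RAND() * (1 - @width);
SET @hi    = @lo + @width;

SELECT *
  FROM my_table
 WHERE frozen_rand BETWEEN @lo AND @hi
ORDER BY RAND() LIMIT 1000;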

Accepted answer by user12861

There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

I think that, with absolutely no assumptions about the table, your O(n lg n) solution is the best. Though actually, with a good optimizer or a slightly different technique, the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necessarily have to sort the whole large array; it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.

Three assumptions we might try out:

  1. there is a unique, indexed, primary key in the table

  2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)

  3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have mysql or anything in front of me right now, so in slightly pseudocode this would look something like:

create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)

-- generate m random keys between 1 and n
for i = 1 to m
  insert RandomKeysAttempt select rand()*n + 1

-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt

-- as long as we don't have enough, keep generating new keys,
-- with luck (and m much less than n), this won't be necessary
while count(RandomKeys) < m
  NextAttempt = rand()*n + 1
  if not exists (select * from RandomKeys where RandomKey = NextAttempt)
    insert RandomKeys select NextAttempt

-- get our random rows
select *
from RandomKeys r
join table t ON r.RandomKey = t.UniqueKey

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
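
If you would rather stay inside MySQL, a recursive CTE (MySQL 8+) can play the role of that loop. This is only a sketch under the same assumptions: a gap-free integer primary key id, roughly 200,000 rows, and 10,000 rows wanted; my_table is a placeholder:

-- oversample 2x, dedupe, then join back on the primary key
SET SESSION cte_max_recursion_depth = 100000;   -- the default of 1000 is too low here

WITH RECURSIVE attempts (i, k) AS (
    SELECT 1, FLOOR(1 + RAND() * 200000)
    UNION ALL
    SELECT i + 1, FLOOR(1 + RAND() * 200000)
    FROM attempts
    WHERE i < 20000                             -- 2x the desired 10,000 keys
)
SELECT t.*
FROM (SELECT DISTINCT CAST(k AS UNSIGNED) AS k FROM attempts LIMIT 10000) AS r
JOIN my_table AS t ON t.id = r.k;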

Answered by ignorant

I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It will create a random number for each row. The number is between 0 and 1
  • It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.
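
Applied to the numbers in the question (roughly 10,000 out of 200,000 rows), the fraction would be 10000 / 200000 = 0.05; a small margin plus a LIMIT keeps the result near the target size. A sketch, with my_table as a placeholder:

SELECT *
FROM my_table
WHERE RAND() <= 0.06     -- 0.05 plus a little headroom, since the pass rate is only approximate
LIMIT 10000;             -- cap the result at the desired sample size

Dropping the LIMIT avoids favoring rows that happen to come earlier in the scan, at the cost of a slightly variable sample size.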

I saw that someone had recommended that solution and they got shot down without proof... here is what I would say to that:

  • This is O(n), but no sorting is required, so it is faster than the O(n lg n) approach
  • MySQL is very capable of generating random numbers for each row. Try this:

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is MySQL, this is the right solution.

Answered by Muposat

Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(), hence it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version -- I did not test this

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= RAND()

MSSQL version:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need an exact percentage or number of records to be selected, estimate your percentage with some safety margin, then randomly pluck the excess records from the resulting set, using the more expensive ORDER BY RAND() method.

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can be many orders of magnitude smaller than your full data set.
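
A rough illustration of the interval idea, assuming a hypothetical my_table with an indexed integer column id spread fairly evenly over [0, @max]; the interval starts would normally be generated in application code, with a fresh set for every query:

-- grab a few narrow, random ranges via the index instead of touching the whole table
SET @max    = (SELECT MAX(id) FROM my_table);
SET @start1 = FLOOR(RAND() * @max);
SET @start2 = FLOOR(RAND() * @max);

SELECT *
FROM my_table
WHERE id BETWEEN @start1 AND @start1 + 100
   OR id BETWEEN @start2 AND @start2 + 100;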

In my test I reduced the time needed to get 20 (out of 20 million) sample records from 3 mins using ORDER BY RAND() down to 0.0 seconds!

Answered by David F Mayer

Just use

WHERE RAND() < 0.1 

to get 10% of the records or

WHERE RAND() < 0.01 

to get 1% of the records, etc.

Answered by gatoatigrado

Apparently in some versions of SQL there's a TABLESAMPLE command, but it's not in all SQL implementations (notably, Redshift).

http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
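
For reference, examples of the syntax where it does exist (untested here; Sales.SalesOrderDetail is the sample table used above, my_table is a placeholder):

-- SQL Server: samples roughly 1 percent of the table's pages
SELECT * FROM Sales.SalesOrderDetail TABLESAMPLE (1 PERCENT);

-- PostgreSQL 9.5+: SYSTEM is page-level, BERNOULLI is per-row
SELECT * FROM my_table TABLESAMPLE SYSTEM (1);
SELECT * FROM my_table TABLESAMPLE BERNOULLI (1);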

Answered by gazzman

I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.

If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.
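
As a rough MySQL 8 illustration of the difference (this is not the linked answer itself): draw the keys independently and keep any duplicates when joining back, again assuming a gap-free integer primary key id on a hypothetical my_table with about 200,000 rows:

SET SESSION cte_max_recursion_depth = 100000;

WITH RECURSIVE draws (i, k) AS (
    SELECT 1, FLOOR(1 + RAND() * 200000)
    UNION ALL
    SELECT i + 1, FLOOR(1 + RAND() * 200000)
    FROM draws
    WHERE i < 10000
)
SELECT t.*
FROM draws AS d
JOIN my_table AS t ON t.id = d.k;   -- repeated keys in draws yield repeated rows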

Answered by KitKat

Starting with the observation that we can retrieve the ids of a table (e.g. count 5) based on a set:

select *
from table_name
where _id in (4, 1, 2, 5, 3)

we can conclude that if we could generate the string "(4, 1, 2, 5, 3)", then we would have a more efficient way than RAND().

For example, in Java:

ArrayList<Integer> indices = new ArrayList<Integer>(rowsCount);
for (int i = 0; i < rowsCount; i++) {
    indices.add(i);                    // fill with every candidate id: 0 .. rowsCount-1
}
Collections.shuffle(indices);          // random permutation of the ids
// "[4, 1, 2, 5, 3]" becomes "(4, 1, 2, 5, 3)", ready for an IN (...) clause
String inClause = indices.toString().replace('[', '(').replace(']', ')');

If the ids have gaps, then the initial ArrayList indices is the result of an SQL query on the ids.

Answered by concat

If you need exactly m rows, realistically you'll generate your subset of IDs outside of SQL. Most methods require at some point to select the "nth" entry, and SQL tables are really not arrays at all. The assumption that the keys are consecutive in order to just join random ints between 1 and the count is also difficult to satisfy — MySQL for example doesn't support it natively, and the lock conditions are... tricky.

Here's an O(max(n, m lg n))-time, O(n)-space solution assuming just plain BTREE keys:

  1. Fetch all values of the key column of the data table in any order into an array in your favorite scripting language in O(n)
  2. Perform a Fisher-Yates shuffle, stopping after m swaps, and extract the subarray [0:m-1] in Θ(m)
  3. "Join" the subarray with the original dataset (e.g. SELECT ... WHERE id IN (<subarray>)) in O(m lg n)

Any method that generates the random subset outside of SQL must have at least this complexity. The join can't be any faster than O(m lg n) with BTREE (so O(m) claims are fantasy for most engines) and the shuffle is bounded below n and m lg n and doesn't affect the asymptotic behavior.

In Pythonic pseudocode:

from random import random

ids = [row[0] for row in sql.query('SELECT id FROM t')]    # step 1: fetch all keys, O(n)

# step 2: partial Fisher-Yates shuffle, stopping after m swaps
for i in range(m):
    r = int(random() * (len(ids) - i))
    ids[i], ids[i + r] = ids[i + r], ids[i]

# step 3: join the first m shuffled keys back to the table
results = sql.query('SELECT * FROM t WHERE id IN (%s)'
                    % ', '.join(str(k) for k in ids[:m]))

Answered by Odysseus Ithaca

Select 3000 random records in Netezza:

WITH IDS AS (
     SELECT ID
     FROM MYTABLE
)

SELECT ID FROM IDS ORDER BY mt_random() LIMIT 3000

Answered by staticsan

Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)
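
Note: outside of prepared statements and stored programs, MySQL only accepts integer literals in LIMIT and OFFSET, so making this runnable would need something like the sketch below (my_table is a placeholder). Also note it returns 10,000 consecutive rows starting at a random offset rather than 10,000 independently chosen rows.

-- pass the random offset through a prepared-statement placeholder
SET @off = CAST(FLOOR(RAND() * 190000) AS UNSIGNED);
PREPARE stmt FROM 'SELECT * FROM my_table LIMIT 10000 OFFSET ?';
EXECUTE stmt USING @off;
DEALLOCATE PREPARE stmt;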