Original question: http://stackoverflow.com/questions/8674718/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Best way to select random rows PostgreSQL
Asked by nanounanue
I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Accepted answer by Erwin Brandstetter
Given your specifications (plus additional info in the comments),
- You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
- Obviously no or few write operations.
- Your ID column has to be indexed! A primary key serves nicely.
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost (detailed explanation here):
SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
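The reltuples figure is only as fresh as the table's statistics. If it looks stale, a plain ANALYZE refreshes it quickly, since it samples the table rather than scanning it (this tip is mine, not part of the original answer):

ANALYZE schema_name.big;  -- update planner statistics, including reltuples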
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
,generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
1. Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.

2. Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).

3. Join the ids to the big table. This should be very fast with the index in place.

4. Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
Short version

You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
SELECT *
FROM random_pick
LIMIT 1000; -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer which defies the purpose of optimizing performance.
Duplicates are eliminated by the UNION in the rCTE.
The outer LIMIT makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT c.reltuples * _gaps
FROM pg_class c
WHERE c.oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
SELECT *
FROM random_pick
LIMIT _limit;
END
$func$ LANGUAGE plpgsql VOLATILE ROWS 1000;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE ... But that's beyond the scope of this question.
Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
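For instance, a minimal sketch of this idea, reusing the "short version" query from above (the view name random_pick_1000 is my placeholder):

CREATE MATERIALIZED VIEW random_pick_1000 AS
SELECT *
FROM (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM generate_series(1, 1100) g
   ) r
JOIN big USING (id)
LIMIT 1000;

REFRESH MATERIALIZED VIEW random_pick_1000;  -- re-randomize at intervals or events of your choosing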
Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)
Where n is a percentage. The manual:
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.
It's very fast, but the result is not exactly random. The manual again:
The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
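If hard-coding the row count bothers you, the percentage can be derived from the planner's estimate first (my sketch, assuming reasonably fresh statistics; run it and feed the result into the TABLESAMPLE clause):

SELECT 100.0 * 1000 / reltuples AS pct  -- percentage needed for ~1000 rows
FROM pg_class
WHERE oid = 'big'::regclass;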
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.

But that's still not exactly random.
Answer by A.H.
You can examine and compare the execution plan of both by using
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
A quick test on a large table¹ shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.
For large tables this might not be what you want, as even one complete table scan might take too long.
A third proposal would be
select * from table where random() < 0.01 limit 1000;
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides these considerations, you might check out the already-asked questions on this. Using the query [postgresql] random returns quite a few hits.
- quick random row selection in Postgres
- How to retrieve randomized data rows from a postgreSQL table?
- postgres: get random entries from table - too slow
And a linked article by depesz outlines several more approaches.
1"large" as in "the complete table will not fit into the memory".
1“大”,如“整个表无法放入内存”。
Answer by Eric Leschinski
postgresql order by random(), select rows in random order:
select your_columns from your_table ORDER BY random()
postgresql order by random() with a distinct:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
postgresql order by random limit one row:
select your_columns from your_table ORDER BY random() limit 1
Answer by Mickaël Le Baillif
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you 5% of the elements from mytable.
See more explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
Answer by Donald Miner
The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01;

goes record by record, and decides to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.
select * from table order by random() limit 1000;

is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).
The downside to the random() < 0.01 one is that you'll get a variable number of output records.
Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
Answer by Bogdan Surai
Here is a solution that works for me. I guess it's very simple to understand and execute.
SELECT
field_1,
field_2,
field_3,
random() as ordering
FROM
big_table
WHERE
some_conditions
ORDER BY
ordering
LIMIT 1000;
Answer by Evan Carroll
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
tsm_system_rows

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.

This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
First install the extension
CREATE EXTENSION tsm_system_rows;
Then your query,
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
Answer by Nelo Mitranim
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
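On a table this size, the count(*) subquery is itself a full scan. As a cheaper (hedged) variant, you could substitute the planner's estimate from the accepted answer; note that an overestimate can occasionally push the offset past the last row and return nothing:

select * from table_name limit 1
offset floor(random() * (select reltuples from pg_class where oid = 'table_name'::regclass));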
Answer by Raman
A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.
Say, for example, that you don't want duplicates in the randomized values that are returned. So you will need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
id_values

id | used
----+--------
1 | FALSE
2 | FALSE
3 | FALSE
4 | FALSE
5 | FALSE
...
Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.
In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 5)
RETURNING id;
Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.
With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
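The refresh mentioned above is a single statement (standard Postgres syntax; the CONCURRENTLY variant additionally requires a unique index on the view, but avoids locking out readers):

REFRESH MATERIALIZED VIEW id_values_randomized;
-- or, given a unique index on id_values_randomized (id):
REFRESH MATERIALIZED VIEW CONCURRENTLY id_values_randomized;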
Answer by user10375
One lesson from my experience:
offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.