Best way to select random rows PostgreSQL

Warning: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8674718/

Tags: sql, performance, postgresql, random

Asked by nanounanue

I want a random selection of rows in PostgreSQL, I tried this:

select * from table where random() < 0.01;

But some others recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want this to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

Accepted answer by Erwin Brandstetter

Given your specifications (plus additional info in the comments),

  • You have a numeric ID column (integer numbers) with only a few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost (detailed explanation here):

SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
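
The reltuples estimate is only as fresh as the table's last ANALYZE (or VACUUM / autovacuum run), so if the table has changed substantially, updating statistics first keeps the estimate honest - a quick sketch, assuming the table is named big:

ANALYZE big;   -- refresh planner statistics so reltuples is reasonably current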

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
    SELECT 1       AS min_id           -- minimum id <= current min id
         , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
    SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
    FROM   params p
          ,generate_series(1, 1100) g  -- 1000 + buffer
    GROUP  BY 1                        -- trim duplicates
    ) r
JOIN   big USING (id)
LIMIT  1000;                           -- trim surplus
  • Generate random numbers in the id space. You have "few gaps", so add 10% (enough to easily cover the blanks) to the number of rows to retrieve.

  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).

  • Join the ids to the big table. This should be very fast with the index in place.

  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
    SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
    FROM   generate_series(1, 1100) g
    ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
SELECT *
FROM   random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT c.reltuples * _gaps
      FROM   pg_class c
      WHERE  c.oid = 'big'::regclass);
BEGIN

   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   SELECT *
   FROM   random_pick
   LIMIT  _limit;
END
$func$  LANGUAGE plpgsql VOLATILE ROWS 1000;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic types and use EXECUTE ... But that's beyond the scope of this question.

Possible alternative

可能的选择

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
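
A minimal sketch of that idea, assuming the f_random_sample() function defined above (the view name random_sample is hypothetical):

CREATE MATERIALIZED VIEW random_sample AS
SELECT * FROM f_random_sample(1000);      -- query runs once, result is stored

REFRESH MATERIALIZED VIEW random_sample;  -- re-randomize at intervals or events of your choosing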

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

SYSTEM方法比BERNOULLI指定小抽样百分比时的方法快得多,但由于聚类效应,它可能返回表的随机样本较少。

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
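
As a side note, a TABLESAMPLE clause can also take REPEATABLE (seed) to make the block-level sample reproducible across queries (as long as the table does not change) - for example:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0) REPEATABLE (42);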

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);

See Evan's answer for details.

But that's still not exactly random.
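
If uniformity matters more than raw speed, the BERNOULLI method quoted from the manual above samples individual rows instead of whole blocks - a sketch using the same percentage arithmetic as before:

SELECT * FROM big TABLESAMPLE BERNOULLI ((1000 * 100) / 5100000.0);  -- reads the whole table, but avoids clustering effects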

Answer by A.H.

You can examine and compare the execution plan of both by using

EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
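
To compare actual run times and buffer usage rather than just the plans, EXPLAIN ANALYZE can be used instead - note that it really executes the queries:

EXPLAIN (ANALYZE, BUFFERS) select * from table where random() < 0.01;
EXPLAIN (ANALYZE, BUFFERS) select * from table order by random() limit 1000;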

A quick test on a large table [1] shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan might take too long.

A third proposal would be

select * from table where random() < 0.01 limit 1000;

This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.

Edit: Besides these considerations, you might check out the questions already asked on this. Searching for [postgresql] random returns quite a few hits.

And a linked article by depesz outlines several more approaches.

1"large" as in "the complete table will not fit into the memory".

Answer by Eric Leschinski

postgresql order by random(), select rows in random order:

select your_columns from your_table ORDER BY random()

postgresql order by random() with DISTINCT:

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

postgresql order by random(), limit one row:

select your_columns from your_table ORDER BY random() limit 1

Answer by Mickaël Le Baillif

Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:

SELECT * FROM mytable TABLESAMPLE SYSTEM (5);

This example will give you 5% of elements from mytable.

See more explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html

Answer by Donald Miner

The one with the ORDER BY is going to be the slower one.

select * from table where random() < 0.01; goes record by record, and decides randomly whether to filter it or not. This is going to be O(N) because it only needs to check each record once.

select * from table order by random() limit 1000; is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the ORDER BY is O(N * log N).

The downside to the random() < 0.01 one is that you'll get a variable number of output records.



Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though; a sketch follows below.
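
For illustration only, here is a minimal plpgsql sketch of a Fisher-Yates shuffle over an integer array; fisher_yates is a hypothetical helper, and this only makes sense for sets that fit comfortably in memory:

CREATE OR REPLACE FUNCTION fisher_yates(arr int[])
  RETURNS int[] AS
$func$
DECLARE
   j   int;
   tmp int;
BEGIN
   IF arr IS NULL OR array_length(arr, 1) IS NULL THEN
      RETURN arr;                          -- nothing to shuffle
   END IF;
   -- walk backwards, swapping each element with a random one at or before it
   FOR i IN REVERSE array_length(arr, 1) .. 2 LOOP
      j := 1 + floor(random() * i)::int;   -- uniform pick from positions 1..i
      tmp    := arr[i];
      arr[i] := arr[j];
      arr[j] := tmp;
   END LOOP;
   RETURN arr;
END
$func$ LANGUAGE plpgsql VOLATILE;

-- example call: SELECT fisher_yates(ARRAY[1,2,3,4,5]);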

Answer by Bogdan Surai

Here is a solution that works for me. I guess it's very simple to understand and execute.

SELECT 
  field_1, 
  field_2, 
  field_3, 
  random() as ordering
FROM 
  big_table
WHERE 
  some_conditions
ORDER BY
  ordering 
LIMIT 1000;

Answer by Evan Carroll

select * from table order by random() limit 1000;

If you know how many rows you want, check out tsm_system_rows.

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.

This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension

CREATE EXTENSION tsm_system_rows;

Then your query,

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);

Answer by Nelo Mitranim

If you want just one row, you can use a calculated offset derived from count.

select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
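
On a big table, the count(*) alone is a full scan, and the OFFSET still walks past the skipped rows, so this stays O(N). A hedged variant that at least replaces the exact count with the planner's estimate (assuming statistics are reasonably fresh):

select * from table_name limit 1
offset floor(random() * (select reltuples from pg_class
                         where oid = 'table_name'::regclass))::int;
-- may return no row if the estimate overshoots the real row count; retry in that case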

Answer by Raman

A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.

Say, for example, that you don't want duplicates in the randomized values that are returned. So you will need to set a boolean value on the primary table containing your (non-randomized) set of values.

Assuming this is the input table:

id_values
 id  |  used
-----+--------
  1  |  FALSE
  2  |  FALSE
  3  |  FALSE
  4  |  FALSE
  5  |  FALSE
 ...

Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:

CREATE MATERIALIZED VIEW id_values_randomized AS
  SELECT id
  FROM id_values
  ORDER BY random();

Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.

In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:

UPDATE id_values
SET used = TRUE
WHERE id_values.id IN 
  (SELECT i.id
    FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
    WHERE (NOT i.used)
    LIMIT 5)
RETURNING id;

Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
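
"Proper indexes" could include, for example, a partial index matching the NOT used filter, and the refresh itself is a single statement - both are sketches, with names to adapt as needed:

CREATE INDEX id_values_unused_idx ON id_values (id) WHERE NOT used;  -- serves the WHERE (NOT i.used) filter

REFRESH MATERIALIZED VIEW id_values_randomized;                      -- re-randomize, e.g. at off-peak times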

Answer by user10375

One lesson from my experience:

offset floor(random() * N) limit 1 is not faster than order by random() limit 1.

I thought the offset approach would be faster because it would save the time of sorting in Postgres. It turns out it wasn't.
