postgresql Postgres 中的快速随机行选择

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/5297396/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-10 22:52:27  来源:igfitidea点击:

quick random row selection in Postgres

postgresqlrandom-access

提问by Juan

I have a table in postgres that contains couple of millions of rows. I have checked on the internet and I found the following

我在 postgres 中有一个包含数百万行的表。我在网上查了一下,发现了以下内容

SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;

it works, but it's really slow... is there another way to make that query, or a direct way to select a random row without reading all the table? by the way 'myid' is an integer but it can be an empty field.

它有效,但它真的很慢......是否有另一种方法来进行该查询,或者一种直接选择随机行而不读取所有表的方法?顺便说一下,'myid' 是一个整数,但它可以是一个空字段。

thanks

谢谢

回答by NPE

You might want to experiment with OFFSET, as in

您可能想尝试使用OFFSET,如

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

The Nis the number of rows in mytable. You may need to first do a SELECT COUNT(*)to figure out the value of N.

N是行数mytable。您可能需要先执行 aSELECT COUNT(*)来计算 的值N

Update(by Antony Hatchkins)

更新(安东尼·哈奇金斯)

You must use floorhere:

你必须floor在这里使用:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

Consider a table of 2 rows; random()*Ngenerates 0 <= x < 2and for example SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;returns 0 rows because of implicit rounding to nearest int.

考虑一个 2 行的表格;random()*N生成0 <= x < 2并例如SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;返回 0 行,因为隐式舍入到最近的 int。

回答by alfonx

PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE

PostgreSQL 9.5 引入了一种更快的样本选择新方法:TABLESAMPLE

The syntax is

语法是

SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);

This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.

如果您只想选择一行,这不是最佳解决方案,因为您需要知道表的 COUNT 来计算准确的百分比。

To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:

为了避免慢 COUNT 并对从 1 行到数十亿行的表使用快速 TABLESAMPLE,您可以执行以下操作:

 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
 ...

This might not look so elegant, but probably is faster than any of the other answers.

这可能看起来不那么优雅,但可能比任何其他答案都快。

To decide whether you want to use BERNULLI oder SYSTEM, read about the difference at http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/

要决定是否要使用 BERNULLI 或 SYSTEM,请阅读http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/ 上的差异

回答by John Coryat

I tried this with a subquery and it worked fine. Offset, at least in Postgresql v8.4.4 works fine.

我用子查询尝试了这个,它工作正常。偏移量,至少在 Postgresql v8.4.4 中工作正常。

select * from mytable offset random() * (select count(*) from mytable) limit 1 ;

回答by Antony Hatchkins

You need to use floor:

您需要使用floor

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;

回答by Kuberchaun

Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

查看此链接以获取一些不同的选项。 http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

Update:(A.Hatchkins)

更新:(A.哈奇金斯)

The summary of the (very) long article is as follows.

这篇(很)长的文章总结如下。

The author lists four approaches:

作者列举了四种方法:

1) ORDER BY random() LIMIT 1;-- slow

1) ORDER BY random() LIMIT 1;-- 慢

2) ORDER BY id where id>=random()*N LIMIT 1-- nonuniform if there're gaps

2) ORDER BY id where id>=random()*N LIMIT 1-- 如果有间隙,则不均匀

3) random column -- needs to be updated every now and then

3)随机列——需要不时更新

4) custom random aggregate-- cunning method, could be slow: random() needs to be generated N times

4) 自定义随机聚合——狡猾的方法,可能很慢:random() 需要生成 N 次

and suggests to improve method #2 by using

并建议通过使用改进方法#2

5) ORDER BY id where id=random()*N LIMIT 1with subsequent requeries if the result is empty.

5)ORDER BY id where id=random()*N LIMIT 1如果结果为空,则进行后续请求。

回答by Tometzky

I've came up with a very fast solution without TABLESAMPLE. Much faster than OFFSET random()*N LIMIT 1. It doesn't even require table count.

我想出了一个非常快速的解决方案,没有TABLESAMPLE. 比 快得多OFFSET random()*N LIMIT 1。它甚至不需要桌数。

The idea is to create an expression index with random but predictable data, for example md5(primary key).

这个想法是用随机但可预测的数据创建一个表达式索引,例如md5(primary key)

Here is a test with 1M rows sample data:

这是一个包含 100 万行样本数据的测试:

create table randtest (id serial primary key, data int not null);

insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);

create index randtest_md5_id_idx on randtest (md5(id::text));

explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;

Result:

结果:

 Limit  (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
   ->  Index Scan using randtest_md5_id_idx on randtest  (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
         Filter: (md5((id)::text) > md5((random())::text))
         Rows Removed by Filter: 1831
 Total runtime: 6.245 ms

This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also probabilities aren't exactly the same - some rows are more probable than others.

此查询有时(大约 1/Number_of_rows 的概率)会返回 0 行,因此需要对其进行检查并重新运行。概率也不完全相同 - 有些行比其他行更有可能。

For comparison:

比较:

explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;

Results vary widely, but can be pretty bad:

结果差异很大,但可能非常糟糕:

 Limit  (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
   ->  Seq Scan on randtest  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
 Total runtime: 179.211 ms
(3 rows)

回答by daamien

The easiest and fastest way to fetch random row is to use the tsm_system_rowsextension :

获取随机行的最简单和最快的方法是使用tsm_system_rows扩展名:

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

Then you can select the exact number of rows you want :

然后您可以选择所需的确切行数:

SELECT myid  FROM mytable TABLESAMPLE SYSTEM_ROWS(1);

This is available with PostgreSQL 9.5 and later.

这在 PostgreSQL 9.5 及更高版本中可用。

See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html

请参阅:https: //www.postgresql.org/docs/current/static/tsm-system-rows.html