postgresql Postgres 中的快速随机行选择
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/5297396/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
quick random row selection in Postgres
提问by Juan
I have a table in postgres that contains couple of millions of rows. I have checked on the internet and I found the following
我在 postgres 中有一个包含数百万行的表。我在网上查了一下,发现了以下内容
SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;
it works, but it's really slow... is there another way to make that query, or a direct way to select a random row without reading all the table? by the way 'myid' is an integer but it can be an empty field.
它有效,但它真的很慢......是否有另一种方法来进行该查询,或者一种直接选择随机行而不读取所有表的方法?顺便说一下,'myid' 是一个整数,但它可以是一个空字段。
thanks
谢谢
回答by NPE
You might want to experiment with OFFSET
, as in
您可能想尝试使用OFFSET
,如
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
The N
is the number of rows in mytable
. You may need to first do a SELECT COUNT(*)
to figure out the value of N
.
该N
是行数mytable
。您可能需要先执行 aSELECT COUNT(*)
来计算 的值N
。
Update(by Antony Hatchkins)
更新(安东尼·哈奇金斯)
You must use floor
here:
你必须floor
在这里使用:
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
Consider a table of 2 rows; random()*N
generates 0 <= x < 2
and for example SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;
returns 0 rows because of implicit rounding to nearest int.
考虑一个 2 行的表格;random()*N
生成0 <= x < 2
并例如SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;
返回 0 行,因为隐式舍入到最近的 int。
回答by alfonx
PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE
PostgreSQL 9.5 引入了一种更快的样本选择新方法:TABLESAMPLE
The syntax is
语法是
SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);
This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.
如果您只想选择一行,这不是最佳解决方案,因为您需要知道表的 COUNT 来计算准确的百分比。
To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:
为了避免慢 COUNT 并对从 1 行到数十亿行的表使用快速 TABLESAMPLE,您可以执行以下操作:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
...
This might not look so elegant, but probably is faster than any of the other answers.
这可能看起来不那么优雅,但可能比任何其他答案都快。
To decide whether you want to use BERNULLI oder SYSTEM, read about the difference at http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
要决定是否要使用 BERNULLI 或 SYSTEM,请阅读http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/ 上的差异
回答by John Coryat
I tried this with a subquery and it worked fine. Offset, at least in Postgresql v8.4.4 works fine.
我用子查询尝试了这个,它工作正常。偏移量,至少在 Postgresql v8.4.4 中工作正常。
select * from mytable offset random() * (select count(*) from mytable) limit 1 ;
回答by Antony Hatchkins
You need to use floor
:
您需要使用floor
:
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
回答by Kuberchaun
Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
查看此链接以获取一些不同的选项。 http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
Update:(A.Hatchkins)
更新:(A.哈奇金斯)
The summary of the (very) long article is as follows.
这篇(很)长的文章总结如下。
The author lists four approaches:
作者列举了四种方法:
1) ORDER BY random() LIMIT 1;
-- slow
1) ORDER BY random() LIMIT 1;
-- 慢
2) ORDER BY id where id>=random()*N LIMIT 1
-- nonuniform if there're gaps
2) ORDER BY id where id>=random()*N LIMIT 1
-- 如果有间隙,则不均匀
3) random column -- needs to be updated every now and then
3)随机列——需要不时更新
4) custom random aggregate-- cunning method, could be slow: random() needs to be generated N times
4) 自定义随机聚合——狡猾的方法,可能很慢:random() 需要生成 N 次
and suggests to improve method #2 by using
并建议通过使用改进方法#2
5) ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.
5)ORDER BY id where id=random()*N LIMIT 1
如果结果为空,则进行后续请求。
回答by Tometzky
I've came up with a very fast solution without TABLESAMPLE
. Much faster than OFFSET random()*N LIMIT 1
. It doesn't even require table count.
我想出了一个非常快速的解决方案,没有TABLESAMPLE
. 比 快得多OFFSET random()*N LIMIT 1
。它甚至不需要桌数。
The idea is to create an expression index with random but predictable data, for example md5(primary key)
.
这个想法是用随机但可预测的数据创建一个表达式索引,例如md5(primary key)
。
Here is a test with 1M rows sample data:
这是一个包含 100 万行样本数据的测试:
create table randtest (id serial primary key, data int not null);
insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);
create index randtest_md5_id_idx on randtest (md5(id::text));
explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;
Result:
结果:
Limit (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
-> Index Scan using randtest_md5_id_idx on randtest (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
Filter: (md5((id)::text) > md5((random())::text))
Rows Removed by Filter: 1831
Total runtime: 6.245 ms
This query can sometimes (with about 1/Number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also probabilities aren't exactly the same - some rows are more probable than others.
此查询有时(大约 1/Number_of_rows 的概率)会返回 0 行,因此需要对其进行检查并重新运行。概率也不完全相同 - 有些行比其他行更有可能。
For comparison:
比较:
explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;
Results vary widely, but can be pretty bad:
结果差异很大,但可能非常糟糕:
Limit (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
-> Seq Scan on randtest (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
Total runtime: 179.211 ms
(3 rows)
回答by daamien
The easiest and fastest way to fetch random row is to use the tsm_system_rows
extension :
获取随机行的最简单和最快的方法是使用tsm_system_rows
扩展名:
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
Then you can select the exact number of rows you want :
然后您可以选择所需的确切行数:
SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);
This is available with PostgreSQL 9.5 and later.
这在 PostgreSQL 9.5 及更高版本中可用。
See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
请参阅:https: //www.postgresql.org/docs/current/static/tsm-system-rows.html