postgresql max() 与 ORDER BY DESC + LIMIT 1 的性能
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/34246403/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Performance of max() vs ORDER BY DESC + LIMIT 1
提问by Geotob
I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:
我今天正在对一些慢速 SQL 查询进行故障排除,但不太明白下面的性能差异:
When trying to extract the max(timestamp)
from a data table based on some condition, using MAX()
is slower than ORDER BY timestamp LIMIT 1
if a matching row exists, but considerably faster if no matching row is found.
当尝试max(timestamp)
根据某些条件从数据表中提取 时,使用MAX()
比ORDER BY timestamp LIMIT 1
如果存在匹配行要慢,但如果没有找到匹配行,则速度要快得多。
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)
Time: 1314.544 ms
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)
Time: 10.890 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms
There are indexes on (timestamp)
and (sensor_id, timestamp)
, and I noticed that Postgres uses very different query plans and indexes for both cases:
上有索引(timestamp)
和(sensor_id, timestamp)
,和我注意到,Postgres使用这两种情况下非常不同的查询计划和索引:
QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
Limit (cost=0.43..9.47 rows=1 width=8)
-> Nested Loop (cost=0.43..396254.63 rows=43823 width=8)
Join Filter: (data.sensor_id = sensors.id)
-> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12)
-> Materialize (cost=0.00..6.70 rows=2 width=4)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
(7 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3680.59..3680.60 rows=1 width=8)
-> Nested Loop (cost=0.43..3571.03 rows=43823 width=8)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12)
Index Cond: (sensor_id = sensors.id)
(6 rows)
So my two questions are:
所以我的两个问题是:
- Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
- Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an
EXISTS
check?
- 这种性能差异从何而来?我已经在MIN/MAX 与 ORDER BY 和 LIMIT 中看到了公认的答案,但这似乎并不适用于此。任何好的资源将不胜感激。
- 是否有比添加
EXISTS
检查更好的方法来提高所有情况下的性能(匹配行与不匹配行)?
EDITto address the questions in the comments below. I kept the initial query plans above for future reference:
编辑以解决下面评论中的问题。我保留了上面的初始查询计划以供将来参考:
Table definitions:
表定义:
Table "public.sensors"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('sensors_id_seq'::regclass)
station_id | integer | not null
....
Indexes:
"sensor_primary" PRIMARY KEY, btree (id)
"ind_station_id" btree (station_id, id)
"ind_station" btree (station_id)
Table "public.data"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------------
id | integer | not null default nextval('data_id_seq'::regclass)
timestamp | timestamp with time zone | not null
sensor_id | integer | not null
avg | integer |
Indexes:
"timestamp_ind" btree ("timestamp" DESC)
"sensor_ind" btree (sensor_id)
"sensor_ind_timestamp" btree (sensor_id, "timestamp")
"sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
Note that I added ind_station_id
on sensors
just now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms
in the ORDER BY DESC + LIMIT 1
case and ~0.9ms
in the MAX
case.
请注意,我刚刚在下面@Erwin 的建议之后添加ind_station_id
了sensors
。计时还没有真正改变厉害,还是>1200ms
在ORDER BY DESC + LIMIT 1
案件和~0.9ms
在MAX
情况。
Query Plans:
查询计划:
QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
Buffers: shared hit=3418066 read=47326
-> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
Join Filter: (data.sensor_id = sensors.id)
Buffers: shared hit=3418066 read=47326
-> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
Buffers: shared hit=3418065 read=47326
-> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)
QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
Buffers: shared hit=1
-> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed)
Index Cond: (sensor_id = sensors.id)
Heap Fetches: 0
Planning time: 0.435 ms
Execution time: 0.048 ms
(13 rows)
So just like in the earlier explains, ORDER BY
does a Scan using timestamp_in on data
, which is not done in the MAX
case.
所以就像前面解释的那样,ORDER BY
做 a Scan using timestamp_in on data
,在这种MAX
情况下没有做。
Postgres version:
Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
Postgres 版本:来自 Ubuntu 存储库的 Postgres: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
Note that there are NOT NULL
constraints in place, so ORDER BY
won't have to sort over empty rows.
请注意,存在NOT NULL
约束,因此ORDER BY
不必对空行进行排序。
Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms)
and then SELECT (~11ms)
.
另请注意,我对差异的来源很感兴趣。虽然不理想,但我可以使用EXISTS (<1ms)
和 然后相对快速地检索数据SELECT (~11ms)
。
回答by Erwin Brandstetter
There does not seem to be an index on sensor.station_id
, which is most probably important here.
上似乎没有索引sensor.station_id
,这在这里很可能很重要。
There is an actual differencebetween max()
and ORDER BY DESC + LIMIT 1
. Many people seem to miss that. NULL values sort firstin descending sort order. So ORDER BY timestamp DESC LIMIT 1
returns a row with timestamp IS NULL
if it exists, while the aggregate function max()
ignoresNULL values and returns the latest non-null timestamp.
和之间存在实际差异。很多人似乎都错过了这一点。NULL 值首先按降序排序。因此,如果存在则返回一行,而聚合函数忽略NULL 值并返回最新的非空时间戳。max()
ORDER BY DESC + LIMIT 1
ORDER BY timestamp DESC LIMIT 1
timestamp IS NULL
max()
For your case, since your column d.timestamp
is defined NOT NULL
(as your update revealed), there is no effective difference. An index with DESC NULLS LAST
and the same clause in the ORDER BY
for the LIMIT
query should still serve you best. I suggest these indexes(my query below builds on the 2nd one):
对于您的情况,由于您的列d.timestamp
已定义NOT NULL
(如您的更新所示),因此没有有效差异。与索引DESC NULLS LAST
,并在同一条款ORDER BY
的LIMIT
查询还是应该全力为您服务。我建议使用这些索引(我下面的查询建立在第二个索引上):
sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
You can drop the other index variants and sensor_ind_timestamp
unless you have other queries that still require them (unlikely, but possible).sensor_ind_timestamp_desc
您可以删除其他索引变体,sensor_ind_timestamp
除非您有其他仍然需要它们的查询(不太可能,但可能)。sensor_ind_timestamp_desc
Much more importantly, there is another difficulty: The filter on the first table sensors
returns few, but still (possibly) multiple rows. Postgres expectsto find 2 rows (rows=2
) in your added EXPLAIN
output.
The perfect technique would be a loose index scanfor the second table data
- which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details:
更重要的是,还有另一个困难:第一个表上的过滤器sensors
返回的很少,但仍然(可能)多行。Postgres希望rows=2
在您添加的EXPLAIN
输出中找到 2 行 ( ) 。
完美的技术是对第二个表进行松散索引扫描data
- 目前尚未在 Postgres 9.4(或 Postgres 9.5)中实现。您可以通过多种方式重写查询以解决此限制。细节:
The best should be:
最好的应该是:
SELECT d.timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4
ORDER BY d.timestamp DESC NULLS LAST
LIMIT 1;
Since the style of outer query is mostly irrelevant, you can also just:
由于外部查询的样式大多无关,您也可以:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT timestamp
FROM data
WHERE sensor_id = s.id
ORDER BY timestamp DESC NULLS LAST
LIMIT 1
) d
WHERE s.station_id = 4;
And the max()
variant should perform about as fast now:
并且该max()
变体现在应该执行得差不多快:
SELECT max(d.timestamp) AS timestamp
FROM sensors s
CROSS JOIN LATERAL (
SELECT max(timestamp) AS timestamp
FROM data
WHERE sensor_id = s.id
) d
WHERE s.station_id = 4;
Or even, shortestof all:
甚至,最短的:
SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM sensors s
WHERE station_id = 4;
Note the double parentheses!
注意双括号!
The additional advantage of LIMIT
in a LATERAL
join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).
额外的优势LIMIT
在LATERAL
加入的是,你可以检索选定行的任意列,而不仅仅是最新的时间戳(一列)。
Related:
有关的:
回答by Andomar
The query plan shows index names timestamp_ind
and timestamp_sensor_ind
. But indexes like that do not help with a search for a particular sensor.
查询计划显示索引名称timestamp_ind
和timestamp_sensor_ind
. 但是这样的索引对搜索特定传感器没有帮助。
To resolve an equals query (like sensor.id = data.sensor_id
) the column has to be the first in the index. Try to add an index that allows searching on sensor_id
and, within a sensor, is sorted by timestamp:
要解析等于查询(如sensor.id = data.sensor_id
),该列必须是索引中的第一个。尝试添加允许搜索的索引,sensor_id
并在传感器内按时间戳排序:
create index sensor_timestamp_ind on data(sensor_id, timestamp);
Does adding that index speed up the query?
添加该索引是否会加快查询速度?