postgresql max() 与 ORDER BY DESC + LIMIT 1 的性能

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/34246403/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-21 02:06:44  来源:igfitidea点击:

Performance of max() vs ORDER BY DESC + LIMIT 1

sqlpostgresqlmaxaggregatesql-limit

提问by Geotob

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:

我今天正在对一些慢速 SQL 查询进行故障排除,但不太明白下面的性能差异:

When trying to extract the max(timestamp)from a data table based on some condition, using MAX()is slower than ORDER BY timestamp LIMIT 1if a matching row exists, but considerably faster if no matching row is found.

当尝试max(timestamp)根据某些条件从数据表中提取 时,使用MAX()ORDER BY timestamp LIMIT 1如果存在匹配行要慢,但如果没有找到匹配行,则速度要快得多。

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms 

There are indexes on (timestamp)and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:

上有索引(timestamp)(sensor_id, timestamp),和我注意到,Postgres使用这两种情况下非常不同的查询计划和索引:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

So my two questions are:

所以我的两个问题是:

  1. Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
  2. Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTScheck?
  1. 这种性能差异从何而来?我已经在MIN/MAX 与 ORDER BY 和 LIMIT 中看到了公认的答案,但这似乎并不适用于此。任何好的资源将不胜感激。
  2. 是否有比添加EXISTS检查更好的方法来提高所有情况下的性能(匹配行与不匹配行)?

EDITto address the questions in the comments below. I kept the initial query plans above for future reference:

编辑以解决下面评论中的问题。我保留了上面的初始查询计划以供将来参考:

Table definitions:

表定义:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I added ind_station_idon sensorsjust now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200msin the ORDER BY DESC + LIMIT 1case and ~0.9msin the MAXcase.

请注意,我刚刚在下面@Erwin 的建议之后添加ind_station_idsensors。计时还没有真正改变厉害,还是>1200msORDER BY DESC + LIMIT 1案件和~0.9msMAX情况。

Query Plans:

查询计划:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So just like in the earlier explains, ORDER BYdoes a Scan using timestamp_in on data, which is not done in the MAXcase.

所以就像前面解释的那样,ORDER BY做 a Scan using timestamp_in on data,在这种MAX情况下没有做。

Postgres version: Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Postgres 版本:来自 Ubuntu 存储库的 Postgres: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that there are NOT NULLconstraints in place, so ORDER BYwon't have to sort over empty rows.

请注意,存在NOT NULL约束,因此ORDER BY不必对空行进行排序。

Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms)and then SELECT (~11ms).

另请注意,我对差异的来源很感兴趣。虽然不理想,但我可以使用EXISTS (<1ms)和 然后相对快速地检索数据SELECT (~11ms)

回答by Erwin Brandstetter

There does not seem to be an index on sensor.station_id, which is most probably important here.

上似乎没有索引sensor.station_id,这在这里很可能很重要。

There is an actual differencebetween max()and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort firstin descending sort order. So ORDER BY timestamp DESC LIMIT 1returns a row with timestamp IS NULLif it exists, while the aggregate function max()ignoresNULL values and returns the latest non-null timestamp.

和之间存在实际差异。很多人似乎都错过了这一点。NULL 值首先按降序排序。因此,如果存在则返回一行,而聚合函数忽略NULL 值并返回最新的非空时间戳。max()ORDER BY DESC + LIMIT 1ORDER BY timestamp DESC LIMIT 1timestamp IS NULLmax()

For your case, since your column d.timestampis defined NOT NULL(as your update revealed), there is no effective difference. An index with DESC NULLS LASTand the same clause in the ORDER BYfor the LIMITquery should still serve you best. I suggest these indexes(my query below builds on the 2nd one):

对于您的情况,由于您的列d.timestamp已定义NOT NULL(如您的更新所示),因此没有有效差异。与索引DESC NULLS LAST,并在同一条款ORDER BYLIMIT查询还是应该全力为您服务。我建议使用这些索引(我下面的查询建立在第二个索引上):

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

You can drop the other index variants sensor_ind_timestampand sensor_ind_timestamp_descunless you have other queries that still require them (unlikely, but possible).

您可以删除其他索引变体sensor_ind_timestampsensor_ind_timestamp_desc除非您有其他仍然需要它们的查询(不太可能,但可能)。

Much more importantly, there is another difficulty: The filter on the first table sensorsreturns few, but still (possibly) multiple rows. Postgres expectsto find 2 rows (rows=2) in your added EXPLAINoutput.
The perfect technique would be a loose index scanfor the second table data- which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details:

更重要的是,还有另一个困难:第一个表上的过滤器sensors返回的很少,但仍然(可能)多行。Postgres希望rows=2在您添加的EXPLAIN输出中找到 2 行 ( ) 。
完美的技术是对第二个表进行松散索引扫描data- 目前尚未在 Postgres 9.4(或 Postgres 9.5)中实现。您可以通过多种方式重写查询以解决此限制。细节:

The best should be:

最好的应该是:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

Since the style of outer query is mostly irrelevant, you can also just:

由于外部查询的样式大多无关,您也可以:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

And the max()variant should perform about as fast now:

并且该max()变体现在应该执行得差不多快:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even, shortestof all:

甚至,最短的:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

注意双括号!

The additional advantage of LIMITin a LATERALjoin is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).

额外的优势LIMITLATERAL加入的是,你可以检索选定行的任意列,而不仅仅是最新的时间戳(一列)。

Related:

有关的:

回答by Andomar

The query plan shows index names timestamp_indand timestamp_sensor_ind. But indexes like that do not help with a search for a particular sensor.

查询计划显示索引名称timestamp_indtimestamp_sensor_ind. 但是这样的索引对搜索特定传感器没有帮助。

To resolve an equals query (like sensor.id = data.sensor_id) the column has to be the first in the index. Try to add an index that allows searching on sensor_idand, within a sensor, is sorted by timestamp:

要解析等于查询(如sensor.id = data.sensor_id),该列必须是索引中的第一个。尝试添加允许搜索的索引,sensor_id并在传感器内按时间戳排序:

create index sensor_timestamp_ind on data(sensor_id, timestamp);

Does adding that index speed up the query?

添加该索引是否会加快查询速度?