Speeding up a group by date query on a big table in postgres

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4675605/

Tags: sql, database, postgresql, indexing

Asked by zaius

I've got a table with around 20 million rows. For argument's sake, let's say there are two columns in the table - an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment.

  SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE DATE(timestamp) >= '20100101'
     AND DATE(timestamp) <  '20110101'
GROUP BY day;

Without any indices, this takes about 30 seconds to run on my machine. Here's the explain analyze output:

 GroupAggregate  (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
   ->  Sort  (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
         Sort Key: (date("timestamp"))
         Sort Method:  external merge  Disk: 372496kB
         ->  Seq Scan on actions  (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
               Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
 Total runtime: 32447.762 ms

Since I'm seeing a sequential scan, I tried to index on the date expression:

CREATE INDEX ON actions (DATE(timestamp));

Which cuts the runtime by about 50%:

 HashAggregate  (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
   ->  Seq Scan on actions  (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
         Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
 Total runtime: 17038.663 ms

I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?

--edit--

It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?

Accepted answer by a_horse_with_no_name

Is there a way to partition up the table?

Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

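For what it's worth, PostgreSQL 10 and later support declarative partitioning, which is the natural fit here. A minimal sketch of range-partitioning such a table by date; the table and partition names are illustrative, not from the original answer:

-- Hypothetical declarative partitioning (PostgreSQL 10+):
CREATE TABLE actions_partitioned (
    id        bigint,
    timestamp timestamptz NOT NULL
) PARTITION BY RANGE (timestamp);

-- One partition per year; a query constrained to a date range only
-- scans the partitions that overlap it.
CREATE TABLE actions_2010 PARTITION OF actions_partitioned
    FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');
CREATE TABLE actions_2011 PARTITION OF actions_partitioned
    FOR VALUES FROM ('2011-01-01') TO ('2012-01-01');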

Or create a cache table with all the count values? Or any other options?

Creating a "cache" table certainly is possible. But this depends on how often you need that result and how accurate it needs to be.

CREATE TABLE action_report
AS
SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE DATE(timestamp) >= '20100101'
     AND DATE(timestamp) <  '20110101'
GROUP BY day;

Then a SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.

This approach of course won't help if the time range changes with every query or if that query is only run once a day.

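As an alternative not mentioned in the original answer: on PostgreSQL 9.3 or later, a materialized view does the same job with less ceremony. A sketch:

-- Build the cached result once:
CREATE MATERIALIZED VIEW action_report AS
SELECT DATE(timestamp) AS day, COUNT(*) AS cnt
  FROM actions
 GROUP BY 1;

-- Then run this from cron instead of dropping and recreating a table:
REFRESH MATERIALIZED VIEW action_report;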

Answer by Zeki

In general most databases will ignore indexes if the expected number of rows returned is going to be high. This is because for each index hit, it will need to then find the row as well, so it's faster to just do a full table scan. This number is between 10,000 and 100,000. You can experiment with this by shrinking the date range and seeing where postgres flips to using the index. In this case, postgres is planning to scan 17,301,674 rows, so your table is pretty large. If you make it really small and you still feel like postgres is making the wrong choice then try running an analyze on the table so that postgres gets its approximations right.

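A sketch of that experiment; the one-week range below is just an example to narrow the row count until the plan changes:

-- Shrink the range and watch for the planner switching to the index:
EXPLAIN ANALYZE
SELECT DATE(timestamp) AS day, COUNT(*)
  FROM actions
 WHERE DATE(timestamp) >= '2010-12-25'
   AND DATE(timestamp) <  '2011-01-01'
 GROUP BY day;

-- Refresh the planner's row-count estimates for the table:
ANALYZE actions;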

Answer by Peter Eisentraut

Set work_mem to, say, 2GB and see if that changes the plan. If it doesn't, you might be out of options.

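For a one-off test this can be done per session, for example:

-- Session-level setting; 2GB is the value suggested above:
SET work_mem = '2GB';
-- Re-run the query (or EXPLAIN ANALYZE it) in the same session to see
-- whether the external-merge sort on disk moves into memory.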

Answer by RichardTheKiwi

It looks like the range just about covers all the data available.

This could be a design issue. If you will be running this often, you are better off creating an additional column timestamp_date that contains only the date. Then create an index on that column, and change the query accordingly. The column should be maintained by insert+update triggers (a sketch of such a trigger follows the query below).

SELECT timestamp_date AS day, COUNT(*)
FROM actions
WHERE timestamp_date >= '20100101'
  AND timestamp_date <  '20110101'
GROUP BY day;
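
A minimal sketch of that maintenance; the column, index, and trigger names are illustrative, not from the original answer:

-- Add, backfill, and index the new column:
ALTER TABLE actions ADD COLUMN timestamp_date date;
UPDATE actions SET timestamp_date = DATE(timestamp);
CREATE INDEX actions_timestamp_date_idx ON actions (timestamp_date);

-- Keep the column in sync on insert and update:
CREATE OR REPLACE FUNCTION actions_set_date() RETURNS trigger AS $$
BEGIN
    NEW.timestamp_date := DATE(NEW.timestamp);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER actions_set_date_trg
BEFORE INSERT OR UPDATE ON actions
FOR EACH ROW EXECUTE PROCEDURE actions_set_date();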

If I am wrong about the number of rows the date range will find (and it is only a small subset), then you can try an index on just the timestamp column itself, applying the WHERE clause to just the column (which, given the range, works just as well):

SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE timestamp >= '20100101'
  AND timestamp <  '20110101'
GROUP BY day;
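
Presumably something like the following, with an illustrative index name:

-- Plain B-tree index on the raw timestamp column; the range predicate
-- above can use it directly:
CREATE INDEX actions_timestamp_idx ON actions (timestamp);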

Answer by araqnid

Try running explain analyze verbose ... to see if the aggregate is using a temp file. Perhaps increase work_mem to allow more to be done in memory?

Answer by Kiriakos Georgiou

What you really want for such DSS-type queries is a date table that describes days. In database design lingo it's called a date dimension. To populate such a table you can use the code I posted in this article: http://www.mockbites.com/articles/tech/data_mart_temporal

Then in each row in your actions table put the appropriate date_key.

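In case the linked article is unavailable, here is a minimal, hypothetical version of such a table; only a month attribute is shown, where a real date dimension would carry many more:

-- Minimal hypothetical date dimension:
CREATE TABLE date_dimension (
    date_key  int PRIMARY KEY,      -- e.g. 20100101
    full_date date NOT NULL UNIQUE,
    month     int NOT NULL          -- one of many possible attributes
);

-- Populate one row per day over a chosen span:
INSERT INTO date_dimension (date_key, full_date, month)
SELECT to_char(d, 'YYYYMMDD')::int,
       d::date,
       EXTRACT(month FROM d)::int
  FROM generate_series('2000-01-01'::timestamp,
                       '2030-12-31', '1 day') AS g(d);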

Your query then becomes:

SELECT
   d.full_date, COUNT(*)
FROM actions a 
JOIN date_dimension d 
    ON a.date_key = d.date_key
WHERE d.full_date = '2010/01/01'
GROUP BY d.full_date

Assuming indices on the keys and full_date, this will be super fast because it operates on INT4 keys!

Another benefit is that you can now slice and dice by any other date_dimension column(s).

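For example, using the hypothetical month column from the sketch above:

-- Counts per calendar month across all years (illustrative):
SELECT d.month, COUNT(*)
  FROM actions a
  JOIN date_dimension d ON a.date_key = d.date_key
 GROUP BY d.month
 ORDER BY d.month;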