Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13998139/

Time: 2020-10-21 00:39:37  Source: igfitidea

Optimize Postgres timestamp query range

Tags: postgresql, indexing, query-optimization, database-partitioning, postgresql-performance

Asked by user1754724

I have the following table and indices defined:

CREATE TABLE ticket
(
  wid bigint NOT NULL DEFAULT nextval('tickets_id_seq'::regclass),
  eid bigint,
  created timestamp with time zone NOT NULL DEFAULT now(),
  status integer NOT NULL DEFAULT 0,
  argsxml text,
  moduleid character varying(255),
  source_id bigint,
  file_type_id bigint,
  file_name character varying(255),
  status_reason character varying(255),
  ...
)

I created an index on the created timestamp as follows:

CREATE INDEX ticket_1_idx
  ON ticket
  USING btree
  (created );

And here's my query:

select * from ticket 
where created between '2012-12-19 00:00:00' and  '2012-12-20 00:00:00'

This was working fine until the number of records started to grow (about 5 million) and now it's taking forever to return.

Explain analyze reveals this:

"Index Scan using ticket_1_idx on ticket  (cost=0.00..10202.64 rows=52543 width=1297) (actual time=0.109..125.704 rows=53340 loops=1)"
"  Index Cond: ((created >= '2012-12-19 00:00:00+00'::timestamp with time zone) AND (created <= '2012-12-20 00:00:00+00'::timestamp with time zone))"
"Total runtime: 175.853 ms"

So far I've tried setting:

random_page_cost = 1.75 
effective_cache_size = 3 

I also tried:

create CLUSTER ticket USING ticket_1_idx;

Nothing works. What am I doing wrong? Why is it selecting sequential scan? The indexes are supposed to make the query fast. Anything that can be done to optimize it?

Answered by Erwin Brandstetter

CLUSTER

If you intend to use CLUSTER, the displayed syntax is invalid.

create CLUSTER ticket USING ticket_1_idx;

Run once:

CLUSTER ticket USING ticket_1_idx;

This can help a lot with bigger result sets, but not so much when a single row is returned.
Postgres remembers which index to use for subsequent calls. If your table isn't read-only, the effect deteriorates over time and you need to re-run at certain intervals:

CLUSTER ticket;

Possibly only on volatile partitions. See below.
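
One way to judge whether a re-run is due (a sketch, not part of the original answer) is to check how closely the physical row order still follows the indexed column. The pg_stats view reports this as correlation; values near 1 or -1 suggest the table is still well ordered on created. The statistic is only refreshed by ANALYZE, so it is an estimate:

```sql
-- Refresh planner statistics first, otherwise the value may be stale.
ANALYZE ticket;

-- correlation close to 1 (or -1): rows on disk largely follow "created".
-- correlation close to 0: physical order has drifted, CLUSTER may pay off.
SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'ticket'
AND    attname   = 'created';
```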

However, if you have lots of updates, CLUSTER (or VACUUM FULL) may actually be bad for performance. The right amount of bloat allows UPDATE to place new row versions on the same data page and avoids the need to physically extend the underlying file in the OS too often. You can use a carefully tuned FILLFACTOR to get the best of both worlds.
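
As a rough sketch of what that could look like: the value 90 below is an illustrative assumption, not a recommendation from the answer, and the right number depends on how update-heavy the table is.

```sql
-- Reserve roughly 10% of every data page for future row versions.
-- (90 is a placeholder value chosen for illustration.)
ALTER TABLE ticket SET (fillfactor = 90);

-- The setting only affects pages written from now on.
-- Rewriting the table, for example with CLUSTER, applies it to existing data.
CLUSTER ticket USING ticket_1_idx;
```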

pg_repack

CLUSTER takes an exclusive lock on the table, which may be a problem in a multi-user environment. Quoting the manual:

When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

Bold emphasis mine. Consider the alternative pg_repack:

Unlike CLUSTER and VACUUM FULL it works online, without holding an exclusive lock on the processed tables during processing. pg_repack is efficient to boot, with performance comparable to using CLUSTER directly.

and:

pg_repack needs to take an exclusive lock at the end of the reorganization.

Version 1.3.1 works with:

PostgreSQL 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4

Version 1.4.2 works with:

PostgreSQL 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 10

Query

The query is simple enough not to cause any performance problems per se.

However, a word on correctness: the BETWEEN construct includes both bounds. Your query selects all of Dec. 19, plus records from Dec. 20, 00:00 hours. That's an extremely unlikely requirement. Chances are, you really want:

SELECT *
FROM   ticket 
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';
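
To see the difference at the upper bound, here is a small self-contained check. It needs no table; the two sample timestamps are made up for illustration.

```sql
-- Compare the inclusive BETWEEN with the half-open range on two sample values.
SELECT ts,
       (ts BETWEEN '2012-12-19' AND '2012-12-20')   AS in_between,
       (ts >= '2012-12-19' AND ts < '2012-12-20')   AS in_half_open
FROM  (VALUES ('2012-12-19 23:59:59'::timestamptz),
              ('2012-12-20 00:00:00'::timestamptz)) AS t(ts);
-- Only the second row differs: midnight of Dec. 20 passes BETWEEN
-- but is excluded by the half-open range.
```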

Performance

First off, you ask:

Why is it selecting sequential scan?

Your EXPLAIN output clearly shows an Index Scan, not a sequential table scan. There must be some kind of misunderstanding.
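
To narrow down where the time goes, the plan can be re-run with buffer statistics (the BUFFERS option is available from PostgreSQL 9.0). Keep in mind that the reported runtime of about 176 ms covers server-side execution only. With more than 53,000 rows at an estimated width of roughly 1.3 kB each, sending the result to the client may well dominate, though the question does not confirm that.

```sql
-- Shows actual timings plus how many pages came from cache ("hit")
-- versus from disk ("read"). Network transfer to the client is not included.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   ticket
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';
```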

If you are pressed hard for better performance, you may be able to improve things. But the necessary background information is not in the question. Possible options include:

  • You could query only the required columns instead of * to reduce transfer cost (and possibly gain other performance benefits).

  • You could look at partitioning and put practical time slices into separate tables. Add indexes to partitions as needed.

  • If partitioning is not an option, another related but less intrusive technique would be to add one or more partial indexes.
    For example, if you mostly query the current month, you could create the following partial index:

    -- created is timestamp with time zone, so compare against a timestamptz constant
    CREATE INDEX ticket_created_idx ON ticket(created)
    WHERE created >= '2012-12-01 00:00:00+00'::timestamptz;


    CREATE a new index right before the start of a new month. You can easily automate the task with a cron job. Optionally DROP partial indexes for old months later.

  • Keep the total index in addition, for CLUSTER (which cannot operate on partial indexes). If old records never change, table partitioning would help this task a lot, since you only need to re-cluster newer partitions. Then again, if records never change at all, you probably don't need CLUSTER.
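
The partitioning option might look roughly like this. It is a sketch under assumptions: it uses declarative partitioning, which requires PostgreSQL 10 or later (older versions would rely on table inheritance instead), and the names ticket_part, ticket_part_2012_12 and ticket_part_2012_12_created_idx are hypothetical. Only a few columns of the original table are shown.

```sql
-- Parent table, partitioned by the timestamp column (PostgreSQL 10+).
CREATE TABLE ticket_part (
  wid     bigint NOT NULL,
  created timestamp with time zone NOT NULL DEFAULT now(),
  status  integer NOT NULL DEFAULT 0
) PARTITION BY RANGE (created);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE ticket_part_2012_12 PARTITION OF ticket_part
  FOR VALUES FROM ('2012-12-01') TO ('2013-01-01');

-- Index the partition and re-cluster only this (volatile) partition.
CREATE INDEX ticket_part_2012_12_created_idx
  ON ticket_part_2012_12 (created);
CLUSTER ticket_part_2012_12 USING ticket_part_2012_12_created_idx;

-- Query only the columns that are needed instead of *.
SELECT wid, created, status
FROM   ticket_part
WHERE  created >= '2012-12-19 0:0'
AND    created <  '2012-12-20 0:0';
```

With monthly partitions, a query for a single day only has to touch one partition, and older partitions can be left alone once they stop changing.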

If you combine the last two steps, performance should be awesome.

Performance Basics

You may be missing one of the basics. All the usual performance advice applies:
