postgresql: How many table partitions is too many in Postgres?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/6104774/


How many table partitions is too many in Postgres?

Tags: performance, postgresql, partitioning

Asked by DNS

I'm partitioning a very large table that contains temporal data, and considering what granularity to use for the partitions. The Postgres partition documentation claims that "large numbers of partitions are likely to increase query planning time considerably" and recommends that partitioning be used with "up to perhaps a hundred" partitions.


Assuming my table holds ten years of data, if I partitioned by week I would end up with over 500 partitions. Before I rule this out, I'd like to better understand what impact partition quantity has on query planning time. Has anyone benchmarked this, or does anyone have an understanding of how this works internally?

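For concreteness, here is a minimal sketch of the inheritance-style weekly partitioning the question describes, as it was done before declarative partitioning arrived in PostgreSQL 10; all table and column names are made up:

    -- Hypothetical parent table for ten years of temporal data.
    CREATE TABLE measurements (
        id       bigserial,
        recorded timestamptz NOT NULL,
        payload  text
    );

    -- One child table per week; the CHECK constraint is the metadata
    -- the planner scans when deciding which children a query touches.
    CREATE TABLE measurements_2011_w01 (
        CHECK (recorded >= '2011-01-03' AND recorded < '2011-01-10')
    ) INHERITS (measurements);

    CREATE INDEX ON measurements_2011_w01 (recorded);

With ten years of weekly children, the planner has over 500 of these CHECK constraints to consider, which is exactly the cost the answers below discuss.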

Accepted answer by Greg Smith

The query planner has to do a linear search of the constraint information for every partition of the tables used in the query, to figure out which are actually involved: the ones that can contain rows needed for the requested data. The number of query plans the planner considers grows exponentially as you join more tables, so the exact point where that linear search adds up to enough time to be troubling really depends on query complexity. The more joins, the harder this hits you. The "up to a hundred" figure came from noting that query planning time was adding up to a non-trivial amount even on simpler queries around that point. On web applications in particular, where response latency matters, that's a problem; thus the warning.

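If you want to see that cost directly, reasonably recent PostgreSQL versions report planning time separately in EXPLAIN ANALYZE output, so you can watch it grow as you add partitions; this sketch reuses the hypothetical table from above:

    -- The planning-time line in the output reflects the constraint
    -- search described here; execution time is reported separately.
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM   measurements
    WHERE  recorded >= '2011-01-03' AND recorded < '2011-01-10';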

Can you support 500? Sure. But the optimizer is going to search every one of those 500 check constraints for every query plan involving that table. If query planning time isn't a concern for you, then maybe you don't care. But most sites end up disliking the proportion of time spent on query planning with that many partitions, which is one reason monthly partitioning is the standard for most data sets. You can easily store 10 years of data, partitioned monthly, before planning overhead starts to become noticeable.

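The check-constraint scan described here is governed by the constraint_exclusion setting (a real GUC, though its default has varied by version); the 'partition' setting limits the scan to inheritance children and UNION ALL subqueries:

    -- Inspect and set the behavior that makes the planner examine
    -- child-table CHECK constraints at all.
    SHOW constraint_exclusion;
    SET constraint_exclusion = partition;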

Answer by Denis de Bernardy

"large numbers of partitions are likely to increase query planning time considerably" and recommends that partitioning be used with "up to perhaps a hundred" partitions.


Because every extra partition will usually be tied to check constraints, and this leads the planner to work out which of the partitions need to be queried against. In the best-case scenario, the planner identifies that you're only hitting a single partition and gets rid of the Append step altogether.

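A quick way to check whether that pruning happened is to look at the plan shape; this sketch reuses the hypothetical tables from above:

    -- With effective pruning, only the matching child (plus the
    -- usually-empty parent, under old-style inheritance) is scanned;
    -- when many children survive, an Append node combines their scans.
    EXPLAIN
    SELECT *
    FROM   measurements
    WHERE  recorded = '2011-01-05';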

In terms of rows, and as DNS and Seth have pointed out, your mileage will vary with the hardware. Generally speaking, though, there's no significant difference between querying a 1M-row table and a 10M-row table, especially if your hard drives allow for fast random access and if the table is clustered (see the CLUSTER statement) on the index you hit most frequently.

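As an illustration of that last point, clustering a partition physically rewrites it in the order of an index; the table and index names below follow the hypothetical schema above, and note that CLUSTER takes an exclusive lock while it runs:

    -- Rewrite one weekly partition in the order of its timestamp index.
    CLUSTER measurements_2011_w01 USING measurements_2011_w01_recorded_idx;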

Answer by Sean

Each table partition takes up an inode on the file system. "Very large" is a relative term that depends on the performance characteristics of your file system of choice. If you want explicit performance benchmarks, you could probably look at various performance benchmarks of mail systems from your OS and FS of choice. Generally speaking, I wouldn't worry about it until you get into the tens of thousands to hundreds of thousands of table spaces (using dirhash on FreeBSD's UFS2 would be a win). Also note that this same limitation applies to DATABASES, TABLES, and any other filesystem-backed database object in PostgreSQL.

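You can confirm the one-file-per-relation point from SQL on a modern server; these catalog queries use a hypothetical child table from the earlier sketch:

    -- Path of the file backing one partition, relative to the data directory.
    SELECT pg_relation_filepath('measurements_2011_w01');

    -- Rough count of filesystem-backed relations (tables, indexes, TOAST)
    -- in the current database.
    SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 't');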

Answer by Seth Robertson

If you don't want to trust the PostgreSQL developers who wrote the code, then I recommend that you simply try it yourself: run a few example queries with EXPLAIN ANALYZE and time them under different partitioning schemes. Your specific hardware and software configuration is likely to dominate any answer in any case.

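A throwaway sketch for that kind of benchmark, assuming the parent table from the question sketch: generate ten years of weekly children, then compare planning times at different partition counts (all names invented):

    DO $$
    DECLARE
        wk date := '2011-01-03';
    BEGIN
        -- 520 weekly partitions, roughly ten years' worth.
        FOR i IN 1..520 LOOP
            EXECUTE format(
                'CREATE TABLE measurements_w%s (CHECK (recorded >= %L AND recorded < %L)) INHERITS (measurements)',
                i, wk, wk + 7);
            wk := wk + 7;
        END LOOP;
    END
    $$;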

I'm assuming that the row optimization cache which the query optimizer uses to determine what joins and restrictions to use is stored with each partition, so it probably needs to load and read parts of each partition to plan the query.

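For what it's worth, the per-partition planner inputs being guessed at here are the ordinary per-table statistics, which you can inspect through the pg_stats view (child table name hypothetical, from the earlier sketch):

    -- Column statistics the planner consults for one partition.
    SELECT attname, n_distinct, correlation
    FROM   pg_stats
    WHERE  tablename = 'measurements_2011_w01';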