Maximum (usable) number of rows in a Postgresql table

Disclaimer: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3132444/

Date: 2020-09-10 22:36:25 · Source: igfitidea

Maximum (usable) number of rows in a Postgresql table

postgresql

Asked by punkish

I realize that, per the Pg docs (http://www.postgresql.org/about/), one can store an unlimited number of rows in a table. However, what is the rule of thumb, if any, for a usable number of rows?

Background: I want to store daily readings for a couple of decades for 13 million cells. That works out to 13 M * (366|365) * 20 ~ 9.5e10, or 95 B rows (in reality, around 120 B rows).

So, using table partitioning, I set up a master table, and then inherited tables by year. That divvies up the rows to ~ 5.2 B rows per table.

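A minimal sketch of that inheritance-based layout, assuming hypothetical table and column names (the question does not show its actual schema):

    -- Master table: nine SMALLINT readings plus two 4-byte columns
    -- (a DATE is 4 bytes, the same as an INT), matching the row
    -- layout described below.
    CREATE TABLE readings (
        cell_id integer NOT NULL,
        day     date    NOT NULL,
        v1 smallint, v2 smallint, v3 smallint,
        v4 smallint, v5 smallint, v6 smallint,
        v7 smallint, v8 smallint, v9 smallint
    );

    -- One inherited child per year; the CHECK constraint lets the
    -- planner skip irrelevant children when queries filter on day.
    CREATE TABLE readings_2010 (
        CHECK (day >= DATE '2010-01-01' AND day < DATE '2011-01-01')
    ) INHERITS (readings);

    SET constraint_exclusion = partition;  -- enable child-table pruning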

Each row is nine SMALLINTs and two INTs, so 26 bytes. Add to that the Pg overhead of 23 bytes per row, and we get 49 bytes per row. So each table, without any PK or any other index, will weigh in at ~0.25 TB.

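The arithmetic holds: 9 × 2 + 2 × 4 = 26 bytes of data, plus the 23-byte tuple header, gives 49 bytes, and 5.2e9 × 49 bytes ≈ 0.25 TB. Once a subset is loaded, the estimate can be checked against the catalog; a sketch, using the hypothetical child table from above:

    -- Measured bytes per live row for one child table; expect slightly
    -- more than 49 because of page headers and line pointers.
    SELECT pg_size_pretty(pg_total_relation_size('readings_2010')) AS total_size,
           pg_total_relation_size('readings_2010')
               / greatest(n_live_tup, 1) AS bytes_per_row
    FROM pg_stat_user_tables
    WHERE relname = 'readings_2010';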

For starters, I have created only a subset of the above data, that is, only for about 250,000 cells. I have to do a bunch of tuning (create proper indexes, etc.), but the performance is really terrible right now. Besides, every time I need to add more data, I will have to drop the keys and then recreate them. The saving grace is that once everything is loaded, it will be a read-only database.

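That drop-load-recreate cycle is a common bulk-loading pattern; a sketch, again with the hypothetical names from above:

    -- Load one year's data with the index out of the way, then rebuild.
    DROP INDEX IF EXISTS readings_2010_day_idx;

    COPY readings_2010 FROM '/path/to/2010.csv' WITH (FORMAT csv);

    CREATE INDEX readings_2010_day_idx ON readings_2010 (day);
    ANALYZE readings_2010;  -- refresh planner statistics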

Any suggestions? Any other strategy for partitioning?

Answered by Konrad Garus

It's not just "a bunch of tuning (indexes, etc.)". This is crucial and a must-do.

You posted few details, but let's try.

The rule is: try to find the most common working set and see if it fits in RAM. Optimize hardware, PG/OS buffer settings, and PG indexes/clustering for it. Otherwise, look for aggregates, or, if that's not acceptable and you need fully random access, think about what hardware could scan the whole table for you in a reasonable time.

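One way to check whether a representative query is actually served from RAM is to compare buffer hits with reads; a sketch against the hypothetical table above:

    -- "shared hit" blocks came from shared_buffers; "read" blocks had
    -- to come from the OS page cache or from disk.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT avg(v1)
    FROM readings_2010
    WHERE day BETWEEN DATE '2010-06-01' AND DATE '2010-06-30';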

How large is your table (in gigabytes)? How does it compare to total RAM? What are your PG settings, including shared_buffers and effective_cache_size? Is this a dedicated server? If you have a 250-gig table and about 10 GB of RAM, you can fit only 4% of the table in memory.

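Both numbers are easy to pull from the server itself, for example:

    SHOW shared_buffers;        -- PG's own buffer cache
    SHOW effective_cache_size;  -- planner's estimate of PG + OS caching

    -- Total on-disk size of one child table, indexes included
    SELECT pg_size_pretty(pg_total_relation_size('readings_2010'));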

Are there any columns which are commonly used for filtering, such as state or date? Can you identify the working set that is most commonly used (like only the last month)? If so, consider partitioning or clustering on these columns, and definitely index them. Basically, you're trying to make sure that as much of the working set as possible fits in RAM.

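For a date-filtered workload, that could look like the following sketch: index the date column, then rewrite each child table in index order so the hot rows sit on as few pages as possible.

    CREATE INDEX readings_2010_day_idx ON readings_2010 (day);

    -- Physically order the table by day; range scans then touch far
    -- fewer pages. A one-time cost, since the final DB is read-only.
    CLUSTER readings_2010 USING readings_2010_day_idx;
    ANALYZE readings_2010;  -- refresh statistics after the rewrite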

Avoid scanning the table at all costs if it does not fit in RAM. If you really need absolutely random access, the only way it could be usable is really sophisticated hardware: you would need a persistent storage/RAM configuration that can read 250 GB in a reasonable time (for example, at ~500 MB/s of sequential throughput, a full 250 GB scan still takes over eight minutes).
