Calculating and saving space in PostgreSQL

Question

Asked by punkish

I have a table in pg like so:

CREATE TABLE t (
    a BIGSERIAL NOT NULL,               -- 8 b
    b SMALLINT,                         -- 2 b
    c SMALLINT,                         -- 2 b
    d REAL,                             -- 4 b
    e REAL,                             -- 4 b
    f REAL,                             -- 4 b
    g INTEGER,                          -- 4 b
    h REAL,                             -- 4 b
    i REAL,                             -- 4 b
    j SMALLINT,                         -- 2 b
    k INTEGER,                          -- 4 b
    l INTEGER,                          -- 4 b
    m REAL,                             -- 4 b
    CONSTRAINT a_pkey PRIMARY KEY (a)
);

The above adds up to 50 bytes per row. My experience is that I need another 40% to 50% for system overhead, without even any user-created indexes to the above. So, about 75 bytes per row. I will have many, many rows in the table, potentially upward of 145 billion rows, so the table is going to be pushing 13-14 terabytes. What tricks, if any, could I use to compact this table? My possible ideas below ...

Convert the real values to integer. If they can be stored as smallint, that is a saving of 2 bytes per field.

Convert the columns b .. m into an array. I don't need to search on those columns, but I do need to be able to return one column's value at a time. So, if I need column g, I could do something like

SELECT a, arr[5] FROM t;

Would I save space with the array option? Would there be a speed penalty?

Any other ideas?

Answer 1

Accepted answer by leonbloy

I see nothing to gain (and something to lose) in storing several numeric fields in an array.

The size of each numerical type is clearly documented. You should simply use the smallest type compatible with your desired range and resolution; that's about all you can do.

I don't think there is a byte alignment requirement for the columns along a row (but I'm not sure). If there is one, reordering the columns could alter the space used, but I don't think so.

BTW, there is a fixed overhead per row of about 23 bytes.

Answer 2

Answered by Erwin Brandstetter

"Column Tetris"

Actually, you can do something, but this needs deeper understanding. The keyword is alignment padding. Every data type has specific alignment requirements.

You can minimize space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space:

CREATE TABLE t (
    e int2    -- 6 bytes of padding after int2
  , a int8
  , f int2    -- 6 bytes of padding after int2
  , b int8
  , g int2    -- 6 bytes of padding after int2
  , c int8
  , h int2    -- 6 bytes of padding after int2
  , d int8)

To save 24 bytes per row, use instead:

CREATE TABLE t (
    a int8
  , b int8
  , c int8
  , d int8
  , e int2
  , f int2
  , g int2
  , h int2)   -- 4 int2 occupy 8 byte (MAXALIGN), no padding at the end

SQL Fiddle.
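
The 24-byte difference between the two layouts can be checked with a bit of arithmetic. The following is only a rough Python sketch of the alignment rule, not anything Postgres itself runs: `align_up` and `data_size` are made-up helper names, MAXALIGN is assumed to be 8, and the tuple header is left out.

```python
# Alignment-aware size of the column data in one row (tuple header excluded).
# Each entry is (size_in_bytes, alignment_in_bytes); for int8 and int2 the
# required alignment equals the type size.
INT8 = (8, 8)
INT2 = (2, 2)

def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def data_size(columns, maxalign=8):
    """Bytes used by the column data, padding included, rounded up to MAXALIGN."""
    offset = 0
    for size, alignment in columns:
        offset = align_up(offset, alignment) + size
    return align_up(offset, maxalign)

interleaved = [INT2, INT8] * 4          # e, a, f, b, g, c, h, d
sorted_cols = [INT8] * 4 + [INT2] * 4   # a, b, c, d, e, f, g, h

print(data_size(interleaved))  # 64
print(data_size(sorted_cols))  # 40
```

On a real server, `pg_column_size()` remains the authoritative check.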

As a rule of thumb, if you put 8-byte columns first, then 4-bytes, 2-bytes and 1-byte columns last you can't go wrong.

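
As an illustration of that rule of thumb, here is a small Python sketch that sorts a made-up column list by alignment, widest first, and compares the padded size before and after. The column names, the helper functions and the MAXALIGN of 8 are all assumptions for the example.

```python
# (name, size_in_bytes, alignment_in_bytes) for a hypothetical table.
columns = [
    ("flag", 1, 1),   # boolean
    ("id",   8, 8),   # bigint
    ("n",    4, 4),   # integer
    ("code", 2, 2),   # smallint
    ("ts",   8, 8),   # timestamp
]

def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def data_size(cols, maxalign=8):
    """Bytes used by the column data, padding included, rounded up to MAXALIGN."""
    offset = 0
    for _name, size, alignment in cols:
        offset = align_up(offset, alignment) + size
    return align_up(offset, maxalign)

# Widest alignment first; sorted() is stable, so ties keep their original order.
reordered = sorted(columns, key=lambda col: -col[2])

print([name for name, _, _ in reordered])  # ['id', 'ts', 'n', 'code', 'flag']
print(data_size(columns))                  # 32
print(data_size(reordered))                # 24
```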

boolean, uuid and a few other types need no alignment padding. text, varchar and other "varlena" (variable length) types nominally require "int" alignment (4 bytes on most machines). But in fact there is no alignment padding in disk format (unlike in RAM). I verified in many tests. Eventually, I found the explanation in a note in the source code:

Note also that we allow the nominal alignment to be violated when storing "packed" varlenas;

Normally, you may save a couple of bytes per row at best playing "column tetris". None of this is necessary in most cases. But with billions of rows it can mean a couple of gigabytes easily.

You can test the actual column / row size with the function pg_column_size().
Some types occupy more space in RAM than on disk (compressed or "packed" format). You can get bigger results for constants (RAM format) than for table columns when testing the same value (or row of values vs. table row) with pg_column_size().

Finally, some types can be compressed or "toasted" (stored out of line) or both.

Overhead per tuple (row)

4 bytes per row for the item identifier - not subject to above considerations.
And at least 24 bytes (23 + padding) for the tuple header. The manual on Database Page Layout:

There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data.

For the padding between header and user data, you need to know MAXALIGN on your server: typically 8 bytes on a 64-bit OS (or 4 bytes on a 32-bit OS). If you are not sure, check out pg_controldata.

Run the following in your Postgres binary dir to get a definitive answer:

./pg_controldata /path/to/my/dbcluster

The manual:

The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform.

So you typically get the storage optimum by packing data in multiples of 8 bytes.

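
To make the header arithmetic concrete, here is a minimal Python sketch of how the offset to the user data comes out. It assumes a table without the optional object ID field and a null bitmap of one bit per column that is only present when the row contains at least one NULL; the function names are made up for the example.

```python
def align_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment

def tuple_header_size(n_columns, has_nulls, maxalign=8):
    """Offset where user data starts: 23 fixed bytes, plus the optional
    null bitmap, rounded up to a multiple of MAXALIGN."""
    size = 23
    if has_nulls:
        size += (n_columns + 7) // 8   # null bitmap, one bit per column
    return align_up(size, maxalign)

print(tuple_header_size(13, has_nulls=False))             # 24
print(tuple_header_size(8, has_nulls=True))               # 24
print(tuple_header_size(13, has_nulls=True))              # 32
print(tuple_header_size(13, has_nulls=True, maxalign=4))  # 28
```

So with up to 8 columns the null bitmap fits into the byte that would otherwise be padding, while a 9th column costs a full MAXALIGN step on rows that contain a NULL.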

There is nothing to gain in the example you posted. It's already packed tightly: 2 bytes of padding after the last int2, 4 bytes at the end. You could consolidate the padding to 6 bytes at the end, which wouldn't change anything.
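
Those padding figures can be reproduced by walking through the columns of the table in the question. This is just a Python sketch under stated assumptions: MAXALIGN is 8, no column in the row is NULL (so the tuple header is 24 bytes), and the alignment of each of these fixed-width types equals its size.

```python
# (name, size in bytes); for these fixed-width types the required
# alignment equals the size (bigint 8, integer/real 4, smallint 2).
COLUMNS = [
    ("a", 8), ("b", 2), ("c", 2), ("d", 4), ("e", 4), ("f", 4), ("g", 4),
    ("h", 4), ("i", 4), ("j", 2), ("k", 4), ("l", 4), ("m", 4),
]
MAXALIGN = 8        # assumed: typical 64-bit server
TUPLE_HEADER = 24   # 23 bytes + 1 byte of padding, no NULLs in the row
ITEM_ID = 4         # item identifier in the page

def layout(columns):
    """Return ({column: padding inserted before it}, end offset of the data)."""
    offset, padding = 0, {}
    for name, size in columns:
        pad = -offset % size           # bytes needed to reach the next multiple
        if pad:
            padding[name] = pad
        offset += pad + size
    return padding, offset

padding, raw_end = layout(COLUMNS)
tail_pad = -raw_end % MAXALIGN         # padding at the end of the row
data_bytes = raw_end + tail_pad
row_bytes = ITEM_ID + TUPLE_HEADER + data_bytes

print(padding)      # {'k': 2}
print(tail_pad)     # 4
print(data_bytes)   # 56
print(row_bytes)    # 84
```

Under these assumptions each row costs 84 bytes before any page-level overhead, rather than the 75 bytes estimated in the question.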

Overhead per data page

Data page size is typically 8 KB. There is some overhead / bloat at this level, too: remainders not big enough to fit another tuple and, more importantly, dead rows or a percentage reserved with the FILLFACTOR setting.
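
For a ballpark figure, the page-level overhead can be sketched in Python as well. The assumptions here are mine: 8192-byte pages with a 24-byte page header, an 80-byte tuple (24-byte header plus 56 bytes of aligned data, which is what the table in the question works out to when no column is NULL), no dead rows, and FILLFACTOR treated as a simple share of the page kept free. The function names are made up.

```python
PAGE_SIZE = 8192    # default block size
PAGE_HEADER = 24    # fixed page header
ITEM_ID = 4         # item identifier per row

def rows_per_page(tuple_bytes, fillfactor=100):
    """Approximate number of tuples that fit in one page."""
    reserved = PAGE_SIZE * (100 - fillfactor) // 100
    usable = PAGE_SIZE - PAGE_HEADER - reserved
    return usable // (tuple_bytes + ITEM_ID)

def table_bytes(n_rows, tuple_bytes, fillfactor=100):
    """Approximate heap size: whole pages only, no dead rows, no indexes."""
    per_page = rows_per_page(tuple_bytes, fillfactor)
    pages = -(-n_rows // per_page)     # integer ceiling division
    return pages * PAGE_SIZE

print(rows_per_page(80))                                  # 97
print(rows_per_page(80, fillfactor=90))                   # 87
print(round(table_bytes(145_000_000_000, 80) / 1e12, 1))  # 12.2
```

That is roughly 12 TB for the heap alone at 145 billion rows. The primary key index comes on top of that.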

There are a couple of other factors for size on disk to take into account:

Array types?

With an array type like you were evaluating, you would add 24 bytes of overhead for the type. Plus, array elements occupy space as usual. Nothing to gain there.
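
A quick back-of-the-envelope comparison in Python, taking the 24-byte array overhead at face value (the exact on-disk figure may differ slightly). It assumes columns b .. m from the question are folded into a single real[] array, since an array holds only one element type, which widens the three smallint columns to 4 bytes each.

```python
ARRAY_OVERHEAD = 24   # per-array overhead quoted above

def array_size(n_elements, element_bytes):
    """Approximate size of a one-dimensional array without NULL elements."""
    return ARRAY_OVERHEAD + n_elements * element_bytes

# Columns b .. m as separate columns occupy offsets 8 to 52 of the row data,
# including 2 bytes of padding: 44 bytes.
separate_columns = 52 - 8
as_array = array_size(12, 4)   # 12 values stored as 4-byte reals

print(separate_columns)  # 44
print(as_array)          # 72
```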

Answer 3

Answered by jboxxx

From this great documentation: https://www.2ndquadrant.com/en/blog/on-rocks-and-sand/

For a table you already have, or perhaps one you're making in development, named my_table, this query will give the optimal order left to right.

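-- List the columns of my_table, widest fixed-length types first;
-- variable-length types (typlen = -1) sort last.
SELECT a.attname, t.typname, t.typalign, t.typlen
FROM pg_class c
JOIN pg_attribute a ON (a.attrelid = c.oid)
JOIN pg_type t ON (t.oid = a.atttypid)
WHERE c.relname = 'my_table'
  AND a.attnum >= 0
ORDER BY t.typlen DESC;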
