postgresql Postgres pg_dump 每次都以不同的顺序转储数据库

Question

提问by littleK

I am writing a PHP script (which also uses linux bash commands) which will run through test cases by doing the following:

我正在编写一个 PHP 脚本（它也使用 linux bash 命令），它将通过执行以下操作来运行测试用例：

I am using a PostgreSQL database (8.4.2)...

我正在使用 PostgreSQL 数据库 (8.4.2)...

1.) Create a DB 2.) Modify the DB 3.) Store a database dump of the DB (pg_dump)

1.) 创建 DB 2.) 修改 DB 3.) 存储 DB 的数据库转储 (pg_dump)

4.) Do regression testing by doing steps 1.) and 2.), and then take another database dump and compare it (diff) with the original database dump from step number 3.)

4.) 通过执行步骤 1.) 和 2.) 进行回归测试，然后进行另一个数据库转储并将其与步骤 3 中的原始数据库转储进行比较（差异）。）

However, I am finding that pg_dump will not always dump the database in the same way. It will dump things in a different order every time. Therefore, when I do a diff on the two database dumps, the comparison will result in the two files being different, when they are actually the same, just in a different order.

但是，我发现 pg_dump 并不总是以相同的方式转储数据库。它每次都会以不同的顺序转储东西。因此，当我对两个数据库转储进行 diff 时，比较将导致两个文件不同，而实际上它们相同，只是顺序不同。

Is there a different way I can go about doing the pg_dump?

有没有其他方法可以做 pg_dump？

Thanks!

谢谢！

Answer 1

采纳答案by Tometzky

It is impossible to force pg_dump to dump data in any particular order, as it dumps data in disk order - it is much faster this way.

强制 pg_dump 以任何特定顺序转储数据是不可能的，因为它以磁盘顺序转储数据 - 这种方式要快得多。

You can use "-a -d" options for pg_dump and then "sort" output, but newlines in data will make sorted output unusable. But for basic comparison, whether anything changed, it would suffice.

您可以对 pg_dump 使用“-a -d”选项，然后对输出使用“排序”，但数据中的换行符将使排序输出不可用。但是对于基本的比较，不管有什么变化，就足够了。

Answer 2

回答by akaihola

Here is a handy script for pre-processing pg_dumpoutput to make it more suitable for diffing and storing in version control:

这是一个用于预处理pg_dump输出的方便脚本，使其更适合版本控制中的差异和存储：

https://github.com/akaihola/pgtricks

pg_dump_splitsort.pysplits the dump into the following files:

pg_dump_splitsort.py将转储拆分为以下文件：

0000_prologue.sql: everything up to the first COPY
0001_<schema>.<table>.sql
.
.
NNNN_<schema>.<table>.sql: data for each table sorted by the first field
9999_epilogue.sql: everything after the last COPY

0000_prologue.sql: 直到第一个 COPY
0001_<schema>.<table>.sql
.
.
NNNN_<schema>.<table>.sql: 每个表的数据按第一个字段排序
9999_epilogue.sql: 最后一次 COPY 之后的所有内容

The files for table data are numbered so a simple sorted concatenation of all files can be used to re-create the database:

表数据的文件已编号，因此可以使用所有文件的简单排序串联来重新创建数据库：

$ cat *.sql | psql <database>

I've found that a good way to take a quick look at differences between dumps is to use the meldtool on the whole directory:

我发现快速查看转储之间差异的一个好方法是meld在整个目录上使用该工具：

$ meld old-dump/ new-dump/

Storing the dump in version control also gives a decent view on the differences. Here's how to configure git to use color in diffs:

将转储存储在版本控制中也可以很好地了解差异。以下是如何配置 git 以在差异中使用颜色：

# ~/.gitconfig
[color]
        diff = true
[color "diff"]
        frag = white blue bold
        meta = white green bold
        commit = white red bold

Note:If you have created/dropped/renamed tables, remember to delete all .sqlfiles before post-processing the new dump.

注意：如果您已创建/删除/重命名表，请记住.sql在对新转储进行后处理之前删除所有文件。

Answer 3

回答by Peter Eisentraut

It's worth distinguishing schema and data here. The schema is dumped in a fairly deterministic order, most objects alphabetically, constrained by inter-object dependencies. There are some limited cases where the order is not fully constrained and may appear random to an outside observer, but that may get fixed in the next version.

值得在这里区分模式和数据。模式以相当确定的顺序转储，大多数对象按字母顺序排列，受对象间依赖关系的约束。在某些有限的情况下，顺序没有完全受到约束，并且在外部观察者看来可能是随机的，但这可能会在下一个版本中得到修复。

The data on the other hand is dumped in disk order. This is usually what you want, because you want dumps to be fast and not use insane amounts of resources to do sorting. What you might be observing is that when you "modify the DB" you are doing an UPDATE, which will actually delete the old value and append the new value at the end. And that will of course upset your diff strategy.

另一方面，数据按磁盘顺序转储。这通常是您想要的，因为您希望转储速度快，而不是使用大量资源进行排序。您可能会观察到，当您“修改数据库”时，您正在执行 UPDATE，这实际上会删除旧值并在末尾附加新值。这当然会扰乱你的差异策略。

A tool that might be more suitable for your purpose is pg_comparator.

可能更适合您的目的的工具是pg_comparator。

Answer 4

回答by zifot

As of may 2010 a patch to pg_dumpexists that may be helpful to all interested in this matter - it adds "--ordered" option to this utility:

截至 2010 年 5 月，pg_dump的补丁已经存在，可能对所有对此问题感兴趣的人都有帮助——它为此实用程序添加了“--ordered”选项：

Using --ordered will order the data by primary key or unique index, if one exists, and use the "smallest" ordering (i.e. least number of columns required for a unique order).
Note that --ordered could crush your database server if you try to order very large tables, so use judiciously.

使用 --ordered 将按主键或唯一索引（如果存在）对数据进行排序，并使用“最小”排序（即唯一顺序所需的最少列数）。
请注意，如果您尝试订购非常大的表， --ordered 可能会破坏您的数据库服务器，因此请谨慎使用。

I didn't test it, but I guess it's worth a try.

我没有测试它，但我想它值得一试。

Answer 5

回答by cope360

If you are just interested in the schema:

如果您只对架构感兴趣：

You could do your diff table by table-by-using a combination of these options to dump the schema for only one table at a time. You could then compare them individually or cat them all to one file in a known order.

您可以通过使用这些选项的组合逐个表来制作差异表，以一次仅转储一个表的架构。然后，您可以单独比较它们或以已知顺序将它们全部归入一个文件。

-s, --schema-only           dump only the schema, no data
-t, --table=TABLE           dump the named table(s) only

To generate the list of tables to feed to the above, query information_schema.tables.

要生成要提供给上述内容的表列表，请查询information_schema.tables.

Answer 6

回答by Daniel Brückner

It is not unusual that PostgreSQL behaves nondeterministically - maybe timer triggered reorganization processes or something like that occur in the background. Further I am not aware of a way to force pg_dump to reproduce a bit-identical output on successive runs.

PostgreSQL 的非确定性行为并不罕见 - 可能是计时器触发的重组过程或类似的事情发生在后台。此外，我不知道有什么方法可以强制 pg_dump 在连续运行中重现位相同的输出。

I suggest to change your comparison logic because it is your comparison that is misbehaved - it reports differences while both dumps represent the same database state. This of course means some additional work but is in my opinion the correct way to attack the problem.

我建议更改您的比较逻辑，因为是您的比较行为不当 - 它报告差异，而两个转储都表示相同的数据库状态。这当然意味着一些额外的工作，但在我看来是解决问题的正确方法。

Answer 7

回答by Risadinha

If performance is less important than order you could use:

如果性能不如顺序重要，您可以使用：

COPY (select * from your_table order by some_col) to stdout
      with csv header delimiter ',';

See COPY (9.5)

参见复制 (9.5)

postgresql Postgres pg_dump 每次都以不同的顺序转储数据库

提问by littleK

采纳答案by Tometzky

回答by akaihola

回答by Peter Eisentraut

回答by zifot

回答by cope360

回答by Daniel Brückner

回答by Risadinha

相关推荐

最近更新

标签

postgresql Postgres pg_dump 每次都以不同的顺序转储数据库

提问by littleK

采纳答案by Tometzky

回答by akaihola

回答by Peter Eisentraut

回答by zifot

回答by cope360

回答by Daniel Brückner

回答by Risadinha

相关推荐

PostgreSQL ，从 2 个表中选择，但只有表 2 中的最新元素

如何使用 spring JdbcTemplate 更新 postgresql 数组列？

postgresql 包含非 EF 方法调用的 Select 子句

postgresql 处理并发问题的最佳方式

相关推荐

最近更新

标签