Warning: this content is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/9407442/


Optimise PostgreSQL for fast testing

Tags: sql, database, performance, postgresql, database-tuning

Asked by Dmytrii Nagirniak

I am switching to PostgreSQL from SQLite for a typical Rails application.

The problem is that running specs became slow with PG.
On SQLite it took ~34 seconds; on PG it's ~76 seconds, which is more than 2x slower.

So now I want to apply some techniques to bring the performance of the specs on par with SQLite, with no code modifications (ideally just by setting the connection options, which is probably not possible).

A couple of obvious things off the top of my head are:

  • RAM disk (a good setup with RSpec on OS X would be nice to see)
  • Unlogged tables (can this be applied to the whole database so I don't have to change all the scripts?)

As you may have understood, I don't care about reliability and the rest (the DB is just a throwaway thing here).
I need to get the most out of PG and make it as fast as it can possibly be.

The best answer would ideally describe the tricks for doing just that, their setup, and the drawbacks of those tricks.

UPDATE: fsync = off + full_page_writes = off only decreased the time to ~65 seconds (~-16 secs). A good start, but far from the target of 34.

UPDATE 2: I tried to use a RAM disk but the performance gain was within the margin of error. So it doesn't seem to be worth it.

UPDATE 3: I found the biggest bottleneck and now my specs run as fast as the SQLite ones.

The issue was the database cleanup that did the truncation. Apparently SQLite is way too fast there.

To "fix" it I open a transaction before each test and roll it back at the end.

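In a Rails suite this is typically done with transactional fixtures or DatabaseCleaner's :transaction strategy; stripped down to plain SQL, the pattern is just:

```sql
BEGIN;
-- ... the test's INSERTs / UPDATEs / SELECTs run here ...
ROLLBACK;  -- undoes everything, so no TRUNCATE is needed between tests
```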

Some numbers for ~700 tests.

  • Truncation: SQLite - 34s, PG - 76s.
  • Transaction: SQLite - 17s, PG - 18s.

2x speed increase for SQLite. 4x speed increase for PG.

Answered by Craig Ringer

First, always use the latest version of PostgreSQL. Performance improvements are always coming, so you're probably wasting your time if you're tuning an old version. For example, PostgreSQL 9.2 significantly improves the speed of TRUNCATE and of course adds index-only scans. Even minor releases should always be followed; see the version policy.

Don'ts

Do NOT put a tablespace on a RAM disk or other non-durable storage.

If you lose a tablespace the whole database may be damaged and hard to use without significant work. There's very little advantage to this compared to just using UNLOGGED tables and having lots of RAM for cache anyway.

If you truly want a ramdisk-based system, initdb a whole new cluster on the ramdisk, so you have a completely disposable PostgreSQL instance.

PostgreSQL server configuration

When testing, you can configure your server for non-durable but faster operation.

This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.

Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If and only if you're turning fsync off, you can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.

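A minimal sketch, assuming PostgreSQL 9.4+ for ALTER SYSTEM (on older versions, set the same values in postgresql.conf); both settings take effect on a config reload:

```sql
-- Cluster-wide and catastrophic for durability; never use on data you care about.
ALTER SYSTEM SET fsync = off;
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();  -- a reload is enough; no restart required
```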

For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data-corruption risk. You do have a small window of loss of recent data if you enable async commit - but that's it.

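Unlike fsync, synchronous_commit can also be changed per session or per transaction, so you can limit the risk window to where it's acceptable (the commit_delay value below is only an illustrative guess):

```sql
SET synchronous_commit = off;  -- this session only; risks losing the latest commits, not corruption
-- Or cluster-wide, with a grouping window for concurrent commits:
-- ALTER SYSTEM SET synchronous_commit = off;
-- ALTER SYSTEM SET commit_delay = 10000;  -- microseconds
```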

If you have the option of slightly altering the DDL, you can also use UNLOGGED tables in Pg 9.1+ to completely avoid WAL logging and gain a real speed boost, at the cost of the tables getting erased if the server crashes. There is no configuration option to make all tables unlogged; it must be set during CREATE TABLE. In addition to being good for testing, this is handy if you have tables full of generated or unimportant data in a database that otherwise contains stuff you need to be safe.

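A hypothetical example; the only DDL change is the UNLOGGED keyword:

```sql
-- No WAL is written for this table; its contents are truncated after a crash.
CREATE UNLOGGED TABLE session_state (
    session_id text PRIMARY KEY,
    payload    text,
    updated_at timestamptz NOT NULL DEFAULT now()
);
```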

Check your logs and see if you're getting warnings about too many checkpoints. If you are, you should increase your checkpoint_segments. You may also want to tune your checkpoint_completion_target to smooth writes out.

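A sketch with assumed values (on versions before 9.4, which lack ALTER SYSTEM, put these in postgresql.conf instead; 9.5+ replaced checkpoint_segments with max_wal_size):

```sql
ALTER SYSTEM SET log_checkpoints = on;                -- confirm checkpoint frequency first
ALTER SYSTEM SET checkpoint_segments = 32;            -- 9.4 and earlier
-- ALTER SYSTEM SET max_wal_size = '2GB';             -- the 9.5+ equivalent knob
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread writes over the interval
SELECT pg_reload_conf();
```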

Tune shared_buffers to fit your workload. This is OS-dependent, depends on what else is going on with your machine, and requires some trial and error. The defaults are extremely conservative. You may need to increase the OS's maximum shared memory limit if you increase shared_buffers on PostgreSQL 9.2 and below; 9.3 and above changed how shared memory is used to avoid that.

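For example (the size is purely illustrative; unlike the settings above, shared_buffers requires a full server restart, not a reload):

```sql
ALTER SYSTEM SET shared_buffers = '2GB';  -- then restart, e.g. pg_ctl restart -D <datadir>
```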

If you're using just a couple of connections that do lots of work, increase work_mem to give them more RAM to play with for sorts etc. Beware that too high a work_mem setting can cause out-of-memory problems, because it's per-sort, not per-connection, so one query can have many nested sorts. You only really have to increase work_mem if you can see sorts spilling to disk in EXPLAIN or logged with the log_temp_files setting (recommended), but a higher value may also let Pg pick smarter plans.

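Both halves of that advice as a sketch; the values are starting points to tune, not recommendations:

```sql
SET work_mem = '64MB';                -- per sort/hash node, so one query may use several multiples
ALTER SYSTEM SET log_temp_files = 0;  -- log every temp file, i.e. every on-disk spill
SELECT pg_reload_conf();
```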

As said by another poster here, it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions are pretty pointless; you really want separate drives. This separation has much less benefit if you're running with fsync=off, and almost none if you're using UNLOGGED tables.

Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, and so on. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.

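For instance (the orders table and the 250ms threshold are hypothetical):

```sql
-- BUFFERS adds block-I/O counts to the plan, exposing cache misses and spills.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- auto_explain logs the plan of any query slower than the threshold:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '250ms';
```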

AFAIK there's no way to set an entire database or cluster as UNLOGGED. It'd be interesting to be able to do so. Consider asking on the PostgreSQL mailing list.

Host OS tuning

There's some tuning you can do at the operating system level, too. The main thing you might want to do is convince the operating system not to flush writes to disk aggressively, since you really don't care when/if they make it to disk.

In Linux you can control this with the virtual memory subsystem's dirty_* settings, like dirty_writeback_centisecs.

The only issue with making the writeback settings too slack is that a flush by some other program may cause all of PostgreSQL's accumulated buffers to be flushed too, causing big stalls while everything blocks on writes. You may be able to alleviate this by running PostgreSQL on a different file system, but some flushes may be device-level or whole-host-level, not filesystem-level, so you can't rely on that.

This tuning really requires playing around with the settings to see what works best for your workload.

On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues on NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.

Query and workload tuning

These are things that DO require code changes; they may not suit you. Some are things you might be able to apply.

If you're not batching work into larger transactions, start. Lots of small transactions are expensive, so you should batch stuff whenever it's possible and practical to do so. If you're using async commit this is less important, but still highly recommended.

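The change itself is trivial; the win is paying the commit (WAL flush) cost once per batch instead of once per statement (table and values hypothetical):

```sql
BEGIN;
INSERT INTO events (payload) VALUES ('first');
INSERT INTO events (payload) VALUES ('second');
-- ...hundreds more small writes...
COMMIT;  -- one flush for the whole batch
```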

Whenever possible, use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection, the temp table goes away, and no other connection can see the contents of a session's temp table(s).

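A hypothetical staging pattern along those lines (the events table and CSV path are made up):

```sql
-- The temp table generates no WAL and disappears when the session ends.
CREATE TEMPORARY TABLE staging_events (LIKE events INCLUDING DEFAULTS);
COPY staging_events FROM '/tmp/events.csv' WITH (FORMAT csv);
UPDATE staging_events SET payload = trim(payload);  -- massage as needed
INSERT INTO events SELECT * FROM staging_events;    -- one pass into the real table
```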

If you're using PostgreSQL 9.1 or newer, you can use UNLOGGED tables for data you can afford to lose, like session state. These are visible across different sessions and preserved between connections. They get truncated if the server shuts down uncleanly, so they can't be used for anything you can't re-create, but they're great for caches, materialized views, state tables, etc.

In general, don't DELETE FROM blah;. Use TRUNCATE TABLE blah; instead; it's a lot quicker when you're dumping all rows in a table. Truncate many tables in one TRUNCATE call if you can. There's a caveat if you're doing lots of TRUNCATEs of small tables over and over again, though; see: Postgresql Truncation speed

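For example, with hypothetical tables:

```sql
-- One statement for several tables; CASCADE also truncates FK-dependent tables.
TRUNCATE TABLE orders, order_items, payments RESTART IDENTITY CASCADE;
```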

If you don't have indexes on foreign keys, DELETEs involving the primary keys referenced by those foreign keys will be horribly slow. Make sure to create such indexes if you ever expect to DELETE from the referenced table(s). Indexes are not required for TRUNCATE.

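A hypothetical illustration: if order_items.order_id references orders(id), each row deleted from orders must be checked against order_items, which without an index means a sequential scan per deleted row:

```sql
CREATE INDEX order_items_order_id_idx ON order_items (order_id);
-- DELETE FROM orders ... can now verify referencing rows via the index.
```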

Don't create indexes you don't need. Each index has a maintenance cost. Try to use a minimal set of indexes and let bitmap index scans combine them rather than maintaining too many huge, expensive multi-column indexes. Where indexes are required, try to populate the table first, then create indexes at the end.

Hardware

Having enough RAM to hold the entire database is a huge win if you can manage it.

If you don't have enough RAM, the faster the storage you can get, the better. Even a cheap SSD makes a massive difference over spinning rust. Don't trust cheap SSDs for production, though; they're often not crash-safe and might eat your data.

Learning

Greg Smith's book, PostgreSQL 9.0 High Performance, remains relevant despite referring to a somewhat older version. It should be a useful reference.

Join the PostgreSQL general mailing list and follow it.

Answered by mys

Use a different disk layout:

  • different disk for $PGDATA
  • different disk for $PGDATA/pg_xlog
  • different disk for temp files (per database $PGDATA/base//pgsql_tmp) (see the note about work_mem)

postgresql.conf tweaks:

  • shared_memory: 30% of available RAM, but not more than 6 to 8GB. It seems to be better to have less shared memory (2GB - 4GB) for write-intensive workloads
  • work_mem: mostly for SELECT queries with sorts/aggregations. This is a per-connection setting, and a query can allocate that value multiple times. If the data can't fit, disk is used (pgsql_tmp). Check "explain analyze" to see how much memory you need
  • fsync and synchronous_commit: the default values are safe, but if you can tolerate data loss you can turn them off
  • random_page_cost: if you have an SSD or a fast RAID array you can lower this to 2.0 (RAID) or even lower (1.1) for SSD
  • checkpoint_segments: you can go higher (32 or 64) and change checkpoint_completion_target to 0.9. A lower value allows faster after-crash recovery. (A consolidated sketch of these settings follows this list.)
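
A consolidated sketch of the list above with illustrative values (assumes 9.4+ for ALTER SYSTEM; "shared_memory" presumably means shared_buffers, which additionally needs a restart):

```sql
ALTER SYSTEM SET shared_buffers = '4GB';              -- ~30% of RAM, capped at 6-8GB
ALTER SYSTEM SET work_mem = '64MB';                   -- per sort, per connection
ALTER SYSTEM SET random_page_cost = 1.1;              -- SSD; ~2.0 for a fast RAID array
ALTER SYSTEM SET checkpoint_segments = 64;            -- 9.4 and earlier
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();  -- shared_buffers still requires a server restart
```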