以可能丢失数据为代价提高 PostgreSQL 写入速度?
声明:本页面是 StackOverFlow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/5131266/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Increase PostgreSQL write speed at the cost of likely data loss?
提问by Xeoncross
I love that PostgreSQL is crash resistant, as I don't want to spend time fixing a database. However, I'm sure there must be some things I can disable/modify so that inserts/updates will work faster even if I lose a couple records prior to a power-outage / crash. I'm not worried about a couple records - just the database as a whole.
我喜欢 PostgreSQL 的抗崩溃能力,因为我不想花时间去修复数据库。不过,我相信一定有一些可以禁用或调整的设置,让插入/更新更快,哪怕代价是断电/崩溃前的几条记录会丢失。我不在乎那几条记录,我在乎的是整个数据库。
I am trying to optimize PostgreSQL for large amounts of writes. It currently takes 22 minutes to insert 1 million rows which seems a bit slow.
我正在尝试针对大量写入优化 PostgreSQL。目前插入 100 万行需要 22 分钟,这似乎有点慢。
How can I speed up PostgreSQL writes?
如何加快 PostgreSQL 写入速度?
Some of the options I have looked into (like full_page_writes), seem to also run the risk of corrupting data which isn't something I want. I don't mind lost data - I just don't want corruption.
我研究过的一些选项(如 full_page_writes)似乎也有损坏数据的风险,这不是我想要的。我不介意丢数据,我只是不想让数据损坏。
Update 1
更新 1
Here is the table I am using - since most of the tables will contain ints and small strings, this "sample" table seems to be the best example of what I should expect.
这是我正在使用的表。由于大多数表都只包含整数和短字符串,这个“示例”表应该最能代表我预期的负载。
CREATE TABLE "user"
(
    id serial NOT NULL,
    username character varying(40),
    email character varying(70),
    website character varying(100),
    created integer,
    CONSTRAINT user_pkey PRIMARY KEY (id)
)
WITH ( OIDS=FALSE );

CREATE INDEX id ON "user" USING btree (id);
I have about 10 scripts each issuing 100,000 requests at a time using prepared statements. This is to simulate a real-life load my application will be giving the database. In my application each page has 1+ inserts.
我有大约 10 个脚本,每个脚本使用预处理语句(prepared statement)一次发出 100,000 个请求,用来模拟我的应用程序给数据库带来的真实负载。在我的应用程序中,每个页面至少有一次插入。
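The scripts presumably issue something roughly equivalent to the following server-side prepared statement (the statement name and values here are made up for illustration; a client driver's prepared-statement API amounts to the same thing):

PREPARE insert_user (varchar, varchar, varchar, integer) AS
    INSERT INTO "user" (username, email, website, created)
    VALUES ($1, $2, $3, $4);

EXECUTE insert_user('someuser', 'someuser@example.org', 'http://example.org', 1298837600);

In autocommit mode each EXECUTE is still its own transaction, and therefore its own commit, unless several of them are wrapped in an explicit BEGIN/COMMIT - which is what makes this workload commit-bound.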
Update 2
更新 2
I am using asynchronous commits already, because I have
我已经在使用异步提交了,因为我有
synchronous_commit = off
in the main configuration file.
在主配置文件中。
回答by Greg Smith
1M records inserted in 22 minutes works out to be 758 records/second. Each INSERT here is an individual commit to disk, with both write-ahead log and database components to it eventually. Normally I'd expect that even with good hardware, a battery-backed cache and everything, you will be lucky to hit 3000 commits/second. So you're not actually doing too badly if this is regular hardware without such write acceleration. The normal limit here is in the 500 to 1000 commits/second range in the situation you're in, without special tuning for this situation.
在 22 分钟内插入 100 万条记录,折合每秒约 758 条。这里的每个 INSERT 都是一次单独的磁盘提交,最终既要写预写日志(WAL)也要写数据文件。通常来说,即便是配备电池备份写缓存等一应俱全的好硬件,能达到每秒 3000 次提交就已经很幸运了。所以,如果这是没有这类写加速的普通硬件,你的成绩其实并不算差。在你这种场景下,不做针对性调优的话,正常上限大约是每秒 500 到 1000 次提交。
As for what that would look like, if you can't make the commits include more records each, your options for speeding this up include:
至于具体怎么做:如果你没法让每次提交包含更多记录,那么加速的选项包括:
- Turn off synchronous_commit (already done)
- Increase wal_writer_delay. When synchronous_commit is off, the database spools commits up to be written every 200ms. You can make that some number of seconds instead by tweaking this upwards; it just increases the size of data loss after a crash.
- Increase wal_buffers to 16MB, just to make that whole operation more efficient.
- Increase checkpoint_segments, to cut down on how often the regular data is written to disk. You probably want at least 64 here. Downsides are higher disk space use and longer recovery time after a crash.
- Increase shared_buffers. The default here is tiny, typically 32MB. You have to increase how much UNIX shared memory the system has to allocate. Once that's done, useful values are typically >1/4 of total RAM, up to 8GB. The rate of gain here falls off above 256MB, but the increase from the default to there can be really helpful.
- 关闭 synchronous_commit(已经完成)
- 增大 wal_writer_delay。当 synchronous_commit 关闭时,数据库会把提交先缓冲起来,每 200 毫秒写出一次。如果愿意,可以把它调大到几秒钟,代价只是崩溃后可能丢失的数据量更大。
- 把 wal_buffers 增大到 16MB,让整个写 WAL 的操作更高效。
- 增大 checkpoint_segments,减少常规数据写入磁盘(检查点)的频率。这里大概至少要设为 64。缺点是占用更多磁盘空间,崩溃后的恢复时间也更长。
- 增大 shared_buffers。默认值很小,通常只有 32MB。你需要先调大系统允许分配的 UNIX 共享内存。之后,有用的取值通常是总内存的 1/4 以上,最高到 8GB。超过 256MB 后收益递减,但从默认值提升到这个量级往往非常有帮助。
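As a rough sketch, those suggestions might end up looking something like this in postgresql.conf (the numbers are illustrative starting points only, not recommendations; note that checkpoint_segments was later replaced by max_wal_size in newer PostgreSQL releases):

# postgresql.conf - illustrative values, tune for your own hardware
synchronous_commit = off        # already done; commits are flushed to disk asynchronously
wal_writer_delay = 1000ms       # spool commits for up to ~1s; widens the data-loss window after a crash
wal_buffers = 16MB              # write WAL in larger, more efficient chunks
checkpoint_segments = 64        # fewer checkpoints; costs disk space and crash-recovery time
shared_buffers = 1GB            # needs enough UNIX shared memory (e.g. kernel.shmmax on older kernels)

Changing shared_buffers or wal_buffers requires a server restart; the other settings take effect on a configuration reload.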
That's pretty much it. Anything else you touched that might help could potentially cause data corruption in a crash; these are all completely safe.
差不多就是这样了。其他你可能去动的、看起来有帮助的设置,都有在崩溃时造成数据损坏的风险;而上面这些是完全安全的。
回答by MarkR
22 minutes for 1 million rows doesn't seem that slow, particularly if you have lots of indexes.
100 万行用 22 分钟,似乎并没有那么慢,尤其是当你有很多索引的时候。
How are you doing the inserts? I take it you're using batch inserts, not one-row-per-transaction.
你是怎么做插入的?我猜你用的是批量插入,而不是每个事务只插一行。
Does PG support some kind of bulk loading, like reading from a text file or supplying a stream of CSV data to it? If so, you'd probably be best advised to use that.
PG 是否支持某种批量加载方式,比如从文本文件读取,或者向它提供 CSV 数据流?如果支持,建议你优先使用。
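It does: PostgreSQL's COPY command covers exactly this, as later answers note. A minimal sketch against the question's table (the file path and column list are illustrative; the older WITH CSV spelling is used so it also works on pre-9.0 servers):

-- server-side bulk load; the file must be readable by the postgres server process
COPY "user" (username, email, website, created)
    FROM '/tmp/users.csv' WITH CSV;

From psql, the client-side \copy variant does the same thing with a file that lives on the client machine.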
Please post the code you're using to load the 1M records, and people will advise.
请发布您用来加载 1M 记录的代码,人们会提供建议。
Please post:
请发帖:
- CREATE TABLE statement for the table you're loading into
- Code you are using to load in
- small example of the data (if possible)
- 你要加载数据的那张表的 CREATE TABLE 语句
- 您用于加载的代码
- 数据的小例子(如果可能)
EDIT: It seems the OP isn't interested in bulk-inserts, but is doing a performance test for many single-row inserts. I will assume that each insert is in its own transaction.
编辑:看起来 OP 并不是要做批量插入,而是在对大量单行插入做性能测试。我假设每条插入都在各自独立的事务中。
- Consider batching the inserts on the client-side, per-node, writing them into a temporary file (hopefully durably / robustly) and having a daemon or some periodic process which asynchronously does a batch insert of outstanding records, in reasonable sized batches.
- This per-device batching mechanism really does give the best performance, in my experience, in audit-data like data-warehouse applications where the data don't need to go into the database just now. It also gives the application resilience against the database being unavailable.
- Of course you will normally have several endpoint devices creating audit-records (for example, telephone switches, mail relays, web application servers), each must have its own instance of this mechanism which is fully independent.
- This is a really "clever" optimisation which introduces a lot of complexity into the app design and has a lot of places where bugs could happen. Do not implement it unless you are really sureyou need it.
- 考虑在客户端按节点对插入做批处理:先把记录写入一个临时文件(最好能做到持久/可靠),再由一个守护进程或某个周期性任务,把积压的记录按合理大小的批次异步地批量插入。
- 根据我的经验,在审计类数据(比如数据仓库应用)这种数据不需要立刻进入数据库的场景下,这种按设备批处理的机制确实能带来最好的性能,还能让应用在数据库不可用时具备一定的弹性。
- 当然,通常会有多个端点设备在产生审计记录(例如电话交换机、邮件中继、Web 应用服务器),每个设备都必须有自己完全独立的一套这种机制。
- 这是一个非常“聪明”的优化,它会给应用设计引入大量复杂性,也有很多可能出错的地方。除非你真的确定需要,否则不要实现它。
回答by user2772867
I think the problem can't be solved by dealing with the server only.
我认为仅通过处理服务器无法解决问题。
I found PostgreSQL can commit 3000+ rows per second, and neither server nor client was busy, but the time still went by. In contrast SQL Server can reach 5000+ rows per second, and Oracle is even faster: it can reach 12000+ per second, with about 20 fields in a row.
我发现 PostgreSQL 每秒可以提交 3000 多行,服务器和客户端都不忙,但时间就是这么耗掉了。相比之下,SQL Server 每秒可以达到 5000 多行,Oracle 更快,每秒可以达到 12000 多行(每行大约 20 个字段)。
I guess the roundtrip is the problem: Send a row to server, and receive the reply from the server. Both SQL Server and Oracle support batch operations: send more than one row in a function call and wait for the reply.
我猜问题出在往返上:发送一行到服务器,再等服务器回复。SQL Server 和 Oracle 都支持批量操作:在一次调用里发送多行,然后等待回复。
Many years ago I worked with Oracle: Trying to improve the write performance using OCI, I read documents and found too many round trips will decrease performance. Finally I solved it by using batch operations: send 128 or more rows to the server in a batch and wait for the reply. It reached 12000+ rows per second. If you do not use batches and send all rows individually (including wait), it reached only about 2000 rows per second.
多年前我用过 Oracle:为了用 OCI 提高写入性能,我查阅文档,发现过多的往返会降低性能。最后我用批量操作解决了这个问题:一批发送 128 行或更多到服务器,然后等待回复,这样达到了每秒 12000 多行。如果不用批量、逐行发送并等待,每秒只能达到大约 2000 行。
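PostgreSQL can do the same kind of batching at the SQL level with a multi-row INSERT (supported since 8.2), which sends many rows in one round trip and one commit. A sketch against the question's table, with made-up values:

BEGIN;
INSERT INTO "user" (username, email, website, created) VALUES
    ('alice', 'alice@example.org', 'http://example.org', 100),
    ('bob',   'bob@example.org',   'http://example.org', 101),
    ('carol', 'carol@example.org', 'http://example.org', 102);
COMMIT;

Client drivers' batch APIs, or COPY as other answers suggest, achieve the same effect of amortising the round trip and the commit over many rows.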
回答by Mike Sherrill 'Cat Recall'
Well, you don't give us much to go on. But it sounds like you're looking for asynchronous commits.
嗯,你没有给我们提供太多可以参考的信息。不过听起来你要找的是异步提交。
Don't overlook a hardware upgrade--faster hardware usually means a faster database.
不要忽视硬件升级——更快的硬件通常意味着更快的数据库。
回答by a_horse_with_no_name
You should also increase checkpoint_segments (e.g. to 32 or even higher) and most probably wal_buffers as well.
你还应该增大 checkpoint_segments(例如增加到 32 或更高),而且很可能也应该增大 wal_buffers。
Edit:
if this is a bulk load, you should use COPY to insert the rows. It is much faster than plain INSERTs.
编辑:
如果这是批量加载,则应使用 COPY 插入行。它比普通的 INSERT 快得多。
If you need to use INSERT, did you consider using batching (for JDBC) or multi-row inserts?
如果必须使用 INSERT,你有没有考虑过使用批处理(比如 JDBC 的 batch)或多行插入?
回答by Dave Turner
1M commits in 22 minutes seems reasonable, even with synchronous_commit = off, but if you can avoid the need to commit on each insert then you can get a lot faster than that. I just tried inserting 1M (identical) rows into your example table from 10 concurrent writers, using the bulk-insert COPY command:
22 分钟完成 100 万次提交看起来是合理的,即便开了 synchronous_commit = off;但如果能避免每次插入都提交,速度还可以快得多。我刚刚试着用批量插入的 COPY 命令,从 10 个并发写入进程往你的示例表里插入 100 万行(内容相同的)数据:
$ head -n3 users.txt | cat -A # the rest of the file is just this another 99997 times
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
$ wc -l users.txt
100000 users.txt
$ time (seq 10 | xargs --max-procs=10 -n 1 bash -c "cat users.txt | psql insertspeed -c 'COPY \"user\" (username, email, website, created) FROM STDIN WITH (FORMAT text);'")
real 0m10.589s
user 0m0.281s
sys 0m0.285s
$ psql insertspeed -Antc 'SELECT count(*) FROM "user"'
1000000
Clearly there's only 10 commits there, which isn't exactly what you're looking for, but that hopefully gives you some kind of indication of the speed that might be possible by batching your inserts together. This is on a VirtualBox VM running Linux on a fairly bog-standard Windows desktop host, so not exactly the highest-performance hardware possible.
显然那里只有 10 次提交,这和你要的场景不完全一样,但希望它能让你大致感受到,把插入批量化之后可能达到的速度。这是在一台相当普通的 Windows 桌面主机上、跑着 Linux 的 VirtualBox 虚拟机里测的,所以绝不是什么高性能硬件。
To give some less toy figures, we have a service running in production which has a single thread that streams data to Postgres via a COPYcommand similar to the above. It ends a batch and commits after a certain number of rows or if the transaction reaches a certain age (whichever comes first). It can sustain 11,000 inserts per second with a max latency of ~300ms by doing ~4 commits per second. If we tightened up the maximum permitted age of the transactions we'd get more commits per second which would reduce the latency but also the throughput. Again, this is not on terribly impressive hardware.
再给一些没那么“玩具化”的数字:我们有一个生产环境的服务,它用单个线程通过类似上面的 COPY 命令把数据流式写入 Postgres。当积累到一定行数、或者事务达到一定存活时间(以先到者为准)时,它就结束当前批次并提交。它可以维持每秒 11,000 次插入,最大延迟约 300 毫秒,每秒大约提交 4 次。如果收紧事务允许的最长存活时间,每秒提交次数会更多,延迟会降低,但吞吐量也会下降。同样,这也不是什么特别厉害的硬件。
Based on that experience, I'd strongly recommend trying to use COPYrather than INSERT, and trying to reduce the number of commits as far as possible while still achieving your latency target.
基于这些经验,我强烈建议尝试用 COPY 而不是 INSERT,并在仍能满足延迟目标的前提下,尽可能减少提交次数。
回答by Dave Turner
Well one thing you could do to speed things up is drop the index you are creating manually - the primary key constraint already auto-creates a unique index on that column, as you can see below (I'm testing on 8.3):
好吧,要加快速度,你可以做的一件事就是删掉你手动创建的那个索引:primary key 约束已经在该列上自动创建了一个唯一索引,如下所示(我是在 8.3 上测试的):
postgres=> CREATE TABLE "user"
postgres-> (
postgres(> id serial NOT NULL,
postgres(> username character varying(40),
postgres(> email character varying(70),
postgres(> website character varying(100),
postgres(> created integer,
postgres(> CONSTRAINT user_pkey PRIMARY KEY (id)
postgres(> )
postgres-> WITH ( OIDS=FALSE );
NOTICE: CREATE TABLE will create implicit sequence "user_id_seq" for serial column "user.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "user_pkey" for table "user"
CREATE TABLE
postgres=> CREATE INDEX id ON "user" USING btree (id);
CREATE INDEX
postgres=> \d user
Table "stack.user"
Column | Type | Modifiers
----------+------------------------+---------------------------------------------------
id | integer | not null default nextval('user_id_seq'::regclass)
username | character varying(40) |
email | character varying(70) |
website | character varying(100) |
created | integer |
Indexes:
"user_pkey" PRIMARY KEY, btree (id)
"id" btree (id)
Also, consider changing wal_sync_method to an option that uses O_DIRECT - this is not the default on Linux
另外,可以考虑把 wal_sync_method 改为一个使用 O_DIRECT 的选项,这不是 Linux 上的默认设置。
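Concretely, dropping the redundant index is just:

DROP INDEX id;

(the implicit user_pkey index created by the primary key remains). For the second suggestion, a guess at what the setting might look like in postgresql.conf - open_sync and open_datasync are the wal_sync_method options that can use O_DIRECT, and whether that actually helps depends on the platform and on WAL archiving being off, so benchmark it before relying on it:

wal_sync_method = open_sync    # the default on Linux is fdatasync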
回答by champomy
One possibility would be to use the keyword DEFERRABLE to defer constraints, because constraints are checked for every line.
一种做法是使用 DEFERRABLE 关键字来推迟约束检查,因为约束默认会对每一行都进行检查。
So the idea would be to ask postgresql to check constraints just before you commit.
所以思路就是让 postgresql 把约束检查推迟到你提交之前才做。
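Note that only some constraint types can be deferred in PostgreSQL (foreign keys, plus unique and primary key constraints from 9.0 onwards; NOT NULL and CHECK are always checked immediately). A sketch with a hypothetical deferrable unique constraint on email:

-- the constraint name user_email_key is made up for illustration
ALTER TABLE "user"
    ADD CONSTRAINT user_email_key UNIQUE (email) DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS user_email_key DEFERRED;
-- ... many INSERTs ...
COMMIT;   -- the deferred constraint is checked here, at commit time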