How do I fix a PostgreSQL 9.3 Slave that Cannot Keep Up with the Master?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/28201475/
Asked by Oleg Dulin
We have a master-slave replication configuration as follows.
On the master:
postgresql.conf has replication configured as follows (commented lines taken out for brevity):
max_wal_senders = 1
wal_keep_segments = 8
On the slave:
Same postgresql.conf as on the master. recovery.conf looks like this:
standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'
When this was initially setup, we performed some simple tests and confirmed the replication was working. However, when we did the initial data load, only some of the data made it to the slave.
Slave's log is now filled with messages that look like this:
< 2015-01-23 23:59:47.241 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:47.241 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:52.259 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:52.260 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:57.270 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:57.270 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
After some analysis and help on the #postgresql IRC channel, I've come to the conclusion that the slave cannot keep up with the master. My proposed solution is as follows.
On the master:
- Set max_wal_senders=5
- Set wal_keep_segments=4000. Yes, I know it is very high, but I'd like to monitor the situation and see what happens. I have room on the master.
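To verify whether the slave is actually keeping up after these changes, a quick check like the following can be used (a sketch, assuming PostgreSQL 9.3 function and column names; run on the master):

SELECT application_name, client_addr, state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

If replay_lag_bytes keeps growing, the slave is still falling behind no matter how many segments are kept.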
On the slave:
- Save the configuration files in the data directory (i.e. pg_hba.conf, pg_ident.conf, postgresql.conf, recovery.conf)
- Clear out the data directory (rm -rf /var/lib/pgsql/9.3/data/*). This seems to be required by pg_basebackup.
- Run the following command:
pg_basebackup -h master -D /var/lib/pgsql/9.3/data --username=replication --password
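Put together, the rebuild might look roughly like this (a sketch only; the service name and the --xlog-method flag are assumptions based on the RHEL-style paths above, so adjust for your environment):

# on the slave, after saving the configuration files somewhere safe
service postgresql-9.3 stop
rm -rf /var/lib/pgsql/9.3/data/*
pg_basebackup -h master -D /var/lib/pgsql/9.3/data \
    --username=replication --password \
    --xlog-method=stream --progress    # stream WAL during the copy so the new base backup is self-contained
# copy recovery.conf (and any saved local config) back into the data directory, then:
service postgresql-9.3 start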
Am I missing anything? Is there a better way to bring the slave up-to-date without having to reload all the data?
Any help is greatly appreciated.
Answered by Ben Grimm
The two important options for dealing with the WAL for streaming replication:
- wal_keep_segments should be set high enough to allow a slave to catch up after a reasonable lag (e.g. high update volume, slave being offline, etc.).
- archive_mode enables WAL archiving, which can be used to recover files older than wal_keep_segments provides. The slave servers simply need a method to retrieve the WAL segments. NFS is the simplest method, but anything from scp to http to tapes will work so long as it can be scripted.

# on master
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'

# on slave
restore_command = 'cp /path_to/archive/%f "%p"'

When the slave can't pull the WAL segment directly from the master, it will attempt to use the restore_command to load it. You can configure the slave to automatically remove segments using the archive_cleanup_command setting.
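For example, the slave's recovery.conf could both read and prune the archive (a sketch, assuming the pg_archivecleanup contrib binary is installed and the same /path_to/archive as above):

restore_command = 'cp /path_to/archive/%f "%p"'
archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'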
If the slave comes to a situation where the next WAL segment it needs is missing from both the master and the archive, there will be no way to consistently recover the database. The only reasonable option then is to scrub the server and start again from a fresh pg_basebackup.
Answered by linehrr
Actually, to recover you don't have to drop the whole DB and start from scratch. Since the master has the up-to-date binary data, you can do the following to recover the slave and bring it back in sync:
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude=*pg_xlog* <data_dir> slave_IP_address:<data_dir>
psql -c "select pg_stop_backup();"
Note:
1. the slave has to be shut down with service stop
2. the master will turn to read-only due to the pg_start_backup query
3. the master can continue serving read-only queries
4. bring back the slave at the end of the steps
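After bringing the slave back, streaming can be confirmed with a quick check like this (a sketch using the 9.3 function names; run on the slave):

psql -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

Both locations should keep advancing and stay close to pg_current_xlog_location() on the master.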
I did this in prod; it works perfectly for me. The slave and master are in sync and there is no data loss.
Answered by Ryabchenko Alexander
You can configure replication slots for Postgres to keep the WAL segments needed by the replica registered in that slot.
Read more at https://www.percona.com/blog/2018/11/30/postgresql-streaming-physical-replication-with-slots/
On the master server run:
SELECT pg_create_physical_replication_slot('standby_slot');
On the slave server add the following line to recovery.conf:
primary_slot_name = 'standby_slot'
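Note that physical replication slots were only introduced in PostgreSQL 9.4, so this applies to 9.4 or later rather than 9.3 itself. Once the slot is in use, its state and the WAL position it retains can be checked with something like (a sketch, assuming 9.4+ catalog names):

SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;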
Answered by Sarvar Nishonboev
You will get that error if the wal_keep_segments setting is too low. When you set the value for wal_keep_segments, consider: "How long does the pg_basebackup take?"
Remember that segments are generated about every 5 minutes, so if the backup takes an hour, you need at least 12 segments saved. At 2 hours, you need 24, etc. I would set the value to about 12.2 segments/hour of backup.
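To estimate your own rate, you can sample the WAL position twice and divide by the interval (a sketch using 9.3 function names; the F/52000000 value is just an example taken from the log above):

select pg_current_xlog_location();   -- note the value, e.g. F/52000000
-- wait a known interval, then:
select pg_xlog_location_diff(pg_current_xlog_location(), 'F/52000000') / 16777216 as segments_generated;   -- 16 MB per segment by default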
Answered by Oleg Dulin
As Ben Grimm suggested in the comments, this is a question of making sure to set segments to the maximum possible value to allow the slave to catch up.