How do I fix a PostgreSQL 9.3 Slave that Cannot Keep Up with the Master?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/28201475/
Asked by Oleg Dulin
We have a master-slave replication configuration as follows.
On the master:
postgresql.conf has replication configured as follows (commented lines taken out for brevity):
max_wal_senders = 1
wal_keep_segments = 8
On the slave:
Same postgresql.conf as on the master. recovery.conf looks like this:
standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'
When this was initially setup, we performed some simple tests and confirmed the replication was working. However, when we did the initial data load, only some of the data made it to the slave.
Slave's log is now filled with messages that look like this:
< 2015-01-23 23:59:47.241 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:47.241 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:52.259 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:52.260 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:57.270 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:57.270 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
After some analysis and help on the #postgresql IRC channel, I've come to the conclusion that the slave cannot keep up with the master. My proposed solution is as follows.
On the master:
- Set max_wal_senders=5
- Set wal_keep_segments=4000. Yes, I know it is very high, but I'd like to monitor the situation and see what happens. I have room on the master.
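To verify whether the slave is actually keeping up after these changes, a quick check like the following can be used (a sketch, assuming PostgreSQL 9.3 function and column names; run on the master):

SELECT application_name, client_addr, state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

If replay_lag_bytes keeps growing, the slave is still falling behind no matter how many segments are kept.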
On the slave:
- Save the configuration files in the data directory (i.e. pg_hba.conf, pg_ident.conf, postgresql.conf, recovery.conf)
- Clear out the data directory (rm -rf /var/lib/pgsql/9.3/data/*). This seems to be required by pg_basebackup.
- Run the following command:
pg_basebackup -h master -D /var/lib/pgsql/9.3/data --username=replication --password
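Put together, the rebuild might look roughly like this (a sketch only; the service name and the --xlog-method flag are assumptions based on the RHEL-style paths above, so adjust for your environment):

# on the slave, after saving the configuration files somewhere safe
service postgresql-9.3 stop
rm -rf /var/lib/pgsql/9.3/data/*
pg_basebackup -h master -D /var/lib/pgsql/9.3/data \
    --username=replication --password \
    --xlog-method=stream --progress    # stream WAL during the copy so the new base backup is self-contained
# copy recovery.conf (and any saved local config) back into the data directory, then:
service postgresql-9.3 start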
Am I missing anything? Is there a better way to bring the slave up-to-date without having to reload all the data?
Any help is greatly appreciated.
Answered by Ben Grimm
The two important options for dealing with the WAL for streaming replication:
- wal_keep_segments should be set high enough to allow a slave to catch up after a reasonable lag (e.g. high update volume, slave being offline, etc.).
- archive_mode enables WAL archiving, which can be used to recover files older than wal_keep_segments provides. The slave servers simply need a method to retrieve the WAL segments. NFS is the simplest method, but anything from scp to http to tapes will work so long as it can be scripted.

# on master
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'

# on slave
restore_command = 'cp /path_to/archive/%f "%p"'

When the slave can't pull the WAL segment directly from the master, it will attempt to use the restore_command to load it. You can configure the slave to automatically remove segments using the archive_cleanup_command setting.
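For example, the slave's recovery.conf could both read and prune the archive (a sketch, assuming the pg_archivecleanup contrib binary is installed and the same /path_to/archive as above):

restore_command = 'cp /path_to/archive/%f "%p"'
archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'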
If the slave comes to a situation where the next WAL segment it needs is missing from both the master and the archive, there will be no way to consistently recover the database. The only reasonable option then is to scrub the server and start again from a fresh pg_basebackup.
Answered by linehrr
Actually, to recover you don't have to drop the whole DB and start from scratch. Since the master has the up-to-date binary data, you can do the following to recover the slave and bring it back in sync:
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude=*pg_xlog* <data_dir> slave_IP_address:<data_dir>
psql -c "select pg_stop_backup();"
Note:
1. the slave has to be shut down with service stop
2. the master will turn to read-only due to the pg_start_backup query
3. the master can continue serving read-only queries
4. bring back the slave at the end of the steps
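After bringing the slave back, streaming can be confirmed with a quick check like this (a sketch using the 9.3 function names; run on the slave):

psql -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

Both locations should keep advancing and stay close to pg_current_xlog_location() on the master.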
I did this in prod; it works perfectly for me. The slave and master are in sync and there is no data loss.
Answered by Ryabchenko Alexander
You can configure replication slots for Postgres to keep the WAL segments needed by the replica registered in that slot.
Read more at https://www.percona.com/blog/2018/11/30/postgresql-streaming-physical-replication-with-slots/
On the master server run:
SELECT pg_create_physical_replication_slot('standby_slot');
On the slave server add the following line to recovery.conf:
primary_slot_name = 'standby_slot'
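Note that physical replication slots were only introduced in PostgreSQL 9.4, so this applies to 9.4 or later rather than 9.3 itself. Once the slot is in use, its state and the WAL position it retains can be checked with something like (a sketch, assuming 9.4+ catalog names):

SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;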
Answered by Sarvar Nishonboev
You will get that error if the wal_keep_segments setting is too low. When you set the value for wal_keep_segments, consider: "How long does the pg_basebackup take?"
Remember that segments are generated about every 5 minutes, so if the backup takes an hour, you need at least 12 segments saved. At 2 hours, you need 24, etc. I would set the value to about 12.2 segments/hour of backup.
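To estimate your own rate, you can sample the WAL position twice and divide by the interval (a sketch using 9.3 function names; the F/52000000 value is just an example taken from the log above):

select pg_current_xlog_location();   -- note the value, e.g. F/52000000
-- wait a known interval, then:
select pg_xlog_location_diff(pg_current_xlog_location(), 'F/52000000') / 16777216 as segments_generated;   -- 16 MB per segment by default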
Answered by Oleg Dulin
As Ben Grimm suggested in the comments, this is a question of making sure to set segments to the maximum possible value to allow the slave to catch up.