您的位置: 首页> 数据库

PostgreSQL repmgr 高可用之故障转移

匿名上传

发布时间:2025-09-28 09:00:03

PostgreSQL高可用之repmgr自动切换

之前写过一个repmgr的高可用搭建的，https://www.cnblogs.com/wy123/p/18531710，repmgr的搭建过程还是比较简单的，具体过程不再赘述。这里为了简化，做了1主2从的结构，之前一直没空测试repmgr的手动和自动故障转移，抽空找了个环境，做了个repmgr的故障转移测试。

环境：

ubuntu05:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）
ubuntu06:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）
ubuntu07:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）

1，ubuntu05，ubuntu06，ubuntu07是一个repmgr集群，ubuntu05为主节点，其他两个为从节点
2，强制关闭ubuntu05上的PostgreSQL服务
3，repmgr完整自动故障转移，自动提升ubuntu06为这点

repmgr配置

repmgr的配置文件repmgr.conf

node_id=2node_name='ubuntu06'conninfo='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'data_directory='/usr/local/pgsql16/pg9000/data'pg_bindir='/usr/local/pgsql16/server/bin'priority=80#自动故障转移配置failover=automaticpromote_command='/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file'follow_command='/usr/local/pgsql16/server/bin/repmgr standby follow -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'log_file='/usr/local/pgsql16/repmgr/repmgr.log'#要启用 repmgrd 守护进程和监控，需在 repmgr.conf中启用 moitoring_history=yesmonitoring_history=true#默认监控时间间隔为2秒monitor_interval_secs=5#故障转移之前，尝试重新连接主库次数（默认为6）参数reconnect_attempts=12#每间隔5s尝试重新连接一次参数reconnect_interval=5

repmgrd的systemd服务启动脚本，设置repmgrd自动启动

[Unit]Description=PostgreSQL Replication Manager DaemonAfter=network.target postgresql9000.serviceRequires=postgresql9000.service[Service]Type=forkingUser=postgresGroup=postgresExecStart=/usr/local/pgsql16/server/bin/repmgrd -f /usr/local/pgsql16/repmgr/repmgr.conf --pid-file /usr/local/pgsql16/repmgr/repmgrd.pidExecStop=/bin/kill -QUIT $MAINPIDPIDFile=/usr/local/pgsql16/repmgr/repmgrd.pidRestart=alwaysRestartSec=5# 环境变量（如果需要）Environment=PATH=/usr/local/pgsql16/server/bin:/usr/local/bin:/usr/bin:/bin[Install]WantedBy=multi-user.target

手动切换主从

repmgr的前置条件是需要节点之间ssh互信，1，手动故障转移，哪个从节点需要提升为主节点，就在哪个节点上执行：    /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow    --siblings-follow  表示所有从库的同步源自动改成最新的主库节点    switchover的内部流程如下：    1.关闭当前的主库 ubuntu06    2.等待老主库彻底关闭后，在 ubuntu05 上进行 pg_promote()    3.重启启动老主库 ubuntu06， 降级成 standby 数据库， 指向复制源 ubuntu05    4.sibling nodes兄弟节点同样进行了复制源重定向，指向 ubuntu05    5.整个switchover 过程结束        在当前节点Ubuntu04查看集群状态	repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show	postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show	 ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------	 1  | ubuntu05 | standby |   running | ubuntu06 | default  | 80       | 2        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100	 2  | ubuntu06 | primary | * running |          | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100	 3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100	postgres@ubuntu05:~$	postgres@ubuntu05:~$	postgres@ubuntu05:~$        执行switchover    postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow	NOTICE: executing switchover on node "ubuntu05" (ID: 1)	NOTICE: attempting to pause repmgrd on 3 nodes	NOTICE: local node "ubuntu05" (ID: 1) will be promoted to primary; current primary "ubuntu06" (ID: 2) will be demoted to standby	NOTICE: stopping current primary node "ubuntu06" (ID: 2)	NOTICE: issuing CHECKPOINT on node "ubuntu06" (ID: 2)	DETAIL: executing server command "/usr/local/pgsql16/server/bin/pg_ctl  -D '/usr/local/pgsql16/pg9000/data' -W -m fast stop"	INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")	INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")	INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")	INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")	INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")	INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")	NOTICE: current primary has been cleanly shut down at location 0/18000028	NOTICE: promoting standby to primary	DETAIL: promoting server "ubuntu05" (ID: 1) using pg_promote()	NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete	NOTICE: STANDBY PROMOTE successful	DETAIL: server "ubuntu05" (ID: 1) was successfully promoted to primary	NOTICE: node "ubuntu05" (ID: 1) promoted to primary, node "ubuntu06" (ID: 2) demoted to standby	NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings	INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes	NOTICE: switchover was successful	DETAIL: node "ubuntu05" is now primary and node "ubuntu06" is attached as standby	NOTICE: STANDBY SWITCHOVER has completed successfully	postgres@ubuntu05:~$	postgres@ubuntu05:~$    	postgres@ubuntu05:~$	postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show	 ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------	 1  | ubuntu05 | primary | * running |          | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100	 2  | ubuntu06 | standby |   running | ubuntu05 | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100	 3  | ubuntu07 | standby |   running | ubuntu05 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100	postgres@ubuntu05:~$

手动故障转移

1，kill或者停止主节点服务来模拟主节点故障    systemctl stop postgresql9000		2，从节点上查看集群状态，此时原始主节点已不可达	postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show		ID | Name     | Role    | Status        | Upstream   | Location | Priority | Timeline | Connection string	----+----------+---------+---------------+------------+----------+----------+----------+------------------------------------------------------------------------------		1  | ubuntu05 | primary | ? unreachable | ?          | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100		2  | ubuntu06 | standby |   running     | ? ubuntu05 | default  | 80       | 3        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100		3  | ubuntu07 | standby |   running     | ? ubuntu05 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100	WARNING: following issues were detected		- unable to connect to node "ubuntu05" (ID: 1)		- node "ubuntu05" (ID: 1) is registered as an active primary but is unreachable		- unable to connect to node "ubuntu06" (ID: 2)'s upstream node "ubuntu05" (ID: 1)		- unable to determine if node "ubuntu06" (ID: 2) is attached to its upstream node "ubuntu05" (ID: 1)		- unable to connect to node "ubuntu07" (ID: 3)'s upstream node "ubuntu05" (ID: 1)		- unable to determine if node "ubuntu07" (ID: 3) is attached to its upstream node "ubuntu05" (ID: 1)	HINT: execute with --verbose option to see connection error messages	postgres@ubuntu06:~$		3，手动 promote 把 ubuntu06 提升为主库	/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow	检查集群状态，此时Ubuntu06已经成为主节点，原主库 pg02 被标记为 failed 的状态			postgres@ubuntu06:~$	postgres@ubuntu06:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow	NOTICE: promoting standby to primary	DETAIL: promoting server "ubuntu06" (ID: 2) using pg_promote()	NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete	NOTICE: STANDBY PROMOTE successful	DETAIL: server "ubuntu06" (ID: 2) was successfully promoted to primary	NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings	INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes	postgres@ubuntu06:~$	postgres@ubuntu06:~$###检查集群状态，此时Ubuntu06已经成为主节点	postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show		ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string	----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------		1  | ubuntu05 | primary | - failed  | ?        | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100		2  | ubuntu06 | primary | * running |          | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100		3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100			WARNING: following issues were detected		- unable to connect to node "ubuntu05" (ID: 1)			HINT: execute with --verbose option to see connection error messages	postgres@ubuntu06:~$			4，老主库重新加入集群    4.1 启动老主库		root@ubuntu05:~# systemctl start postgresql9000		root@ubuntu05:~#		root@ubuntu05:~# su - postgres		postgres@ubuntu05:~$		postgres@ubuntu05:~$		postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show		 ID | Name     | Role    | Status               | Upstream   | Location | Priority | Timeline | Connection string		----+----------+---------+----------------------+------------+----------+----------+----------+------------------------------------------------------------------------------		 1  | ubuntu05 | primary | * running            |            | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100		 2  | ubuntu06 | standby | ! running as primary |            | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100		 3  | ubuntu07 | standby |   running            | ! ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100				WARNING: following issues were detected		  - node "ubuntu06" (ID: 2) is registered as standby but running as primary		  - node "ubuntu07" (ID: 3) reports a different upstream (reported: "ubuntu06", expected "ubuntu05")				postgres@ubuntu05:~$			4.2 执行pg_rewind		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run		 		postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run		NOTICE: rejoin target is node "ubuntu06" (ID: 2)		INFO: replication connection to the rejoin target node was successful		INFO: local and rejoin target system identifiers match		DETAIL: system identifier is 7550951818891860956		NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2		DETAIL: rejoin target server s timeline 4 forked off current database system timeline 3 before current recovery point 0/1B000028		INFO: prerequisites for using pg_rewind are met		INFO: pg_rewind would now be executed		DETAIL: pg_rewind command is:		  /usr/local/pgsql16/server/bin/pg_rewind -D '/usr/local/pgsql16/pg9000/data' --source-server='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'		INFO: prerequisites for executing NODE REJOIN are met		postgres@ubuntu05:~$		postgres@ubuntu05:~$				或者简单粗暴，直接删除本地的数据，重新克隆				克隆数据库 		/usr/local/pgsql16/server/bin/repmgr -h 192.168.152.112 -p 9000 -U repmgr -d repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby clone --dry-run		直接启动数据库服务即可		--取消注册，实际上是从nodes表中删除数据		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby unregister		--重新注册，重新将repmgr.conf中的配置加载到nodes表中		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register						--强制注册force，实际上就是覆盖现有的配置		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register --force		--指定主节点，一般不用指定，直接会根据postgresql.auto.conf找到主节点		/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register  --upstream-node-id=2								对正常节点重新注册，目的是修改配置之后，重新注册会，达到重新加载的功能，从节点(pg02，pg03)进行重新注册操作		$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister		$ repmgr -f /home/postgres/repmgr/repmgr.conf standby register --upstream-node-id=1

自动故障转移

强制关闭主节点Ubuntu05上的PostgreSQL服务模拟故障

自动故障转移过程如下：

repmgr的转移过程日志，可以看到repmgr会根据上面配置文件的重试间隔reconnect_interval和重试参数reconnect_attempts，一直重试，如果最终主节点不可达，开始故障转移，整个过程为1分钟

[2025-09-18 13:24:00] [INFO] monitoring connection to upstream node "ubuntu05" (ID: 1)[2025-09-18 13:26:26] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state[2025-09-18 13:26:26] [DETAIL] last monitoring statistics update was 5 seconds ago[2025-09-18 13:29:01] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state[2025-09-18 13:29:01] [DETAIL] last monitoring statistics update was 5 seconds ago***************************************************这里开始模拟主节点故障，从节点开始重试*************************************************************************[2025-09-18 13:30:01] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"[2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:01] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)[2025-09-18 13:30:01] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts[2025-09-18 13:30:01] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:01] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:02] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"[2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:02] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)[2025-09-18 13:30:02] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts[2025-09-18 13:30:02] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:02] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:06] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts[2025-09-18 13:30:06] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:06] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:07] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts[2025-09-18 13:30:07] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:07] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:11] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts[2025-09-18 13:30:11] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:11] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:12] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts[2025-09-18 13:30:12] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:12] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:12] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:16] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts[2025-09-18 13:30:16] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:16] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:17] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts[2025-09-18 13:30:17] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:17] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts[2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts[2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts[2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts[2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts[2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts[2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts[2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts[2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts[2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts[2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts[2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts[2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts[2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts[2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt[2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts[2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts[2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered[2025-09-18 13:30:57] [INFO] 3 total nodes registered[2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")[2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0[2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago[2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago[2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)[2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds[2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)[2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes[2025-09-18 13:30:57] [INFO] promote_command is:  "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"[2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"[2025-09-18 13:30:57] [WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified[2025-09-18 13:30:57] [DETAIL] these nodes will remain attached to the current primary:  ubuntu07 (node ID: 3)[2025-09-18 13:30:57] [NOTICE] promoting standby to primary[2025-09-18 13:30:57] [DETAIL] promoting server "ubuntu06" (ID: 2) using pg_promote()[2025-09-18 13:30:57] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete[2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts[2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts[2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered[2025-09-18 13:30:57] [INFO] 3 total nodes registered[2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")[2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0[2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago[2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago[2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)[2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)[2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds[2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)[2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes[2025-09-18 13:30:57] [INFO] promote_command is:  "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"[2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"[2025-09-18 13:30:57] [ERROR] STANDBY PROMOTE can only be executed on a standby node[2025-09-18 13:30:57] [ERROR] promote command failed[2025-09-18 13:30:57] [DETAIL] promote command exited with error code 8[2025-09-18 13:30:57] [INFO] checking if original primary node has reappeared[2025-09-18 13:30:57] [ERROR] connection to database failed[2025-09-18 13:30:57] [DETAIL] connection to server at "192.168.152.111", port 9000 failed: Connection refused	Is the server running on that host and accepting TCP/IP connections?[2025-09-18 13:30:57] [DETAIL] attempted to connect using:  user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr options=-csearch_path=[2025-09-18 13:30:57] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"[2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"[2025-09-18 13:30:57] [NOTICE] local node is primary, checking local node state[2025-09-18 13:30:57] [NOTICE] resuming monitoring as primary node after 0 seconds[2025-09-18 13:30:57] [INFO] 1 followers to notify[2025-09-18 13:30:57] [INFO] reconnecting to node "ubuntu07" (ID: 3)...[2025-09-18 13:30:57] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2INFO:  node 3 received notification to follow node 2[2025-09-18 13:30:57] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)[2025-09-18 13:30:58] [NOTICE] STANDBY PROMOTE successful[2025-09-18 13:30:58] [DETAIL] server "ubuntu06" (ID: 2) was successfully promoted to primary[2025-09-18 13:30:58] [INFO] checking state of node 2, 1 of 12 attempts[2025-09-18 13:30:58] [NOTICE] node 2 has recovered, reconnecting[2025-09-18 13:30:58] [INFO] connection to node 2 succeeded[2025-09-18 13:30:58] [INFO] original connection is still available[2025-09-18 13:30:58] [INFO] 1 followers to notify[2025-09-18 13:30:58] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2INFO:  node 3 received notification to follow node 2[2025-09-18 13:30:58] [INFO] switching to primary monitoring mode[2025-09-18 13:30:58] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)[2025-09-18 13:30:58] [INFO] child node "ubuntu07" (ID: 3) is attached[2025-09-18 13:31:02] [NOTICE] new standby "ubuntu07" (ID: 3) has connected[2025-09-18 13:35:57] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:35:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:45:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:45:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:50:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state[2025-09-18 13:50:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state

repmgr的优缺点总结

repmgr在高可用方案上，勉强能用吧。
优点是安装配置都比较简单，
缺点是没办法做到连续自动故障转移，第一次转移完成后，故障节点想拉起来，还是要先做手动pg_rewind。
repmgr把元数据保存在本地的PostgreSQL数据库中，数据库启动之前repmgr进程不知道集群状态，所以不可能自动rewind，这也就是用PostgreSQL自身保存集群元数据的缺陷，也算是跟partoni的差距吧。