奠基免安装绿色中文版
6.43G · 2025-10-11
PostgreSQL pg_auto_failover 高可用 1:pg_auto_failover集群搭建
PostgreSQL pg_auto_failover 高可用 2:pg_auto_failover集群运维
通过pg_autoctl get formation settings查看集群参数设置
root@ubuntu11:~# pg_autoctl get formation settings Context | Name | Setting | Value----------+----------+---------------------------+------------------------------------formation | default | number_sync_standbys | 0 primary | ubuntu12 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_60)' node | ubuntu13 | candidate priority | 50 node | ubuntu12 | candidate priority | 50 node | ubuntu13 | replication quorum | true node | ubuntu12 | replication quorum | true
参数说明
number_sync_standbys:同步备库的数量
number_sync_standbys表示设置强同步节点的数量,如果为0的话,表示当从节点故障之后,主节点依旧可以支持读写,否则,如果可用的备用节点数量小于number_sync_standbys设置的个数的时候,主节点的写操作会被挂起。
number_sync_standbys的设置
pg_autoctl set formation number_sync_standbys 1
Replication quorum
此参数可设置为 true或 false,默认情况下为 true,表示有可能成为同步备库。如果设置为 false,表示使用异步复制。
Candidate Priority
该参数表示候选备库优先级,可以设为0 到100 间的任意值,缺省值是50。优先级越大,越容易备选择成为主库。如果设置为0,则不会选择该节点为新主库。
当备库候选优先级相同时,Monitor节点将选择LSN位置最高的的备库。如果各个备库的LSN位置相同,则随机选择一个备库。
实际测试中到一个有意思的问题,主备节点都正常的情况下
1,从pg_auto_failover的维度看,synchronous_standby_names 为 'ANY 1 (pgautofailover_standby_60)'
2,从postgresql实例的维度看,synchronous_commit = on,synchronous_standby_names=ANY 1 (pgautofailover_standby_60)
这意味着主备节点是同步复制。
但是当关闭备节点来模拟备节点故障的时候,发现synchronous_standby_names为'',此时再从主节点查询synchronous_standby_names,其值也变为'',意味着自动降级为异步复制。
select name,setting from pg_settings where name like '%synchronous_commit%';name |setting|------------------+-------+synchronous_commit|on |select name,setting from pg_settings where name like '%synchronous_standby_names%' ;name |setting |-------------------------+---------------------------------+synchronous_standby_names|ANY 1 (pgautofailover_standby_60)|
如果想设置为强同步模式,通过pg_autoctl set formation number-sync-standbys 1设置,这时会发现,一主一备的模式是不支持的
通过重启主节点的pgautofailover服务来模拟故障,发现会备用节点自动切换为主节点,原主节点启动后作为备用节点加入集群。
root@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 9: 0/5039798 | read-only | secondary | secondaryubuntu12 | 68 | ubuntu12:9300 | 9: 0/5039798 | read-write | primary | primaryroot@ubuntu11:~#root@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 10: 0/5039960 | read-write | wait_primary | wait_primaryubuntu12 | 68 | ubuntu12:9300 | 9: 0/5039810 | none | demoted | catchinguproot@ubuntu11:~#root@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 10: 0/5039A10 | read-write | primary | primaryubuntu12 | 68 | ubuntu12:9300 | 10: 0/5039A10 | read-only | secondary | secondaryroot@ubuntu11:~#
在待提升为主节点的备节点上执行pg_autoctl perform switchover,可以看到当前节点提升为主节点
root@ubuntu12:~# pg_autoctl perform switchover05:12:23 213044 INFO Targetting group 0 in formation "default"05:12:23 213044 INFO Listening monitor notifications about state changes in formation "default" and group 005:12:23 213044 INFO Following table displays times when notifications are received Time | Name | Node | Host:Port | Current State | Assigned State---------+----------+-------+---------------+---------------------+--------------------05:12:23 | ubuntu13 | 60 | ubuntu13:9300 | primary | draining05:12:23 | ubuntu12 | 68 | ubuntu12:9300 | secondary | prepare_promotion05:12:23 | ubuntu12 | 68 | ubuntu12:9300 | prepare_promotion | prepare_promotion05:12:23 | ubuntu12 | 68 | ubuntu12:9300 | prepare_promotion | stop_replication05:12:23 | ubuntu13 | 60 | ubuntu13:9300 | primary | demote_timeout05:12:23 | ubuntu13 | 60 | ubuntu13:9300 | draining | demote_timeout05:12:23 | ubuntu13 | 60 | ubuntu13:9300 | demote_timeout | demote_timeout05:12:24 | ubuntu12 | 68 | ubuntu12:9300 | stop_replication | stop_replication05:12:24 | ubuntu12 | 68 | ubuntu12:9300 | stop_replication | wait_primary05:12:24 | ubuntu13 | 60 | ubuntu13:9300 | demote_timeout | demoted05:12:24 | ubuntu13 | 60 | ubuntu13:9300 | demoted | demoted05:12:24 | ubuntu12 | 68 | ubuntu12:9300 | wait_primary | wait_primary05:12:24 | ubuntu13 | 60 | ubuntu13:9300 | demoted | catchingup05:12:25 | ubuntu13 | 60 | ubuntu13:9300 | catchingup | catchingup05:12:26 | ubuntu13 | 60 | ubuntu13:9300 | catchingup | secondary05:12:26 | ubuntu13 | 60 | ubuntu13:9300 | secondary | secondary05:12:26 | ubuntu12 | 68 | ubuntu12:9300 | wait_primary | primary05:12:26 | ubuntu12 | 68 | ubuntu12:9300 | primary | primaryroot@ubuntu12:~#
从monitor节点也可以看到主备身份发生了变化
root@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 11: 0/5039CC0 | read-only | secondary | secondaryubuntu12 | 68 | ubuntu12:9300 | 11: 0/5039CC0 | read-write | primary | primary
pg_autoctl perform failover需要在monitor节点上执行,其结果会将备用节点提升为主节点。
root@ubuntu11:~# pg_autoctl perform failover05:15:57 774844 INFO Waiting 60 secs for a notification with state "primary" in formation "default" and group 005:15:57 774844 INFO Listening monitor notifications about state changes in formation "default" and group 005:15:57 774844 INFO Following table displays times when notifications are received Time | Name | Node | Host:Port | Current State | Assigned State---------+----------+-------+---------------+---------------------+--------------------05:15:57 | ubuntu12 | 68 | ubuntu12:9300 | primary | draining05:15:57 | ubuntu13 | 60 | ubuntu13:9300 | secondary | prepare_promotion05:15:57 | ubuntu13 | 60 | ubuntu13:9300 | prepare_promotion | prepare_promotion05:15:57 | ubuntu13 | 60 | ubuntu13:9300 | prepare_promotion | stop_replication05:15:57 | ubuntu12 | 68 | ubuntu12:9300 | primary | demote_timeout05:15:57 | ubuntu12 | 68 | ubuntu12:9300 | draining | demote_timeout05:15:57 | ubuntu12 | 68 | ubuntu12:9300 | demote_timeout | demote_timeout05:15:58 | ubuntu13 | 60 | ubuntu13:9300 | stop_replication | stop_replication05:15:58 | ubuntu13 | 60 | ubuntu13:9300 | stop_replication | wait_primary05:15:58 | ubuntu12 | 68 | ubuntu12:9300 | demote_timeout | demoted05:15:58 | ubuntu12 | 68 | ubuntu12:9300 | demoted | demoted05:15:58 | ubuntu13 | 60 | ubuntu13:9300 | wait_primary | wait_primary05:15:58 | ubuntu12 | 68 | ubuntu12:9300 | demoted | catchingup05:15:59 | ubuntu12 | 68 | ubuntu12:9300 | catchingup | catchingup05:16:00 | ubuntu12 | 68 | ubuntu12:9300 | catchingup | secondary05:16:00 | ubuntu12 | 68 | ubuntu12:9300 | secondary | secondary05:16:00 | ubuntu13 | 60 | ubuntu13:9300 | wait_primary | primary05:16:00 | ubuntu13 | 60 | ubuntu13:9300 | primary | primaryroot@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 12: 0/503A0E8 | read-write | primary | primaryubuntu12 | 68 | ubuntu12:9300 | 12: 0/503A0E8 | read-only | secondary | secondaryroot@ubuntu11:~#root@ubuntu11:~#
这里初始化新的机器ubuntu14,IP地址为192.168.152.124。此时主机列表如下,依次修改前三台主机的hosts映射文件,增加ubuntu14的地址与主机名。
192.168.152.121 ubuntu11
192.168.152.122 ubuntu12
192.168.152.123 ubuntu13
192.168.152.124 ubuntu14
使用pg_autoctl 创建注册到monitor节点。
/usr/local/pgsql16/server/bin/pg_autoctl create postgres --hostname ubuntu14 --name ubuntu14 --pgdata /usr/local/pgsql16/pg9300/data/ --pgport 9300 --auth trust --ssl-self-signed --monitor 'postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require'
root@ubuntu14:~# su - postgrespostgres@ubuntu14:~$postgres@ubuntu14:~$postgres@ubuntu14:~$ /usr/local/pgsql16/server/bin/pg_autoctl create postgres --hostname ubuntu14 --name ubuntu14 --pgdata /usr/local/pgsql16/pg9300/data/ --pgport 9300 --auth trust --ssl-self-signed --monitor 'postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require'06:09:25 2643 INFO Using default --ssl-mode "require"06:09:25 2643 INFO Using --ssl-self-signed: pg_autoctl will create self-signed certificates, allowing for encrypted network traffic06:09:25 2643 WARN Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks.06:09:25 2643 WARN See https://www.postgresql.org/docs/current/libpq-ssl.html for details06:09:25 2643 INFO Started pg_autoctl postgres service with pid 264506:09:25 2643 INFO Started pg_autoctl node-init service with pid 264606:09:25 2645 INFO /usr/local/pgsql16/server/bin/pg_autoctl do service postgres --pgdata /usr/local/pgsql16/pg9300/data/ -v06:09:25 2646 INFO Registered node 75 "ubuntu14" (ubuntu14:9300) in formation "default", group 0, state "wait_standby"06:09:25 2646 INFO Writing keeper state file at "/home/postgres/.local/share/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.state"06:09:25 2646 INFO Writing keeper init state file at "/home/postgres/.local/share/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.init"06:09:25 2646 INFO Successfully registered as "wait_standby" to the monitor.06:09:25 2646 INFO FSM transition from "init" to "wait_standby": Start following a primary06:09:25 2646 INFO Transition complete: current state is now "wait_standby"06:09:26 2646 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby06:09:26 2646 INFO Initialising PostgreSQL as a hot standby06:09:26 2646 INFO /usr/local/pgsql16/server/bin/pg_basebackup -w -d 'application_name=pgautofailover_standby_75 host=ubuntu13 port=9300 user=pgautofailover_replicator sslmode=require' --pgdata /usr/local/pgsql16/pg9300/backup/node_75 -U pgautofailover_replicator --verbose --progress --max-rate 100M --wal-method=stream --slot pgautofailover_standby_7506:09:26 2646 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete06:09:26 2646 INFO pg_basebackup: checkpoint completed06:09:26 2646 INFO pg_basebackup: write-ahead log start point: 0/6000028 on timeline 1206:09:26 2646 INFO pg_basebackup: starting background WAL receiver06:09:26 2646 INFO 22759/22759 kB (100%), 0/1 tablespace (...backup/node_75/global/pg_control)06:09:26 2646 INFO 22759/22759 kB (100%), 1/1 tablespace06:09:26 2646 INFO pg_basebackup:06:09:26 2646 INFO06:09:26 2646 INFO write-ahead log end point: 0/600010006:09:26 2646 INFO pg_basebackup:06:09:26 2646 INFO06:09:26 2646 INFO waiting for background process to finish streaming ...06:09:26 2646 INFO pg_basebackup:06:09:26 2646 INFO06:09:26 2646 INFO syncing data to disk ...06:09:27 2646 INFO pg_basebackup:06:09:27 2646 INFO06:09:27 2646 INFO renaming backup_manifest.tmp to backup_manifest06:09:27 2646 INFO pg_basebackup:06:09:27 2646 INFO06:09:27 2646 INFO base backup completed06:09:27 2646 INFO Creating the standby signal file at "/usr/local/pgsql16/pg9300/data/standby.signal", and replication setup at "/usr/local/pgsql16/pg9300/data/postgresql-auto-failover-standby.conf"06:09:27 2646 INFO Contents of "/usr/local/pgsql16/pg9300/data/postgresql-auto-failover-standby.conf" have changed, overwriting06:09:27 2646 INFO /usr/bin/openssl req -new -x509 -days 365 -nodes -text -out /usr/local/pgsql16/pg9300/data/server.crt -keyout /usr/local/pgsql16/pg9300/data/server.key -subj "/CN=ubuntu14"06:09:27 2668 INFO /usr/local/pgsql16/server/bin/postgres -D /usr/local/pgsql16/pg9300/data -p 9300 -h *06:09:27 2646 INFO PostgreSQL started on port 930006:09:27 2646 INFO Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.06:09:27 2646 INFO Ensuring HBA rules for node 60 "ubuntu13" (ubuntu13:9300)06:09:27 2646 INFO Ensuring HBA rules for node 68 "ubuntu12" (ubuntu12:9300)06:09:27 2646 INFO Reloading Postgres configuration and HBA rules06:09:27 2645 INFO Postgres is now serving PGDATA "/usr/local/pgsql16/pg9300/data" on port 9300 with pid 266806:09:27 2646 INFO Transition complete: current state is now "catchingup"06:09:27 2646 INFO keeper has been successfully initialized.06:09:27 2643 WARN pg_autoctl service node-init exited with exit status 006:09:27 2645 INFO Postgres controller service received signal SIGTERM, terminating06:09:27 2645 INFO Stopping pg_autoctl postgres service06:09:27 2645 INFO /usr/local/pgsql16/server/bin/pg_ctl --pgdata /usr/local/pgsql16/pg9300/data --wait stop --mode fast06:09:27 2643 INFO Stop pg_autoctlpostgres@ubuntu14:~$
然后创建systemctl服务
root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 12: 0/7000060 | read-write | primary | primaryubuntu12 | 68 | ubuntu12:9300 | 12: 0/7000060 | read-only | secondary | secondaryubuntu14 | 75 | ubuntu14:9300 | 12: 0/7000060 | read-only | secondary | secondaryroot@ubuntu11:~#root@ubuntu11:~# pg_autoctl show state Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State---------+-------+---------------+----------------+--------------+---------------------+--------------------ubuntu13 | 60 | ubuntu13:9300 | 12: 0/7000060 | read-write | primary | primaryubuntu12 | 68 | ubuntu12:9300 | 12: 0/7000060 | read-only | secondary | secondaryubuntu14 | 75 | ubuntu14:9300 | 12: 0/7000060 | read-only | secondary | secondaryroot@ubuntu11:~#root@ubuntu11:~#
与此同时,pg_auto_failover会自动将集群升级为强同步模式(number_sync_standbys=1)
root@ubuntu11:~# pg_autoctl get formation settings Context | Name | Setting | Value----------+----------+---------------------------+---------------------------------------------------------------formation | default | number_sync_standbys | 1 primary | ubuntu13 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_68, pgautofailover_standby_75)' node | ubuntu13 | candidate priority | 50 node | ubuntu12 | candidate priority | 50 node | ubuntu14 | candidate priority | 50 node | ubuntu13 | replication quorum | true node | ubuntu12 | replication quorum | true node | ubuntu14 | replication quorum | trueroot@ubuntu11:~#
如果不希望新节点的candidate-priority为默认的50,可以降低其candidate-priority优先级
在ubuntu14自身节点上执行pg_autoctl set node candidate-priority --name ubuntu14 10 --monitor 'postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require'
root@ubuntu14:~# pg_autoctl set node candidate-priority --name ubuntu14 10 --monitor 'postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require'06:26:00 6614 INFO Waiting for the settings to have been applied to the monitor and primary node06:26:00 6614 INFO New state is reported by node 60 "ubuntu13" (ubuntu13:9300): "apply_settings"06:26:00 6614 INFO Setting goal state of node 60 "ubuntu13" (ubuntu13:9300) to primary after it applied replication properties change.06:26:01 6614 INFO New state is reported by node 60 "ubuntu13" (ubuntu13:9300): "primary"10root@ubuntu14:~#
然后从monitor上查看
###candidate priority修改前root@ubuntu11:~# pg_autoctl get formation settings Context | Name | Setting | Value----------+----------+---------------------------+---------------------------------------------------------------formation | default | number_sync_standbys | 1 primary | ubuntu13 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_68, pgautofailover_standby_75)' node | ubuntu13 | candidate priority | 50 node | ubuntu12 | candidate priority | 50 node | ubuntu14 | candidate priority | 50 node | ubuntu13 | replication quorum | true node | ubuntu12 | replication quorum | true node | ubuntu14 | replication quorum | true###candidate priority修改后root@ubuntu11:~#root@ubuntu11:~# pg_autoctl get formation settings Context | Name | Setting | Value----------+----------+---------------------------+---------------------------------------------------------------formation | default | number_sync_standbys | 1 primary | ubuntu13 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_68, pgautofailover_standby_75)' node | ubuntu13 | candidate priority | 50 node | ubuntu12 | candidate priority | 50 node | ubuntu14 | candidate priority | 10 node | ubuntu13 | replication quorum | true node | ubuntu12 | replication quorum | true node | ubuntu14 | replication quorum | true
尝试将ubuntu14节点的quorum设置为false时报错,应该是条件不满足
root@ubuntu14:~# pg_autoctl set node replication-quorum --name ubuntu14 false --pgdata /usr/local/pgsql16/pg9300/data/ --monitor 'postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require'06:30:53 7313 WARN Given --monitor URI, the --pgdata option is ignored06:30:53 7313 INFO Connecting to monitor at "postgres://autoctl_node@ubuntu11:9300/pg_auto_failover?sslmode=require"06:30:53 7313 ERROR Monitor ERROR: can't set replication quorum to false06:30:53 7313 ERROR Monitor DETAIL: At least 2 standby nodes are required in formation default with number_sync_standbys = 1, and only 1 would be participating in the replication quorum06:30:53 7313 ERROR SQL query: SELECT pgautofailover.set_node_replication_quorum($1, $2, $3)06:30:53 7313 ERROR SQL params: 'default', 'ubuntu14', 'false'06:30:53 7313 ERROR Failed to update node replication quorum on node "ubuntu14"in formation "default" for replication_quorum: "false"06:30:53 7313 ERROR Failed to set "replication-quorum" to "false".root@ubuntu14:~#
配置文件查看pg_autoctl show file
monitor节点
root@ubuntu11:~# pg_autoctl show file File | Path--------+---------------- Config | /root/.config/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.cfg Pid | /run/user/0/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.pidroot@ubuntu11:~#
数据节点
root@ubuntu12:~#root@ubuntu12:~# pg_autoctl show file File | Path--------+---------------- Config | /root/.config/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.cfg State | /root/.local/share/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.state Init | /root/.local/share/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.init Pid | /run/user/0/pg_autoctl/usr/local/pgsql16/pg9300/data/pg_autoctl.pid
从monitor中删除节点
pg_autoctl drop node --hostname ubuntu13
清除.local中的文件
pg_autoctl drop node --destory