Handling Pacemaker resources stuck in a blocked (multiple-active) state after recovering from a network failure

Environment

Pacemaker 1.1.20-5.el7_7.2.x86_64
Corosync 1.1.20-5.el7_7.2-3c4c782f70
Pgpool-II 4.0.7
PostgreSQL 9.6

Corosync, Pgpool config

# corosync
totem {
    join: 5000
    token: 100000
}

# pgpool
health_check_period = 300
health_check_max_retries = 3
health_check_retry_delay = 3
search_primary_node_timeout = 1

Pacemaker resource setting

pcs resource defaults resource-stickiness="INFINITY" migration-threshold="1"
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs property set symmetric-cluster=false
# Split-brain countermeasure for network disconnection
pcs resource meta <pg_cluster_name> multiple-active=block

pcs resource create ping ocf:pacemaker:ping name="default_ping_set" host_list="***" \
    dampen="5" timeout="45" attempts="3" multiplier="100" use_fping="0" \
    op start interval="0" timeout="90" on-fail="restart" \
    op monitor interval="10" timeout="60" on-fail="restart" start-delay="30" \
    op stop interval="0" timeout="100" on-fail="block" --clone

pcs resource create pgpool lsb:pgpool
pcs resource create resource-hoge lsb:resource-hoge

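The layout is:

AZ-a: primary
AZ-c: secondary (streaming replication)

Pgpool runs on the primary, and Pacemaker starts, stops, and monitors the processes.

See the following figure for details.

Scenario

A failure on the cloud provider's infrastructure disrupts the network; the connection is lost temporarily and then recovers.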

Timeline

With AZ-a as primary, cut its network to simulate the failure.

Pacemaker switches over to AZ-c, and Pgpool starts on AZ-c.

Pgpool on AZ-c health-checks the backend PostgreSQL on AZ-a, and fails over to AZ-c.

# AZ-a
20:54:23 $ ifdown eth0 && sleep 900 && ifup eth0 # network down for 900 sec

# AZ-c
20:54:59 pglog terminating walreceiver due to timeout
20:56:37 syslog LOG: pgpool-II successfully started. version 4.0.7 (torokiboshi)
20:56:41 syslog LOG: execute command: failover
20:59:14 pglog received promote request
20:41:06 pglog redo done at 9E2/F0000000
20:41:06 pglog last completed transaction was at log time 2022-01-01 20:53:28
21:05:35 pglog database system is ready to accept connections
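
To read intervals off a timeline like this, a small helper can convert the HH:MM:SS stamps to seconds. This is only a sketch: the function names to_seconds and elapsed are made up here, and it assumes two-digit fields and that both stamps fall on the same day.

```shell
# Convert a two-digit HH:MM:SS stamp to seconds since midnight.
# ${x#0} strips one leading zero so "08" is not parsed as octal.
to_seconds() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds from the first stamp to the second (same day assumed).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

For example, elapsed 20:54:23 20:56:37 gives 134 (network cut to Pgpool start on c), and elapsed 20:54:23 21:05:35 gives 672 (network cut to PostgreSQL on c accepting connections).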

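Even after PostgreSQL is promoted, the role reported by pgpool running on c remains standby. UPDATEs through pgpool still work in this state.

Note: to switch the standby to primary in pgpool, run $ pcp_promote_node -w -h *** -p 9898 -U pgpool -n 1.

After 900 sec the network on a comes back, and the resources end up active on both a and c (multiple-active).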

$ sudo pcs status
Cluster name: hoge-001
Stack: corosync
Current DC: hoge-c001 (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum

Online: [ hoge-a001 hoge-c001 ]

Full list of resources:

 Clone Set: ping-clone [ping]
     Started: [ hoge-c001 ]
     Stopped: [ hoge-a001 ]
 Resource Group: hoge-001
     pgpool-001     (lsb:pgpool-001):   Started (blocked)[ hoge-a001 hoge-c001 ]
     resource-hoge        (lsb:resource-hoge):       Started (blocked)[ hoge-a001 hoge-c001 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: inactive/disabled
  pcsd: active/enabled
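
To spot this state from a script, a minimal sketch like the following can pull the blocked resource names out of pcs status output. The function name blocked_resources is made up for illustration, and the pattern assumes resource lines formatted like the output above (lsb:, ocf:, or systemd: agents).

```shell
# Print the name of every resource that pcs status marks as blocked.
# Reads the status text on stdin so it can also be run on saved output.
blocked_resources() {
    # A resource line looks like:
    #   pgpool-001 (lsb:pgpool-001): Started (blocked)[ hoge-a001 hoge-c001 ]
    awk '/\(lsb:|\(ocf:|\(systemd:/ && /blocked/ { print $1 }'
}
```

Usage would be something like sudo pcs status | blocked_resources; empty output means nothing is blocked.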

After the failover, the pgpool processes are of course still running on a.

postgres  7231  0.0  0.0 195376  1724 ?        S    20:54   0:00 pgpool: wait for connection request
postgres  7232  0.0  0.0 195380  1292 ?        S    20:54   0:00 pgpool: PCP: wait for connection request
postgres  7233  0.0  0.0 195376  1484 ?        S    20:54   0:00 pgpool: worker process
postgres 16607  0.0  0.0 195376  1720 ?        S    21:16   0:00 pgpool: wait for connection request
postgres 21110  0.0  0.0 195376  1720 ?        S    21:22   0:00 pgpool: wait for connection request

At this point the corosync log on c reports an error: the resource is active on both nodes, and Pacemaker is waiting for an administrator to act.

pengine:     info: native_color:       Unmanaged resource pgpool-001 allocated to hoge-c001: active
pengine:     info: native_color:       Unmanaged resource resource-hoge allocated to hoge-a001: active
pengine:    error: native_create_actions:      Resource resource-hoge is active on 2 nodes (waiting for an administrator)
pengine:   notice: native_create_actions:      See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
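
The same condition can be picked up from the log. The following is a sketch only: the function name too_active_resources is hypothetical, and it matches the message wording shown above, which may differ in other Pacemaker versions.

```shell
# Print "<resource> <node count>" for each "is active on N nodes" log line.
# Reads log text on stdin.
too_active_resources() {
    sed -n 's/.*Resource \([^ ]*\) is active on \([0-9][0-9]*\) nodes.*/\1 \2/p'
}
```

Assuming the log is at /var/log/cluster/corosync.log (adjust for your environment), it could be used as too_active_resources < /var/log/cluster/corosync.log | sort -u.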

In the blocked state, pcs cluster stop --all --force takes a very long time to finish, roughly 20 min.

Even if stopping pacemaker on a clears the blocked state on c, the resources become blocked again as soon as pacemaker on a is brought back.

The resources have not failed, so naturally there is no failcount, and a cleanup changes nothing.

$ sudo pcs resource failcount show
No failcounts

Killing the processes

Running sudo pkill -f pgpool on a kills pgpool together with its child processes, and pacemaker on a goes down with it.

The pkill clears the blocked state for the pgpool resource only; resource-hoge stays blocked.

Killed processes come back

With $ pcs cluster kill, pacemaker ends up stopping the processes through systemd, so running pkill -f pacemakerd and pkill -f pgpool by hand gives the same result.

root     31256  0.0  0.0 171772 20496 ?        S    00:41   0:00 /usr/bin/python -Es /usr/sbin/pcs cluster stop --pacemaker --force
root     31259  0.0  0.0  24836  1288 ?        S    00:41   0:00 systemctl stop pacemaker
root     31510  0.0  0.0 171772 20496 ?        S    00:43   0:00 /usr/bin/python -Es /usr/sbin/pcs cluster stop --pacemaker --force
root     31513  0.0  0.0  24836  1284 ?        S    00:43   0:00 systemctl stop pacemaker

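Running $ sudo pcs cluster kill on a while blocked temporarily moves the resources to c alone, without the blocked flag. However, once pacemaker is started on a again, the resources on c go back to blocked.

It looks as if the cluster remembers that a was active.

According to the log, after pacemaker and corosync are killed with kill -9 they enter the failed state and systemd restarts them. This does not seem to be behavior that pacemaker intends.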

systemd: pacemaker.service: main process exited, code=killed, status=9/KILL
systemd: Unit pacemaker.service entered failed state.
systemd: pacemaker.service failed.
systemd: corosync.service: main process exited, code=killed, status=9/KILL
systemd: Unit corosync.service entered failed state.
systemd: corosync.service failed.
systemd: pacemaker.service holdoff time over, scheduling restart.
systemd: Starting Corosync Cluster Engine...

resource move

Tried pcs resource move, but a resource that is active in 2 locations apparently cannot be moved.

$ sudo pcs resource move resource-hoge (hoge-c001)
Error: error moving/banning/clearing resource
Resource 'resource-hoge' not moved: active in 2 locations.
To prevent 'resource-hoge' from running on a specific location, specify a node.
Error performing operation: Invalid argument

resource clear

Tried pcs resource clear, but nothing changed.

$ sudo pcs resource clear resource-hoge

Clone Set: ping-clone [ping]
     Started: [ hoge-a001 hoge-c001 ]
Resource Group: hoge-001
     pgpool-001     (lsb:pgpool-001):   Started hoge-c001 (blocked)
     resource-hoge        (lsb:resource-hoge):       Started (blocked)[ hoge-a001 hoge-c001 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

resource disable

Disabling, say, the pgpool resource on a changes the pgpool status to disabled (a) and blocked (c).

# pcs resource disable pgpool-001
Warning: 'pgpool-001' is unmanaged

# pcs status
 Resource Group: hoge-001
     pgpool-001     (lsb:pgpool-001):   Started (disabled, blocked)[ hoge-a001 hoge-c001 ]
     resource-hoge        (lsb:resource-hoge):       Started (blocked)[ hoge-a001 hoge-c001 ]

Even after the disable, the pgpool processes remain.

Enabling the resource on c from this state returns it to the original blocked state.

# pcs resource enable pgpool-001
Warning: 'pgpool-001' is unmanaged

 Clone Set: ping-clone [ping]
     Started: [ hoge-c001 ]
     Stopped: [ hoge-a001 ]
 Resource Group: hoge-001
     pgpool-001     (lsb:pgpool-001):   Started (blocked)[ hoge-a001 hoge-c001 ]
     resource-hoge        (lsb:resource-hoge):       Started (blocked)[ hoge-a001 hoge-c001 ]

Disabling resource-hoge on a also leaves it disabled and stopped on c.

$ sudo pcs resource disable resource-hoge
Warning: 'resource-hoge' is unmanaged

 Clone Set: ping-clone [ping]
     Started: [ hoge-a001 hoge-c001 ]
 Resource Group: hoge-001
     pgpool-001     (lsb:pgpool-001):   Started hoge-c001
     resource-hoge        (lsb:resource-hoge):       Stopped (disabled)

From this state, running $ sudo pcs cluster stop --force on a and then enabling the resource on c cleared the blocked state and restored a normal status.

pcs resource enable resource-hoge

However, disable -> enable requires stopping the resource either way. If downtime is unavoidable anyway, it seems better to run $ sudo pcs cluster stop --all --force, start pacemaker on c, and then start it on a.
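
The full-restart procedure can be sketched as a small script. The run wrapper, the DRY_RUN variable, and the function name recover_blocked_cluster are all hypothetical helpers introduced here; only the pcs commands themselves come from the steps in this post. By default it only prints the commands, so nothing is stopped by accident.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: node that should hold the resources (hoge-c001 in this post)
# $2: node that rejoins afterwards       (hoge-a001 in this post)
recover_blocked_cluster() {
    survivor=$1
    rejoiner=$2
    run pcs cluster stop --all --force   # took about 20 min while blocked
    run pcs cluster start "$survivor"
    run pcs status                       # check the resources by eye here
    run pcs cluster start "$rejoiner"
}
```

Calling recover_blocked_cluster hoge-c001 hoge-a001 prints the four commands in order. When run for real (DRY_RUN=0, as root), it is probably safer to execute the steps one at a time and confirm that the resources have started on c before starting a; the script does not wait or verify anything itself.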

Requests through Pgpool were served without problems even while blocked, so as long as the status is not failed, there is time to plan the response to a blocked state without rushing.
