Handling a ping_monitor failure in Pacemaker

Environment

  • pacemaker.x86_64 2.1.0-8.el8
  • corosync.x86_64 3.1.5-1.el8
  • pgpool-II-pg13.x86_64 4.2.4-1pgdg.rhel8
  • PostgreSQL 13

Issue

The ping monitor operation can time out for reasons such as name resolution becoming temporarily unavailable.

crm_mon -rfA
Migration Summary:
  * Node: db-a001:
    * ping: migration-threshold=1 fail-count=1 last-failure='Thu Oct  6 14:39:33 2020'

Failed Resource Actions:
  * ping_monitor_10000 on db-a001 'error' (1): call=32, status='Timed Out', exitreason='', last-rc-change='2020-10-06 14:39:33 +09:00', queued=0ms, exec=0ms
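
Since a temporary name-resolution failure is one suspected cause, a quick lookup check from the affected node can help confirm it. The following is only a sketch: check_resolution is a hypothetical helper, and the host names passed to it are placeholders for whatever the ping resource's host_list actually contains.

```shell
#!/bin/sh
# Hypothetical helper: report whether each host resolves and roughly how long
# the lookup took (whole seconds only, because date +%s has 1 s resolution).
check_resolution() {
    for host in "$@"; do
        start=$(date +%s)
        if getent hosts "$host" >/dev/null 2>&1; then
            result=ok
        else
            result=FAILED
        fi
        end=$(date +%s)
        echo "$host $result $((end - start))s"
    done
}
```

For example, check_resolution gateway.example.com (a placeholder name) prints one line per host. A lookup that fails, or that takes close to the monitor timeout, would be consistent with the 'Timed Out' status above.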

syslog

Oct  6 14:39:33 pacemaker-controld[1996]: error: Result of monitor operation for ping on db-a001: Timed Out
Oct  6 14:39:33 pacemaker-controld[1996]: notice: Transition 0 action 9 (ping_monitor_10000 on db-a001): expected 'ok' but got 'error'
Oct  6 14:39:33 pacemaker-controld[1996]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Oct  6 14:39:33 pacemaker-attrd[1994]: notice: Setting fail-count-ping#monitor_10000[db-a001]: (unset) -> 1
Oct  6 14:39:33 pacemaker-attrd[1994]: notice: Setting last-failure-ping#monitor_10000[db-a001]: (unset) -> 1665034773
Oct  6 14:39:33 pacemaker-schedulerd[1995]: warning: Unexpected result (error) was recorded for monitor of ping:0 on db-a001 at Oct  6 14:39:33 2020
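
The fail-count updates written by pacemaker-attrd can be filtered out of syslog to see which resource and node were affected. This is a minimal sketch that assumes the message format shown above; failcount_changes is a hypothetical helper name, and /var/log/messages in the usage note is an assumed log location.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print
# "<resource> <node> <new fail count>" for each fail-count update.
# Assumes the pacemaker-attrd message format shown above;
# last-failure lines and all other messages are ignored.
failcount_changes() {
    sed -n 's/.*Setting fail-count-\([^#]*\)#[^[]*\[\([^]]*\)]: .* -> \(.*\)$/\1 \2 \3/p'
}
```

Run it as failcount_changes < /var/log/messages. For the excerpt above it prints "ping db-a001 1".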

pacemaker log

Oct 06 14:39:33 pacemaker-schedulerd[1995] (pcmk__native_allocate)    info: Resource ping:1 cannot run anywhere
Oct 06 14:39:33 pacemaker-schedulerd[1995] (RecurringOp)      info:  Start recurring monitor (10s) for ping:0 on db-a001
Oct 06 14:39:33 pacemaker-schedulerd[1995] (log_list_item)    notice: Actions: Recover    ping:0              ( db-a001 )
Oct 06 14:39:34 pacemaker-schedulerd[1995] (pcmk__log_transition_summary)     notice: Calculated transition 24, saving inputs in /var/lib/pacemaker/pengine/pe-input-971.bz2
Oct 06 14:39:34 pacemaker-controld  [1996] (handle_response)  info: pe_calc calculation pe_calc-dc-1665034773-121 is obsolete
Oct 06 14:39:34 pacemaker-schedulerd[1995] (unpack_config)    notice: On loss of quorum: Ignore
Oct 06 14:39:34 pacemaker-schedulerd[1995] (unpack_rsc_op_failure)    warning: Unexpected result (error) was recorded for monitor of ping:0 on db-a001 at Oct  6 14:39:33 2020 | rc=1 id=ping_last_failure_0

Recovery

Cleaning up the ping resource is all that is needed.

pcs resource failcount show
Failcounts for resource 'ping'
  db-a001: 1

pcs resource cleanup ping
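
The check-then-cleanup sequence above could be wrapped in a small script so that cleanup only runs when a failure is actually recorded. This is a sketch, not a vetted tool: nodes_with_failures and cleanup_if_failed are hypothetical helper names, and the parser assumes the "pcs resource failcount show" output format shown above, which may differ between pcs versions.

```shell
#!/bin/sh
# Hypothetical helper: read "pcs resource failcount show" output on stdin
# and print the nodes whose fail count is greater than zero.
# Assumes lines of the form "  <node>: <count>", as in the output above.
nodes_with_failures() {
    awk 'NF == 2 && $1 ~ /:$/ && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 {
        sub(/:$/, "", $1)
        print $1
    }'
}

# Hypothetical helper: clean up the named resource only if at least one
# node reports a non-zero fail count for it.
cleanup_if_failed() {
    rsc=$1
    failed=$(pcs resource failcount show "$rsc" | nodes_with_failures | tr '\n' ' ')
    if [ -n "$failed" ]; then
        echo "failures recorded for $rsc on: $failed"
        pcs resource cleanup "$rsc"
    else
        echo "no failures recorded for $rsc"
    fi
}
```

Calling cleanup_if_failed ping on a cluster node then either runs the cleanup or reports that nothing needs to be done.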

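cleanup only clears the failed operations from the resource's history, whereas pcs resource refresh ping discards the resource's complete operation history, failures included, and re-detects its current state.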

With refresh --full, resources whose state is unknown are also included.