Current configuration

Role         Slave                    Master
Host         heplnv119.pp.rl.ac.uk    heplnv118.pp.rl.ac.uk
IP address   130.246.244.119          130.246.244.118
OS           CentOS7                  CentOS7
Ports open   5432                     5432

Public interface: http://cdb.mice.rl.ac.uk

cdbviewer : tomcat running on heplnv153
cdb : tomcat running on heplnm072


Old configuration

Role         Slave                        Standby                  To be Master
Host         configdb.micenet.rl.ac.uk    heplnv150.pp.rl.ac.uk    heplnm072.pp.rl.ac.uk
IP address   172.16.246.25                130.246.92.150           130.246.47.72
OS           SL5                          SL7                      SL6
Ports open   5432                         5432                     5432, 8080

Public interfaces: http://cdb.mice.rl.ac.uk, http://heplnv152.pp.rl.ac.uk:4443/cdbviewer

Ancient configuration during the running

Role         Master                       Standby                  Standby
Host         configdb.micenet.rl.ac.uk    heplnm069.pp.rl.ac.uk    heplnm072.pp.rl.ac.uk
IP address   172.16.246.25                130.246.47.69            130.246.47.72
OS           SL5                          SL6                      SL6

Public interfaces (in column order): (http://heplnv156.pp.rl.ac.uk:4443/cdbviewer/), http://cdb.mice.rl.ac.uk, http://heplnv152.pp.rl.ac.uk:4443/cdbviewer
Ports open on MiceNet 5432, 8080, 22 5432, 8080

On heplnm069 and heplnm072:

bash-4.1$ which psql
/usr/pgsql-9.1/bin/psql

CDB Failover procedure

On the standby server (heplnm069 or heplnm072)

Check that the standby is in sync

su postgres
psql -c 'select pg_last_xlog_receive_location() "receive_location", pg_last_xlog_replay_location() "replay_location", pg_is_in_recovery() "recovery_status";'
 receive_location | replay_location | recovery_status 
------------------+-----------------+-----------------
 0/8F11DF00       | 0/8F11DF00      | t
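
The standby has replayed everything it has received when the two locations are equal, as in the output above. A minimal sketch of that comparison in shell, assuming the usual hexadecimal high/low form of an xlog location. The function names lsn_key and standby_caught_up are hypothetical helpers, not part of PostgreSQL or of the CDB tools:

```shell
# Turn an xlog location such as 0/8F11DF00 into one sortable integer.
# Assumption: both halves are hexadecimal and the part before the slash
# is the more significant one.
lsn_key() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Succeeds when the replayed location ($2) has reached the received location ($1).
standby_caught_up() {
    [ "$(lsn_key "$2")" -ge "$(lsn_key "$1")" ]
}

# Example with the values from the output above.
if standby_caught_up 0/8F11DF00 0/8F11DF00; then
    echo "standby has replayed all received WAL"
else
    echo "standby is still replaying"
fi
```

This only compares what the standby has received with what it has replayed. Whether the master has sent everything is checked separately on the master.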

On the master server (configdb)

If the master is still up and running, check that the standby is being synchronized

su postgres
ps aux | grep postgres | grep sender
postgres  3375  0.0  0.0 158824  2968 ?        Ss   Oct31   0:03 postgres: wal sender process postgres 130.246.47.69(38367) streaming 0/8F11DF00
postgres  5215  0.0  0.0 158824  2988 ?        Ss   Nov10   0:01 postgres: wal sender process postgres 130.246.47.72(34951) streaming 0/8F11DF00
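
The wal sender lines show which standby each sender serves and the location it has streamed. A small sketch that extracts just those two fields, assuming the process title keeps the form shown above. The function name list_wal_senders is a hypothetical helper:

```shell
# Print "standby-address streamed-location" for every wal sender line on stdin.
# Assumption: the process title has the form
#   "... wal sender process USER ADDRESS(PORT) streaming LOCATION".
list_wal_senders() {
    awk '/wal sender process/ {
        for (i = 2; i < NF; i++) {
            if ($i == "streaming") {
                addr = $(i - 1)
                sub(/\(.*/, "", addr)   # drop the "(port)" suffix
                print addr, $(i + 1)
            }
        }
    }'
}

# Usage on the master:
#   ps aux | list_wal_senders
```

Each standby that should take over is expected to appear in the output with the same location as the others.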

Create a snapshot of the CDB in /var/lib/pgsql/data/pg_xlog/archive

source /etc/cron.weekly/pg-base-backup
source /etc/cron.weekly/pg-dump

Stop postgres as root

service pgsql-cdb stop

On the standby server (heplnm069 or heplnm072)

Promote the standby to master by creating the trigger file named by trigger_file in /var/lib/pgsql/data/recovery.conf

su postgres
touch /var/lib/pgsql/data/failover

Check that the server has left recovery (recovery_status should be f)

psql -c 'select pg_is_in_recovery() "recovery_status";'
 recovery_status 
-----------------
 f
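
The promotion may take a few seconds after the trigger file is created, so a single check can still report t. A sketch of a polling loop, run as the postgres user; wait_for_promotion is a hypothetical helper name and the one-second interval is an arbitrary choice:

```shell
# Poll pg_is_in_recovery() until it returns "f" or the attempts run out.
# $1 is the maximum number of attempts (default 30), one second apart.
wait_for_promotion() {
    local tries="${1:-30}" state
    while [ "$tries" -gt 0 ]; do
        # -A and -t make psql print the bare value without headers or alignment.
        state=$(psql -At -c 'select pg_is_in_recovery();' 2>/dev/null)
        if [ "$state" = "f" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage:
#   wait_for_promotion 60 && echo "promoted" || echo "still in recovery"
```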

Check the connection settings in /opt/mice/etc/cdb-server/cdb.props
emacs -nw /opt/mice/etc/cdb-server/cdb.props

  server.name=MICE Production Server - Master
  db.url=jdbc:postgresql://localhost:5432/
  db.name=cdb
  db.user=mice
  db.pwd=****
  db.superUser=supermouse
  db.superPwd=****

Check that /var/lib/tomcat5/webapps/cdb.war or /var/lib/tomcat/webapps/cdb.war is the correct one (copied from the former master).

Check that /var/lib/pgsql/data/pg_hba.conf contains the entries needed to communicate with the other machines

# "local" is for Unix domain socket connections only
local   cdb             mice,supermouse                         md5     
local   cdb             postgres                                md5
local   cdb             all                                     reject
local   all             all                                     peer
# IPv4 local connections:
host    cdb             mice,supermouse 127.0.0.1/32            md5     
host    cdb             mice            130.246.92.152/32       md5
host    cdb             mice            130.246.92.156/32       md5
host    cdb             all             0.0.0.0/0               reject
host    replication     postgres        172.16.246.25/22        trust
host    replication     postgres        130.246.47.69/22        trust
host    replication     postgres        130.246.47.72/22        trust
host    replication     postgres        127.0.0.1/32            trust
host    all             all             127.0.0.1/32            ident
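
For the failover, the entries that matter most are the "host replication" lines, since every machine that will stream from this server needs one. A sketch of a quick check, assuming the column layout shown above; has_replication_entry is a hypothetical helper and it only matches the address text, it does not evaluate the network mask:

```shell
# Succeeds if FILE ($1) has a "host replication" line whose address starts with IP/ ($2).
# This is a literal text match on the fourth column, not a CIDR calculation.
has_replication_entry() {
    awk -v ip="$2" '
        $1 == "host" && $2 == "replication" && index($4, ip "/") == 1 { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Usage:
#   has_replication_entry /var/lib/pgsql/data/pg_hba.conf 130.246.47.69 \
#       || echo "no replication entry for 130.246.47.69"
```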

As root, restart postgres

service pgsql-cdb restart

and start the dormant tomcat

service tomcat start

On the old primary server (configdb)

Make sure that postgres has been stopped.

Clean /data

cd /var/lib/pgsql/data
rm -rf pg_xlog/*

as root

umount /var/lib/pgsql/data/pg_xlog
su postgres
cd /var/lib/pgsql/
rm -rf data/*

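Take a base backup of the new master. Use the address of whichever standby was promoted: 130.246.47.69 (heplnm069) or 130.246.47.72 (heplnm072). The example uses 130.246.47.72.

/usr/pgsql-9.1/bin/pg_basebackup -h 130.246.47.72 -D /var/lib/pgsql/data -U postgres -v -P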

Create /var/lib/pgsql/data/recovery.conf (or copy it from /var/lib/pgsql/recovery.conf.template). Set host to the address of the new master, 130.246.47.69 or 130.246.47.72; the example uses 130.246.47.72.

standby_mode          = 'on'
primary_conninfo      = 'host=130.246.47.72 port=5432 user=postgres'
trigger_file = '/var/lib/pgsql/data/failover'
restore_command = 'cp /var/lib/pgsql/data/pg_xlog/archive/%f "%p"'
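
Only the host in primary_conninfo changes between failover and switchback, so the file can be generated instead of edited by hand. A sketch, assuming the port, user, trigger file and archive path used on this page; print_recovery_conf is a hypothetical helper name:

```shell
# Print a recovery.conf for a standby that streams from the given master ($1).
# Assumption: everything except the master address is as shown on this page.
print_recovery_conf() {
    local master_host="$1"
    cat <<EOF
standby_mode          = 'on'
primary_conninfo      = 'host=${master_host} port=5432 user=postgres'
trigger_file = '/var/lib/pgsql/data/failover'
restore_command = 'cp /var/lib/pgsql/data/pg_xlog/archive/%f "%p"'
EOF
}

# Usage, as the postgres user on the machine being rebuilt as a standby:
#   print_recovery_conf 130.246.47.72 > /var/lib/pgsql/data/recovery.conf
```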

Recreate the mount point for pg_xlog

mv /var/lib/pgsql/data/pg_xlog /tmp/
mkdir /var/lib/pgsql/data/pg_xlog

as root

mount -a

su postgres
chmod 700 /var/lib/pgsql/data
cp -rp /tmp/pg_xlog/* /var/lib/pgsql/data/pg_xlog/
mkdir /var/lib/pgsql/data/pg_xlog/archive

As root, start postgres; with recovery.conf in place it comes up as a standby

service pgsql-cdb start

On any other remaining standby server (heplnm069 or heplnm072)

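Add the following line to recovery.conf so that the standby follows the timeline of the new master:

emacs -nw /var/lib/postgresql/9.2/main/recovery.conf

  recovery_target_timeline = 'latest'

Then restart postgres as root:

service pgsql-cdb restart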

C&M configuration

http://configdb.micenet.rl.ac.uk and http://172.16.246.25 are hardcoded in several places and should be changed to http://heplnm069.pp.rl.ac.uk or http://heplnm072.pp.rl.ac.uk:
  • Run Control:
    iocTops/RunControl/get_tags.py:    blm_super = BeamlineSuperMouse("http://configdb.micenet.rl.ac.uk:8080")
    iocTops/RunControl/set_cdb_beamline_for_tag.py:    blm_super = BeamlineSuperMouse("http://configdb.micenet.rl.ac.uk:8080")
    iocTops/RunControl/iocBoot/iocRunControl/st.cmd:epicsEnvSet("CDB_SERVER","http://configdb.micenet.rl.ac.uk:8080")
    iocTops/RunControl/RunControlApp/src/RunControl.c:  else if (!strcmp(server,"configdb.micenet.rl.ac.uk")) {
    iocTops/RunControl/get_cdb_beamline_for_tag.py:    blm_super = BeamlineSuperMouse("http://configdb.micenet.rl.ac.uk:8080")
    
  • Other EPICS bits:
    Config/ProcLauncher/MICE-SM.bash:export CDB_SERVER=http://configdb.micenet.rl.ac.uk
    iocTops/BeamLine/iocBoot/iocBeamLine/st.cmd:epicsEnvSet("CDB_SERVER","http://configdb.micenet.rl.ac.uk:8080")
    Software/StateMachineConfig/NOTES: MCDB see http://micewww.pp.rl.ac.uk/projects/configdb/wiki#Python-Client
    Software/UtilityScripts/Soft_IOC_Launcher/MICE-SM.bash:export CDB_SERVER=http://configdb.micenet.rl.ac.uk
    Software/UtilityScripts/Soft_IOC_Launcher/RunControl.bash:export CDB_SERVER=http://configdb.micenet.rl.ac.uk:8080
    
    Software/StateMachineConfig/convert/cdb_configuration.py:    "PROD": 'http://172.16.246.25:8080',
  • Non-EPICS stuff?
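
The list above was collected by hand and may be incomplete. A sketch of how the remaining references could be found in a checkout of the C&M software; find_hardcoded_cdb is a hypothetical helper name and the checkout path in the usage line is a placeholder:

```shell
# List the files under DIR ($1) that still mention the old server by name or by address.
# Returns non-zero when nothing is found, as grep does.
find_hardcoded_cdb() {
    grep -rlE 'configdb\.micenet\.rl\.ac\.uk|172\.16\.246\.25' "$1"
}

# Usage (replace the path with the real checkout):
#   find_hardcoded_cdb /path/to/checkout
```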

Switchback

On the new primary (heplnm069 or heplnm072)

Check that it is in sync

su postgres
psql -c 'select pg_last_xlog_receive_location() "receive_location", pg_last_xlog_replay_location() "replay_location", pg_is_in_recovery() "recovery_status";'

As root, stop postgres

service pgsql-cdb stop

Stop tomcat

service tomcat stop

On the new standby (configdb)

Before promoting, check that the standby is completely in sync

psql -c 'select pg_last_xlog_receive_location() "receive_location", pg_last_xlog_replay_location() "replay_location", pg_is_in_recovery() "recovery_status";'

Promote the standby back to master

touch /var/lib/pgsql/data/failover

Check /var/lib/pgsql/data/pg_hba.conf

This directory should already be present; create it if it is missing

mkdir /var/lib/pgsql/data/pg_xlog/archive

Restart as root

service pgsql-cdb restart

Restart tomcat

service tomcat restart

On the new primary (heplnm069 or heplnm072)

Stop postgres

service pgsql-cdb stop

Clean /data

cd /var/lib/pgsql/data
rm -rf pg_xlog/*

as root

umount /var/lib/pgsql/data/pg_xlog
su postgres
cd /var/lib/pgsql/
rm -rf data/*

Take a backup of the new server

su postgres
/usr/pgsql-9.1/bin/pg_basebackup -h 172.16.246.25 -D /var/lib/pgsql/data -U postgres -v -P

Create /var/lib/pgsql/data/recovery.conf (or copy from /var/lib/pgsql/recovery.conf.template)

standby_mode          = 'on'
primary_conninfo      = 'host=172.16.246.25 port=5432 user=postgres'
trigger_file = '/var/lib/pgsql/data/failover'
restore_command = 'cp /var/lib/pgsql/data/pg_xlog/archive/%f "%p"'

recreate the mount point for pg_xlog

mv /var/lib/pgsql/data/pg_xlog /tmp/
mkdir /var/lib/pgsql/data/pg_xlog

as root

mount -a

su postgres
cp -rp /tmp/pg_xlog/* /var/lib/pgsql/data/pg_xlog/
mkdir /var/lib/pgsql/data/pg_xlog/archive

As root, start postgres so that this server rejoins as the new standby

service pgsql-cdb start

C&M configuration

  • Restore the previous configuration using: http://configdb.micenet.rl.ac.uk:8080

Updated by Franchini, Paolo over 2 years ago · 118 revisions