PPD/MICE computing outage Christmas 2012

MICE suffered a computing outage during the Christmas 2012 break that led to a loss of services during operation of the AFC module in building R9 and interfered with development work for service users.

Event Summary

A report on what happened, and when, is available here. Additionally, the ConfigDB hot backup failed to restart; this was due to a conflict between a Postgres version upgrade and some user-permission changes that only emerged during the system reboot.
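
For context, a hot backup of this kind is typically a Postgres streaming-replication standby. The sketch below shows the minimal settings involved, assuming Postgres 9.x-era configuration; the hostnames, role name, and network range are placeholders, not the actual MICE values. The pg_hba.conf entry is exactly the kind of permission record that a version upgrade or permissions change can silently invalidate until the next restart:

```
# postgresql.conf on the master (MLCR CDB)
wal_level = hot_standby        # write enough WAL for a standby to serve reads
max_wal_senders = 3            # allow replication connections

# pg_hba.conf on the master -- the replication permission entry
host  replication  replicator  192.0.2.0/24  md5

# recovery.conf on the standby (PPD hot backup)
standby_mode = 'on'
primary_conninfo = 'host=cdb-master.example port=5432 user=replicator'

# postgresql.conf on the standby -- serve read-only queries while replicating
hot_standby = on
```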

Two meetings were held to discuss the outcome.

Conclusions

  • The service level agreement with RAL PPD was met.
  • Previously, magnet testing was not considered "Operations". We had always assumed that RAL PPD would host non-critical services and that critical services would be hosted on MICENet. However, it has become clear that some critical services are hosted at RAL PPD (those that require visibility to the outside world).

Recommendations

A number of recommendations have been drawn up both to improve system robustness and, where possible, to move systems off RAL PPD onto MICENet or to provide an alternative service. A few general points are raised first, followed by details of how each system should be modified. Note that only the top-level (user-visible) services are listed; all their dependencies should also be included.

General

  • Systems documentation has been improved: http://micewww.pp.rl.ac.uk/projects/computing-software/wiki/Computing_infrastructure.
  • Systems have been prioritised by importance:
    • High - used for expert intervention in operations
    • Medium - used for development work
    • Low - used for development work but a workaround exists in case of failure
  • Low priority systems remain as-is
  • Medium priority systems should be placed on the HyperV system to improve robustness
    • HyperV is a set of virtual machines hosted on three distinct servers with an automated failover system.
  • High priority systems should be placed on the HyperV system and an alternative should be available (e.g. RAL bastion)
  • RAL PPD has responsibility for the hardware and will provide at least a next-business-day (NBD) service
  • It is recommended that MICE make a list of computing experts who can provide an NBD level of service
    • Requires training
    • The MICE Operations Manager (MOM) is the interface from RAL PPD to MICE: in the event of a system failure, RAL PPD (or whoever notices) should contact the MOM, who then contacts MICE.

elog (Ian Taylor)

  • elog master should be hosted on MICENet.
  • The PPD elog should be set up to mirror the MICENet elog (master-master). Operations should use the MICENet elog. Where this is not possible, e.g. for SS tests in California, the group should host their own elog.
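
As a sketch of how the mirror might be configured, assuming elog's built-in mirroring support is used: the option name and URL below are illustrative and should be checked against the elogd documentation before deployment:

```
; elogd.cfg fragment on the PPD elog -- illustrative only
[global]
; URL of the MICENet master to synchronise against (placeholder address)
Mirror server = http://elog.micenet.example:8080
```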

MICE bastion aka mousehole

  • MICE bastion should be placed on the HyperV system
  • PPD is building its own bastion system on the MICE bastion model; this will operate as a backup in case of software failure, but will be hosted on the same hardware.
  • We propose investigating use of the RAL bastion as a further layer of backup for experts
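
For experts, the layered-bastion setup can be captured in an SSH client configuration so that switching to a backup bastion is a one-line change. A sketch, with placeholder hostnames and username rather than the real addresses:

```
# ~/.ssh/config -- illustrative hop through the MICE bastion
Host micenet-*
    # Primary route via mousehole; swap in the PPD or RAL bastion
    # hostname here if mousehole is down.
    ProxyCommand ssh -W %h:%p mousehole.example.ac.uk

Host mousehole.example.ac.uk
    User mice_expert
```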

EPICS Gateway (Pierrick Hanlet)

  • EPICS Gateway should be placed on the HyperV system
  • We require better documentation of the EPICS Gateway: what is it actually doing?
    • Required before we can make any further suggestions (we would also like to provide an alternative way to run this service in the event of a failure)
    • Need to develop a local expert who understands EPICS
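
Part of the requested documentation could be the gateway's own configuration. The standard EPICS Channel Access gateway is driven by a PV list file plus network options on the command line, roughly as below; the PV name pattern and address placeholders are illustrative, and the actual MICE settings should be recorded on the wiki:

```
# gateway.pvlist -- which PVs the gateway exposes (patterns are regexes)
.*              DENY
MICE:.*         ALLOW

# Typical invocation (illustrative): -cip and -sip are the client- and
# server-side address lists, i.e. the outside and MICENet interfaces
#   gateway -pvlist gateway.pvlist -cip <outside addr> -sip <micenet addr>
```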

Redmine and associated web service

  • Redmine to be moved onto HyperV setup

ConfigDB Hot Backup/Slave (Antony Wilson)

  • The ConfigDB slave functions as both a hot backup of the MLCR CDB and a read-only slave for external use
  • Move onto HyperV setup
  • Ensure it is still accessible to the MLCR
    • Needs a test
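
The "needs a test" item can be partially automated. A minimal sketch, using only the Python standard library, that checks the Postgres port is reachable from the MLCR; the host and port are placeholders for the actual ConfigDB endpoint:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholder endpoint -- substitute the real ConfigDB host and port.
    host, port = "configdb.example", 5432
    status = "reachable" if is_reachable(host, port) else "UNREACHABLE"
    print(f"{host}:{port} is {status}")
```

This only verifies network-level access; a full check would also run a read-only query against the database as an MLCR client would.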

MAUS test servers

  • No change

MICE hall web cams

  • No change

Updated by Rogers, Chris almost 10 years ago · 15 revisions