Project

General

Profile

FF 20121401

Summary - PPD/MICE computing outage, Christmas 2012

I would like to arrange a meeting in the week of 14-18 January to review the MICE computing outage over Christmas. I think Antony is a key person, so we should wait for his vacation to finish before settling on a date. I have set up a Doodle poll to find a suitable time, available at:

http://www.doodle.com/3pgk4dhqg58iaepz

I propose the following agenda:

  • List of services operated for MICE at RAL outside of the control room
    (Antony Wilson)
    • Is it complete and correct?
  • Service Level Agreement for services (Antony Wilson)
    • Is it sufficient?
  • Incident on 26th December -> January
  • Recommendations and implementation (Chris Rogers/discussion)
  • AOB

I propose that the output of the meeting should be a report covering the salient points above. The issue of RAL wireless networks should be handled separately.

Notes

Present:
Chris Brew
Antony Wilson
Henry Nebrensky
Linda Coney (by phone)
Chris Rogers
Rob Harper

Principal Points

  • ACTION: Chris Brew and Rob Harper to get a redmine login
  • ACTION: Linda Coney to ask Pierrick about EPICS archiver status and plans
  • ACTION: Linda Coney to review SLAs
  • RECOMMENDATION: Chris Brew to phone MOM (07789 272992) in the event of an issue that might affect MICE services
  • ACTION: Antony Wilson to investigate if/where services are documented - and append to list of services as a new table column
  • ACTION: Antony Wilson to investigate if/where services are backed up and who has responsibility - and append to list of services as a new table column
  • Date of next meeting - proposed Thursday 7th February at 15:00 GMT

Notes on Agenda

  • List of services - is it complete and correct?
    • Henry raised the issue of EPICS archiver status/plans - it may be a new or existing service that is not known about
    • ACTION: Linda Coney to ask Pierrick about EPICS archiver status and plans
  • Service Level Agreement for services - is it sufficient?
    • Some discussion; basically we don't know
    • ACTION: Linda Coney to review SLAs
  • Incident report
    • See Chris Brew report
    • Additionally, Antony commented that on restart Postgres9 failed to start because of changed file-permissions handling in Postgres9, which prevented the CDB from restarting properly
    • RECOMMENDATION: Chris Brew to phone MOM in the event of an issue that might affect MICE services
  • Noted that the SLA was achieved
    • NBD on MICE servers
    • 2 weeks on CDB mirror

Other Discussion

  • It was noted there is a lack of documentation on services
    • ACTION: Antony Wilson to investigate if/where services are documented - and append to list of services as a new table column
  • It was noted there is a lack of documentation on backups
    • ACTION: Antony Wilson to investigate if/where services are backed up and who has responsibility - and append to list of services as a new table column
  • Antony suggested improved monitoring for MICE equipment
    • Nagios, Ganglia
  • Chris Brew notes that RAL PPD will be unable to provide out-of-hours support at current staffing levels
    • Not clear how this can be provided on RAL site
    • ISIS probably have their own support for ISIS instruments
    • External contractors e.g. Dell?
    • HyperV should improve redundancy
    • Improved documentation => MOM provides support for critical systems?
    • RAL Tier 1 centre?
    • TBD when we have a set of requirements from Operations (Linda)
  • Chris Brew notes that we have no SLA from RAL site

PM-20121225Lab8AirConFailure___Computing_Private___TWiki.pdf (115 KB) Rogers, Chris, 17 January 2013 16:35