FF 20121401
Summary - PPD/MICE computing outage Christmas 2012
I would like to arrange a meeting in the week 14-18 January to review the MICE computing outage over Christmas. Antony is a key person, so we should wait until he is back from vacation before settling on a date. I have set up a Doodle poll to find a suitable time, available at:
http://www.doodle.com/3pgk4dhqg58iaepz
I propose the following agenda:
- [[computing-software:Computing_infrastructure|List of services]] operated for MICE at RAL outside of the control room (Antony Wilson)
- Is it complete and correct?
- Service Level Agreement for services (Antony Wilson)
- Is it sufficient?
- Incident on 26th December -> January
- Incident report - what happened when? (Chris Brew)
- Was SLA achieved? (Antony Wilson)
- Recommendations and implementation (Chris Rogers/discussion)
- AOB
I propose that the output of the meeting should be a report covering the salient points above. The issue of RAL wireless networks should be handled separately.
Notes
Present:
Chris Brew
Antony Wilson
Henry Nebrensky
Linda Coney (by phone)
Chris Rogers
Rob Harper
Principal Points
- ACTION: Chris Brew and Rob Harper to get a redmine login
- ACTION: Linda Coney to ask Pierrick about EPICS archiver status and plans
- ACTION: Linda Coney to review SLAs
- RECOMMENDATION: Chris Brew to phone MOM (07789 272992) in the event of an issue that might affect MICE services
- ACTION: Antony Wilson to investigate if/where services are documented - and append to list of services as new table column
- ACTION: Antony Wilson to investigate if/where services are backed up and who has responsibility - and append to list of services as a new table column.
- Date of next meeting - propose Thursday 7th February at 15.00 GMT
Notes on Agenda
- List of services - is it complete and correct?
- Henry raised issue of EPICS archiver status/plans - may be new or existing service that is not known
- ACTION: Linda Coney to ask Pierrick about EPICS archiver status and plans
- Service Level Agreement for services - is it sufficient?
- Some discussion, basically we don't know
- ACTION: Linda Coney to review SLAs
- Incident report
- See Chris Brew report
- Additionally, Antony commented that after the restart Postgres9 failed to start because of its changed handling of file permissions, which prevented CDB from restarting properly
- RECOMMENDATION: Chris Brew to phone MOM in the event of an issue that might affect MICE services
- Noted that SLA was achieved
- NBD on MICE servers
- 2 weeks on CDB mirror
Other Discussion
- It was noted there is a lack of documentation on services
- ACTION: Antony Wilson to investigate if/where services are documented - and append to list of services as new table column
- It was noted there is a lack of documentation on backups
- ACTION: Antony Wilson to investigate if/where services are backed up and who has responsibility - and append to list of services as a new table column.
- Antony suggested improved monitoring for MICE equipment
- Nagios, Ganglia
- Chris Brew notes that RAL PPD will be unable to provide out-of-hours support at current staffing levels
- Not clear how this can be provided on RAL site
- ISIS probably have their own support for ISIS instruments
- External contractors e.g. Dell?
- Hyper-V should improve redundancy
- Improved documentation => MOM provides support for critical systems?
- RAL Tier 1 centre?
- TBD when we have a set of requirements from Operations (Linda)
- Chris Brew notes that we have no SLA from RAL site
Updated by Rogers, Chris over 10 years ago · 10 revisions