Distributed spill transformation troubleshooting and recovery¶
- Table of contents
- Distributed spill transformation troubleshooting and recovery
- Gotchas
- Errors in Celery worker window during Celery start-up
- Errors in Celery worker window during operation
- Errors in "multi_process_input_transform" (or "multi_process") client window
- "RabbitMQ cannot be contacted"
- "No Celery nodes are available"
- "Exception when using document store"
- "Celery node(s) failed to configure...expected MAUS release version"
- "No such transform"
- "Celery node(s) failed to configure"
- "WorkerBirthFailedException'"
- "WorkerDeathFailedException'"
- "Celery task ... FAILED"
- "New run detected...waiting for current processing to complete" ... then nothing
- Spills are being submitted but no SUCCESS or FAILURE messages are returned
- Errors in "multi_process_merge_output" (or "multi_process") client window
- Recovery check-lists
This page summarises error detection and recovery for distributed spill transformation. It complements the following general guides:
- How to set up Celery and RabbitMQ
- RabbitMQ configuration and monitoring
- Celery configuration and monitoring
- How to configure MongoDB as a document cache
- Deploying the MAUS web front-end
Gotchas¶
The following are "gotchas" - or things to be aware of - then using distributed spill transformation.
Ensure MAUS clients and Celery workers have the same value for MAUS_ROOT_DIR¶
MAUS sends configuration information to Celery workers. This can include absolute paths. MAUS deployments must have the same MAUS_ROOT_DIR
path for both the Celery worker(s) and the clients executing spill transformation workflows. Even if Celery workers are running on different hosts, ensure that the path to the MAUS directory is the same on each host. This is a known issue - #918
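As a quick sanity check you can run a snippet like the following (a minimal sketch, not part of MAUS) on the client host and on each Celery worker host, in a shell where MAUS has been set up; the printed path should be identical everywhere.
import os

# The MAUS root directory seen by this process; the value should be identical
# on the client host and on every Celery worker host.
print(os.environ.get("MAUS_ROOT_DIR", "MAUS_ROOT_DIR is not set"))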
Watch MongoDB's storage usage¶
MAUS wipes any collection it uses to store spills prior to using it, i.e. it empties it. However, it does not empty the collection after a run. If you use a different collection for each run then you may find the storage used by MongoDB growing considerably. If RabbitMQ and MongoDB are running on the same host and storage becomes limited, RabbitMQ may have no space to store pending jobs for Celery and Celery tasks will begin to fail.
- You can track MongoDB storage usage using the summarise_mongodb.py client.
- You can delete MongoDB collections using the delete_mongodb.py client.
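If you prefer to inspect MongoDB directly, a pymongo snippet along the following lines can report per-collection storage usage and drop collections you no longer need. This is only a sketch: the database and collection names are placeholders, the exact calls depend on your pymongo version (older versions use Connection and collection_names()), and the clients above remain the supported route.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # use your mongodb_host/mongodb_port settings
db = client["mausdb"]                     # hypothetical database name

# Report the storage used by each collection.
for name in db.list_collection_names():
    stats = db.command("collstats", name)
    print("%s: %.1f MB" % (name, stats["storageSize"] / (1024.0 * 1024.0)))

# Drop a collection that is no longer needed.
db.drop_collection("spills_run_1234")     # hypothetical collection name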
Do NOT resize the Celery worker xterm once you've started Celery¶
Resizing the Celery worker xterm causes a Celery worker's sub-processes to die and new ones to be created. This may cause complications for any currently-running jobs. It is unclear why this arises (it is not a MAUS-specific bug).
Errors in Celery worker window during Celery start-up¶
"write on a pipe with no one to read it"¶
If you see the error,
*** Break *** write on a pipe with no one to read it
then please ignore this. It's unclear as to why this arises, but it is not specific to MAUS and does not seem to have any negative side effects.
It only occurs in the MAUS control room if both the --purge and -c N arguments are used and N is > 2.
"Connection refused"¶
If you see the error,
[2012-03-16 13:33:09,035: WARNING/MainProcess] [Errno 111] Connection refused
[2012-03-16 13:33:09,036: INFO/MainProcess] process shutting down
then RabbitMQ is not running or is inaccessible. You should check RabbitMQ.
Errors in Celery worker window during operation¶
"broker forced connection closure"¶
If you see the error,
AMQPConnectionException: (320, u"CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", (0, 0), '')
then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.
"Connection refused"¶
If you see the error,
[2012-03-16 13:30:57,411: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 10 seconds...
then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.
"Couldn't send result"¶
If you see the error,
[2012-03-16 13:41:37,376: WARNING/PoolWorker-2] Couldn't send result for 'd2e08516-ae54-4ec1-b88a-bd84bcac05a5': error(111, 'Connection refused'). Retry in 1.0s.
and after a few seconds it does not resolve itself, then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.
"expected MAUS release version"¶
If you see the error,
[2012-03-16 10:31:39,182: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'maus_version: expected MAUS release version 0.1.4, got MAUS release version 0.1.4', 'error': "<type 'exceptions.ValueError'>"}]}
then the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.
"No such transform"¶
If you see the error,
[2012-03-16 16:50:16,584: INFO/MainProcess] Birthing transform [u'MapPyNewTransform']
[2012-03-16 16:50:16,586: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'No such transform: MapPyNewTransform', 'error': "<type 'exceptions.ValueError'>"}]}
then the MAUS framework has requested a transform that does not exist. This may be because the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.
"Status: {'status':'error'...}"¶
If you see the error,
[2012-03-16 12:12:36,455: INFO/MainProcess] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,487: INFO/PoolWorker-2] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,522: WARNING/PoolWorker-2] MapPyGroup
[2012-03-16 12:12:36,525: INFO/PoolWorker-1] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,561: WARNING/PoolWorker-1] MapPyGroup
[2012-03-16 12:12:36,562: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'Some transform problem', 'error': "<type 'exceptions.ValueError'>"}]}
then either a current transform threw an exception during its death or a new transform threw an exception during its birth. You should investigate problems with birth or, if the error arose during death, investigate problems with transform death in Celery.
"WorkerBirthFailedException'"¶
If you see the error,
[2012-03-16 12:14:30,721: INFO/PoolWorker-1] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:14:30,756: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'MapPyGroup.MapPyGroup returned False', 'error': "<class 'framework.workers.WorkerBirthFailedException'>"}]}
then a new transform returned False during its birth which may hide a more serious problem. You should investigate problems with birth.
"WorkerDeathFailedException'"¶
If you see the error,
[2012-03-16 12:36:11,280: WARNING/PoolWorker-1] DEATH MapPyGroup
[2012-03-16 12:36:11,470: INFO/PoolWorker-2] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:36:11,470: WARNING/PoolWorker-2] DEATH MapPyGroup
[2012-03-16 12:36:11,471: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'MapPyGroup.MapPyGroup returned False', 'error': "<class 'framework.workers.WorkerDeathFailedException'>"}]}
then a current transform returned False during its death which may hide a more serious problem. You should investigate problems with transform death in Celery.
"WorkerProcessException"¶
If you see,
[2012-03-16 10:51:00,337: ERROR/MainProcess] Task mauscelery.maustasks.MausGenericTransformTask[c5e35e98-5d99-4256-a4ae-9256a1f46844] raised exception: UnpickleableExceptionWrapper('framework.workers', 'WorkerProcessException', (), 'WorkerProcessException()')
Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: WorkerProcessException()
Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: WorkerProcessException()
then an error occurred when transforming a spill.
You may want to wait and see whether this is a one-off occurrence and leave it (the framework will keep processing spills). However, if all spills are failing you should stop your client and investigate spill processing problems.
Errors in "multi_process_input_transform" (or "multi_process") client window¶
"RabbitMQ cannot be contacted"¶
If you see the error,
framework.utilities.RabbitMQException: RabbitMQ cannot be contacted. Problem is [Errno 111] Connection refused
then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.
"No Celery nodes are available"¶
If you see the error,
framework.utilities.NoCeleryNodeException: No Celery nodes are available
then no Celery workers are running or are accessible. You should start one or more Celery workers.
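As a quick check of which workers (if any) are reachable, something like the following sketch can be used. It assumes the Celery 2.x API bundled with MAUS; newer Celery releases expose the same information through app.control.inspect() instead.
from celery.task.control import inspect

# ping() returns the replies from live workers, or nothing if none are reachable.
responses = inspect().ping()
if not responses:
    print("No Celery workers responded - start one or more Celery workers")
else:
    print("Celery worker replies: %s" % responses)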
"Exception when using document store"¶
If you see the error,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: could not connect to localhost:27017: [Errno 111] Connection refused
Or,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: [Errno 111] Connection refused
Or,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: connection closed
Then MongoDB has not been started or has suddenly stopped. You should check MongoDB.
"Celery node(s) failed to configure...expected MAUS release version"¶
If you see the error,
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: [(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'maus_version: expected MAUS release version 0.1.4, got MAUS release version 0.1.4', u'error': u"<type 'exceptions.ValueError'>"}]})]
then the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.
"No such transform"¶
If you see the error,
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: [(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'No such transform: MapPyNewTransform', u'error': u"<type 'exceptions.ValueError'>"}]})]
then the MAUS framework has requested a transform that does not exist. This may be because the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.
"Celery node(s) failed to configure"¶
If you see the error,
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: [(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'Some transform death problem', u'error': u"<type 'exceptions.ValueError'>"}]})]
then either a current transform threw an exception during its death or a new transform threw an exception during its birth. You should investigate problems with birth or, if the error arose during death, investigate problems with transform death in Celery.
"WorkerBirthFailedException'"¶
If you see the error,
---------- RUN 3386 ----------
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: [(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'MapPyGroup.MapPyGroup returned False', u'error': u"<class 'framework.workers.WorkerBirthFailedException'>"}]})]
then a new transform returned False during its birth which may hide a more serious problem. You should investigate problems with birth.
"WorkerDeathFailedException'"¶
If you see the error,
New run detected...waiting for current processing to complete
---------- RUN 3386 ----------
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: [(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'MapPyGroup.MapPyGroup returned False', u'error': u"<class 'framework.workers.WorkerDeathFailedException'>"}]})]
Celery worker:
[2012-03-16 12:36:11,280: WARNING/PoolWorker-1] DEATH MapPyGroup
[2012-03-16 12:36:11,470: INFO/PoolWorker-2] Birthing transform [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:36:11,470: WARNING/PoolWorker-2] DEATH MapPyGroup
[2012-03-16 12:36:11,471: INFO/MainProcess] Status: {'status': 'error', 'error': [{'message': 'MapPyGroup.MapPyGroup returned False', 'error': "<class 'framework.workers.WorkerDeathFailedException'>"}]}
then a current transform returned False during its death which may hide a more serious problem. You should investigate problems with transform death in Celery.
"Celery task ... FAILED"¶
If you see,
Celery task b43d68ce-7ec6-4b74-96d2-a07595a55e44 FAILED
Celery task b43d68ce-7ec6-4b74-96d2-a07595a55e44 FAILED
:
:
Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints'] process threw an exception: {'message': 'Some transforming spill problem!', 'error': "<type 'exceptions.ValueError'>"}
then an error occurred when transforming a spill. You may want to wait and see whether this is a one-off occurrence and leave it (the framework will keep processing spills). However, if all spills are failing you should stop your client and investigate spill processing problems.
"New run detected...waiting for current processing to complete" ... then nothing¶
If you see,
New run detected...waiting for current processing to complete
and the client then just sits there for ages, then the Celery worker may have stopped. You should start one or more Celery workers.
Spills are being submitted but no SUCCESS or FAILURE messages are returned¶
If you see many lines like,
INPUT: read next spill
Spills input: 147 Processed: 5 Failed 0
Task ID: 0283acb5-5757-44f4-8965-ac9828de7850
1 spills left in buffer
INPUT: read next spill
Spills input: 148 Processed: 5 Failed 0
Task ID: ead99627-e621-48ec-ab96-799570dc8922
1 spills left in buffer
INPUT: read next spill
Spills input: 149 Processed: 5 Failed 0
Task ID: 072ce8e5-9a67-4ca4-b58d-7f14be4a650a
but no SUCCESS or FAILURE messages like,
Celery task c6b31e02-5294-4550-b00e-d8403f44f7f2 SUCCESS
SAVING to collection spills (with ID 5)
Spills input: 5 Processed: 5 Failed 0
then the Celery worker may have stopped. You should start one or more Celery workers.
Errors in "multi_process_merge_output" (or "multi_process") client window¶
"Exception when using document store"¶
If you see the error,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: could not connect to localhost:27017: [Errno 111] Connection refused
Or,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: [Errno 111] Connection refused
Or,
docstore.DocumentStore.DocumentStoreException: Exception when using document store: connection closed
Then MongoDB has not been started or has suddenly stopped. You should check MongoDB.
"BIRTH merger ... Traceback" or "BIRTH outputter ... Traceback"¶
If you see a Traceback during the birth of a merger or outputter, e.g.
BIRTH merger ReducePyTOFPlot.ReducePyTOFPlot
Traceback (most recent call last):
...
ValueError: Some merger birth problem
then a merger or outputter threw an exception during its birth. You should investigate problems with birth.
"WorkerBirthFailedException"¶
If you see a WorkerBirthFailedException during the birth of a merger or outputter, e.g.
---------- START RUN 3386 ----------
BIRTH merger ReducePyTOFPlot.ReducePyTOFPlot
Traceback (most recent call last):
...
framework.workers.WorkerBirthFailedException: ReducePyTOFPlot.ReducePyTOFPlot returned False
then a merger or outputter returned False during its birth which may hide a more serious problem. You should investigate problems with birth.
"DEATH merger ... Traceback" or "DEATH outputter ... Traceback"¶
If you see a Traceback during the death of a merger or outputter, e.g.
Finishing current run...sending end_of_run to merger
DEATH merger ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts
Traceback (most recent call last):
...
ValueError: Some merger death problem
then a merger or outputter threw an exception during its death. You should investigate problems with death.
"WorkerDeathFailedException"¶
If you see a WorkerDeathFailedException during the death of a merger or outputter, e.g.
Finishing current run...sending end_of_run to merger
DEATH merger ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts
Traceback (most recent call last):
...
framework.workers.WorkerDeathFailedException: ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts returned False
then a merger or outputter returned False during its death which may hide a more serious problem. You should investigate problems with death.
Recovery check-lists¶
Check RabbitMQ¶
- Log in as a super-user by using sudo su - or su.
- Check RabbitMQ is running OK:
$ /sbin/service rabbitmq-server status
- If RabbitMQ is not running then start it up:
$ /sbin/service rabbitmq-server start
- If RabbitMQ is running, check that MAUS is configured with the correct RabbitMQ connection information. For the deployment (framework or Celery worker) that is showing the problem:
- Check src/common_py/mauscelery/celeryconfig.py:
BROKER_HOST = "localhost"
BROKER_PORT = 5672
BROKER_USER = "maus"
BROKER_PASSWORD = "suam"
BROKER_VHOST = "maushost"
- If RabbitMQ is running on a different host then BROKER_HOST should be set to the full host name of the host on which RabbitMQ is running.
- If the settings are fine but the current host is a different host from RabbitMQ, check that the RabbitMQ host allows connections on port 5672. If not, open the port or ask someone with the necessary permissions to do so.
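If it is unclear whether the broker can be reached at all from a given host, a bare TCP check such as the following sketch (standard library only; the host and port are placeholders for your own settings) distinguishes a closed port or firewall from a RabbitMQ configuration problem. It does not validate the user, password or vhost.
import socket

host, port = "localhost", 5672  # your BROKER_HOST and BROKER_PORT values
try:
    sock = socket.create_connection((host, port), timeout=5)
    sock.close()
    print("RabbitMQ port %d on %s is reachable" % (port, host))
except socket.error as exc:
    print("Cannot reach RabbitMQ on %s:%d - %s" % (host, port, exc))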
For more on RabbitMQ and Celery see,
- How to set up Celery and RabbitMQ
- RabbitMQ configuration and monitoring
- Celery configuration and monitoring
Start one or more Celery workers¶
For example,
$ celeryd -l INFO -c 8 --purge
For more on starting Celery workers see,
- How to set up Celery and RabbitMQ
- Celery configuration and monitoring
Ensure consistent MAUS versions¶
MAUS clients need to run the same version of MAUS as the Celery workers they use to process spills.
Ensure that your MAUS clients and Celery workers are using the same version of MAUS, then start one or more Celery workers and rerun your client.
For more on starting Celery workers see,
- How to set up Celery and RabbitMQ
- Celery configuration and monitoring
Investigate problems with transform death in Celery¶
MAUS Celery components will raise an error if a problem arises when invoking death on their current transforms. However, the worker can be made to recreate and reconfigure its transforms for ongoing use simply by rerunning your client.
For the benefit of other users, you may also want to investigate problems with death.
Investigate problems with birth¶
The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be one of:
- A configuration error, in which case check that your MAUS configuration matches that expected by the transform by comparing it to the transform's documentation (which should document any expected configuration values). If there is no documentation for the expected configuration values, then raise a new issue.
- An implementation error, in which case raise a new issue.
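As an illustration only (this is not an actual MAUS module, and the class and configuration key are made up), the sketch below shows the kind of configuration checking a transform typically does in birth; a missing or badly-typed value is what ends up reported as an exception or as "returned False" in the windows above.
import json

class MapPyExampleTransform(object):  # hypothetical transform for illustration
    def birth(self, json_configuration):
        """Return True if the configuration is usable, False otherwise."""
        try:
            config = json.loads(json_configuration)
            # "example_tdc_count" is a made-up configuration key.
            self._tdc_count = int(config["example_tdc_count"])
            return True
        except (ValueError, KeyError):
            # A missing or badly-typed value leads to birth failing, which the
            # framework reports as "... returned False".
            return False

    def process(self, json_spill_doc):
        return json_spill_doc

    def death(self):
        return True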
Investigate spill processing problems¶
The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be one of:
- A configuration error, in which case check that your MAUS configuration matches that expected by the transform by comparing it to the transform's documentation (which should document any expected configuration values). If there is no documentation for the expected configuration values, then raise a new issue.
- An implementation error, in which case raise a new issue.
Investigate problems with death¶
The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be an implementation error, in which case raise a new issue.
Check MongoDB¶
- Log in as a super-user by using sudo su - or su.
- Check MongoDB is running OK:
$ /sbin/service mongod status
mongod (pid 4357) is running...
- If MongoDB is not running then start it up:
$ /sbin/service mongod start
- If MongoDB is running, check that MAUS is configured with the correct MongoDB connection information:
- Check src/common_py/ConfigurationDefaults.py:
mongodb_host = "localhost"
mongodb_port = 27017
- If MongoDB is running on a different host then mongodb_host should be set to the full host name of the host on which MongoDB is running.
- If the settings are fine but the current host is a different host from MongoDB, check that the MongoDB host allows connections on port 27017. If not, open the port or ask someone with the necessary permissions to do so.
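A minimal connectivity check from the problem host is sketched below, assuming a reasonably recent pymongo (older versions use pymongo.Connection instead of MongoClient); replace the host and port with your mongodb_host and mongodb_port values.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

try:
    client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=5000)
    info = client.server_info()  # raises an error if the server cannot be reached
    print("MongoDB %s is reachable" % info.get("version", "unknown"))
except PyMongoError as exc:
    print("Cannot reach MongoDB - %s" % exc)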
For more on MongoDB see,
- How to configure MongoDB as a document cache