Distributed spill transformation troubleshooting and recovery

This page summarises error detection and recovery for distributed spill transformation. It complements the general guides.

Gotchas

The following are "gotchas" - or things to be aware of - when using distributed spill transformation.

Ensure MAUS clients and Celery workers have the same value for MAUS_ROOT_DIR

MAUS sends configuration information to Celery workers. This can include absolute paths. MAUS deployments must use the same MAUS_ROOT_DIR path for both the Celery worker(s) and the clients executing spill transformation workflows. Even if Celery workers are running on different hosts, ensure that the path to the MAUS directory is the same. This is a known issue - #918
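As a quick sanity check, a client can compare its own MAUS_ROOT_DIR against the path a worker reports. This is only a sketch: it assumes MAUS_ROOT_DIR is exported in the environment, and check_maus_root_dir is an illustrative helper, not part of the MAUS API.

```python
# Illustrative helper (not MAUS API): compare this client's MAUS_ROOT_DIR
# against a path reported by a Celery worker. Assumes MAUS_ROOT_DIR is
# exported in the environment on both sides.
import os

def check_maus_root_dir(worker_root_dir):
    """Raise ValueError if the worker's MAUS_ROOT_DIR differs from ours."""
    client_root_dir = os.environ.get("MAUS_ROOT_DIR")
    if client_root_dir != worker_root_dir:
        raise ValueError("MAUS_ROOT_DIR mismatch: client has %s, worker has %s"
                         % (client_root_dir, worker_root_dir))
    return client_root_dir
```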

Watch MongoDB's storage usage

MAUS wipes any collection it uses to store spills prior to using it, i.e. it empties it. However, it does not empty the collection after a run. If you use a different collection for each run you may find MongoDB's storage usage growing considerably. If RabbitMQ and MongoDB run on the same host and storage grows limited, RabbitMQ may have no space to store pending jobs for Celery, and Celery tasks will begin to fail badly.
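One way to keep storage in check is to periodically drop spill collections from finished runs. The sketch below only computes which collections are droppable; the "spills_" prefix is an assumed naming convention, and the pymongo calls shown in the comment should be adapted to your deployment.

```python
# Sketch: pick out per-run spill collections that can be dropped.
# The "spills_" prefix is an assumed naming convention, not a MAUS default.
# With pymongo you might then do something like:
#     client = pymongo.MongoClient(mongodb_host, mongodb_port)
#     db = client[db_name]
#     for name in stale_spill_collections(db.list_collection_names(),
#                                         keep=[current_collection]):
#         db.drop_collection(name)
def stale_spill_collections(collection_names, keep, prefix="spills_"):
    """Return collections with the given prefix that are not in 'keep'."""
    keep = set(keep)
    return [name for name in collection_names
            if name.startswith(prefix) and name not in keep]
```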

Do NOT resize the Celery worker xterm once you've started Celery

Resizing the Celery worker xterm causes a Celery worker's sub-processes to die and new ones to be created. This may cause complications for any currently-running jobs. It is unclear why this arises (it is not a MAUS-specific bug).

Errors in Celery worker window during Celery start-up

"write on a pipe with no one to read it"

If you see the error,

*** Break *** write on a pipe with no one to read it

then please ignore this. It is unclear why this arises, but it is not specific to MAUS and does not seem to have any negative side effects.

It only occurs in the MAUS control room if both the --purge and -c N arguments are used and N is > 2.

"Connection refused"

If you see the error,

[2012-03-16 13:33:09,035: WARNING/MainProcess] [Errno 111] Connection refused
[2012-03-16 13:33:09,036: INFO/MainProcess] process shutting down

then RabbitMQ is not running or is inaccessible. You should check RabbitMQ.

Errors in Celery worker window during operation

"broker forced connection closure"

If you see the error,

AMQPConnectionException: (320, u"CONNECTION_FORCED - broker forced connection closure 
with reason 'shutdown'", (0, 0), '')

then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.

"Connection refused"

If you see the error,

[2012-03-16 13:30:57,411: ERROR/MainProcess] Consumer: Connection Error: 
[Errno 111] Connection refused. Trying again in 10 seconds...

then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.

"Couldn't send result"

If you see the error,

[2012-03-16 13:41:37,376: WARNING/PoolWorker-2] Couldn't send result for 
'd2e08516-ae54-4ec1-b88a-bd84bcac05a5': error(111, 'Connection refused'). Retry in 1.0s.

and after a few seconds it does not resolve itself, then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.

"expected MAUS release version"

If you see the error,

[2012-03-16 10:31:39,182: INFO/MainProcess] Status: {'status': 'error',
'error': [{'message': 'maus_version: expected MAUS release version 0.1.4, 
got MAUS release version 0.1.4', 'error': "<type 'exceptions.ValueError'>"}]}

then the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.

"No such transform"

If you see the error,

[2012-03-16 16:50:16,584: INFO/MainProcess] Birthing transform [u'MapPyNewTransform']
[2012-03-16 16:50:16,586: INFO/MainProcess] Status: {'status': 'error', 'error': 
[{'message': 'No such transform: MapPyNewTransform', 'error': "<type 'exceptions.ValueError'>"}]}

then the MAUS framework has requested a transform that does not exist. This may be because the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.

"Status: {'status':'error'...}"

If you see the error,

[2012-03-16 12:12:36,455: INFO/MainProcess] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,487: INFO/PoolWorker-2] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,522: WARNING/PoolWorker-2] MapPyGroup
[2012-03-16 12:12:36,525: INFO/PoolWorker-1] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:12:36,561: WARNING/PoolWorker-1] MapPyGroup
[2012-03-16 12:12:36,562: INFO/MainProcess] Status: {'status': 'error', 'error': 
[{'message': 'Some transform problem', 'error': "<type 'exceptions.ValueError'>"}]}

then either a current transform threw an exception during its death or a new transform threw an exception during its birth. You should investigate problems with birth and investigate problems with transform death in Celery.

"WorkerBirthFailedException"

If you see the error,

[2012-03-16 12:14:30,721: INFO/PoolWorker-1] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:14:30,756: INFO/MainProcess] Status: {'status': 'error', 'error': 
[{'message': 'MapPyGroup.MapPyGroup returned False', 'error': 
"<class 'framework.workers.WorkerBirthFailedException'>"}]}

then a new transform returned False during its birth which may hide a more serious problem. You should investigate problems with birth.

"WorkerDeathFailedException"

If you see the error,

[2012-03-16 12:36:11,280: WARNING/PoolWorker-1] DEATH MapPyGroup
[2012-03-16 12:36:11,470: INFO/PoolWorker-2] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:36:11,470: WARNING/PoolWorker-2] DEATH MapPyGroup
[2012-03-16 12:36:11,471: INFO/MainProcess] Status: {'status': 'error', 'error':
 [{'message': 'MapPyGroup.MapPyGroup returned False', 'error': 
"<class 'framework.workers.WorkerDeathFailedException'>"}]}

then a current transform returned False during its death which may hide a more serious problem. You should investigate problems with transform death in Celery.

"WorkerProcessException"

If you see,

[2012-03-16 10:51:00,337: ERROR/MainProcess]
Task mauscelery.maustasks.MausGenericTransformTask[c5e35e98-5d99-4256-a4ae-9256a1f46844] raised exception: 
UnpickleableExceptionWrapper('framework.workers', 'WorkerProcessException', (), 'WorkerProcessException()')
Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: WorkerProcessException()
Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: WorkerProcessException()

then an error occurred when transforming a spill.

You may want to wait and see whether this is a one-off occurrence and leave it (the framework will keep processing spills). But if all spills are failing you should stop your client and investigate spill processing problems.

Errors in "multi_process_input_transform" (or "multi_process") client window

"RabbitMQ cannot be contacted"

If you see the error,

framework.utilities.RabbitMQException: RabbitMQ cannot be contacted. Problem is [Errno 111] Connection refused

then RabbitMQ has stopped or has become unavailable. You should check RabbitMQ.

"No Celery nodes are available"

If you see the error,

framework.utilities.NoCeleryNodeException: No Celery nodes are available

then no Celery workers are running or are accessible. You should start one or more Celery workers.

"Exception when using document store"

If you see the error,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: 
could not connect to localhost:27017: [Errno 111] Connection refused

Or,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: 
[Errno 111] Connection refused

Or,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: connection closed

then MongoDB has not been started or has suddenly stopped. You should check MongoDB.

"Celery node(s) failed to configure...expected MAUS release version"

If you see the error,

framework.utilities.CeleryNodeException: Celery node(s) failed to configure: 
[(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': [{u'message': u'maus_version: 
expected MAUS release version 0.1.4, got MAUS release version 0.1.4', 
u'error': u"<type 'exceptions.ValueError'>"}]})]

then the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.

"No such transform"

If you see the error,

Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: 
[(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': 
[{u'message': u'No such transform: MapPyNewTransform', u'error': u"<type 'exceptions.ValueError'>"}]})]

then the MAUS framework has requested a transform that does not exist. This may be because the MAUS framework is running under a different version of MAUS than the Celery workers. You should ensure consistent MAUS versions.

"Celery node(s) failed to configure"

If you see the error,

Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: 
[(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': 
[{u'message': u'Some transform death problem', u'error': u"<type 'exceptions.ValueError'>"}]})]

then either a current transform threw an exception during its death or a new transform threw an exception during its birth. You should investigate problems with birth and investigate problems with transform death in Celery.

"WorkerBirthFailedException"

If you see the error,

---------- RUN 3386 ----------
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: 
[(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': 
[{u'message': u'MapPyGroup.MapPyGroup returned False', u'error': 
u"<class 'framework.workers.WorkerBirthFailedException'>"}]})]

then a new transform returned False during its birth which may hide a more serious problem. You should investigate problems with birth.

"WorkerDeathFailedException"

If you see the error,

New run detected...waiting for current processing to complete
---------- RUN 3386 ----------
Configuring Celery nodes and birthing transforms...
Traceback (most recent call last):
...
framework.utilities.CeleryNodeException: Celery node(s) failed to configure: 
[(u'maus.epcc.ed.ac.uk', {u'status': u'error', u'error': 
[{u'message': u'MapPyGroup.MapPyGroup returned False', u'error': 
u"<class 'framework.workers.WorkerDeathFailedException'>"}]})]
Celery worker:
[2012-03-16 12:36:11,280: WARNING/PoolWorker-1] DEATH MapPyGroup
[2012-03-16 12:36:11,470: INFO/PoolWorker-2] Birthing transform 
[u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints']
[2012-03-16 12:36:11,470: WARNING/PoolWorker-2] DEATH MapPyGroup
[2012-03-16 12:36:11,471: INFO/MainProcess] Status: {'status': 'error', 'error': 
[{'message': 'MapPyGroup.MapPyGroup returned False', 'error': 
"<class 'framework.workers.WorkerDeathFailedException'>"}]}

then a current transform returned False during its death which may hide a more serious problem. You should investigate problems with transform death in Celery.

"Celery task ... FAILED"

If you see,

Celery task b43d68ce-7ec6-4b74-96d2-a07595a55e44 FAILED
 Celery task b43d68ce-7ec6-4b74-96d2-a07595a55e44 FAILED :  : Traceback (most recent call last):
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/execute/trace.py", line 47, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/task/__init__.py", line 247, in __call__
    return self.run(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/third_party/install/lib/python2.7/site-packages/celery-2.4.6-py2.7.egg/celery/app/__init__.py", line 175, in run
    return fun(*args, **kwargs)
  File "/home/michaelj/maus-bzr/maus/src/common_py/mauscelery/tasks.py", line 55, in execute_transform
    status)
WorkerProcessException: [u'MapCppTOFDigits', u'MapCppTOFSlabHits', u'MapCppTOFSpacePoints'] process threw an exception: {'message': 'Some transforming spill problem!', 'error': "<type 'exceptions.ValueError'>"}

then an error occurred when transforming a spill. You may want to wait and see whether this is a one-off occurrence and leave it (the framework will keep processing spills). But if all spills are failing you should stop your client and investigate spill processing problems.

"New run detected...waiting for current processing to complete" ... then nothing

If you see,

New run detected...waiting for current processing to complete

and the client then just sits there for ages, then the Celery worker may have stopped. You should start one or more Celery workers.

Spills are being submitted but no SUCCESS or FAILURE messages are returned

If you see many lines like,

INPUT: read next spill
Spills input: 147 Processed: 5 Failed 0
Task ID: 0283acb5-5757-44f4-8965-ac9828de7850
1 spills left in buffer
INPUT: read next spill
Spills input: 148 Processed: 5 Failed 0
Task ID: ead99627-e621-48ec-ab96-799570dc8922
1 spills left in buffer
INPUT: read next spill
Spills input: 149 Processed: 5 Failed 0
Task ID: 072ce8e5-9a67-4ca4-b58d-7f14be4a650a

But no SUCCESS or FAILURE messages like,
 Celery task c6b31e02-5294-4550-b00e-d8403f44f7f2 SUCCESS 
   SAVING to collection spills (with ID 5)
Spills input: 5 Processed: 5 Failed 0

then the Celery worker may have stopped. You should start one or more Celery workers.
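To confirm whether any workers are still alive, you can ping them from Python. The inspect import shown in the comment follows the Celery 2.x-era API that MAUS used at the time; the helper itself just interprets the response and is illustrative, not a MAUS utility.

```python
# Sketch: interpret a Celery worker ping. With the Celery 2.x-era API the
# response would be obtained with, e.g.:
#     from celery.task.control import inspect
#     pongs = inspect().ping()   # None or {} when no worker answers
def workers_alive(ping_response):
    """Return True if at least one Celery worker answered the ping."""
    return bool(ping_response)
```

If this returns False, start one or more Celery workers as described in the recovery check-lists below.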

Errors in "multi_process_merge_output" (or "multi_process") client window

"Exception when using document store"

If you see the error,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: 
could not connect to localhost:27017: [Errno 111] Connection refused

Or,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: 
[Errno 111] Connection refused

Or,

docstore.DocumentStore.DocumentStoreException: Exception when using document store: connection closed

then MongoDB has not been started or has suddenly stopped. You should check MongoDB.

"BIRTH merger ... Traceback" or "BIRTH outputter ... Traceback"

If you see a Traceback during the birth of a merger or outputter e.g.

BIRTH merger ReducePyTOFPlot.ReducePyTOFPlot
Traceback (most recent call last):
...
ValueError: Some merger birth problem

then a merger or outputter threw an exception during its birth. You should investigate problems with birth.

"WorkerBirthFailedException"

If you see a WorkerBirthFailedException during the birth of a merger or outputter e.g.

---------- START RUN 3386 ----------
BIRTH merger ReducePyTOFPlot.ReducePyTOFPlot
Traceback (most recent call last):
...
framework.workers.WorkerBirthFailedException: ReducePyTOFPlot.ReducePyTOFPlot returned False

then a merger or outputter returned False during its birth which may hide a more serious problem. You should investigate problems with birth.

"DEATH merger ... Traceback" or "DEATH outputter ... Traceback"

If you see a Traceback during the death of a merger or outputter e.g.

Finishing current run...sending end_of_run to merger
DEATH merger ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts
Traceback (most recent call last):
...
ValueError: Some merger death problem

then a merger or outputter threw an exception during its death. You should investigate problems with death.

"WorkerDeathFailedException"

If you see a WorkerDeathFailedException during the death of a merger or outputter e.g.

Finishing current run...sending end_of_run to merger
DEATH merger ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts
Traceback (most recent call last):
...
framework.workers.WorkerDeathFailedException: ReducePyHistogramTDCADCCounts.ReducePyHistogramTDCADCCounts returned False

then a merger or outputter returned False during its death which may hide a more serious problem. You should investigate problems with death.

Recovery check-lists

Check RabbitMQ

  • Log in as a super-user by using sudo su - or su.
  • Check RabbitMQ is running OK.
    $ /sbin/service rabbitmq-server status
    
  • If RabbitMQ is not running then start it up:
    $ /sbin/service rabbitmq-server start
    
  • If RabbitMQ is running, check that MAUS is configured with the correct RabbitMQ connection information. For the deployment (framework or Celery worker) that is showing the problem:
    • Check src/common_py/mauscelery/celeryconfig.py:
      BROKER_HOST = "localhost" 
      BROKER_PORT = 5672
      BROKER_USER = "maus" 
      BROKER_PASSWORD = "suam" 
      BROKER_VHOST = "maushost" 
      
    • If RabbitMQ is running on a different host, set BROKER_HOST to the full host name of the host on which RabbitMQ is running.
    • If the settings are fine but the current host is different from the RabbitMQ host, check that the RabbitMQ host allows connections on port 5672. If it does not, open the port or ask someone who knows how to do this.
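If in doubt about connectivity, a plain TCP check against the broker host and port (the BROKER_HOST and BROKER_PORT values from celeryconfig.py) distinguishes a firewall or stopped-server problem from a MAUS configuration problem. This is a generic sketch, not a MAUS utility.

```python
# Sketch: check that a TCP connection to the RabbitMQ broker succeeds,
# using the BROKER_HOST/BROKER_PORT values from celeryconfig.py.
import socket

def broker_reachable(host="localhost", port=5672, timeout=5.0):
    """Return True if host:port accepts a TCP connection within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, OSError):
        return False
```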

For more on RabbitMQ and Celery see the general guides.

Start one or more Celery workers

For example,

$ celeryd -l INFO -c 8 --purge

For more on starting Celery workers see the general guides.

Ensure consistent MAUS versions

MAUS clients need to run the same version of MAUS as the Celery workers they use to process spills.

Ensure that your MAUS clients and Celery workers are using the same version of MAUS. Then start one or more Celery workers and rerun your client.
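The version check that produces the "expected MAUS release version" error above amounts to a string comparison between the client's version and each worker's. A minimal illustrative sketch (the function and the response shape are assumptions based on the log excerpts above, not MAUS API):

```python
# Sketch: find workers whose reported MAUS version disagrees with the
# client's. Illustrative only; not part of the MAUS framework.
def version_mismatches(client_version, worker_versions):
    """Return sorted (worker, version) pairs that differ from client_version."""
    return [(worker, version)
            for worker, version in sorted(worker_versions.items())
            if version != client_version]
```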

For more on starting Celery workers see the general guides.

Investigate problems with transform death in Celery

MAUS Celery components will raise an error if a problem arises when invoking death on their current transforms. It is possible however to configure the worker to recreate and reconfigure the transforms for ongoing use. This can be done just by rerunning your client.

You may also want to, for the benefit of other users, investigate problems with death.

Investigate problems with birth

The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be one of:

  • A configuration error, in which case check that your MAUS configuration is that expected by the transform by comparing it to the transform's documentation (which should document any expected configuration values). If there is no documentation for the expected configuration values, then raise a new issue.
  • An implementation error, in which case raise a new issue.

Investigate spill processing problems

The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be one of:

  • A configuration error, in which case check that your MAUS configuration is that expected by the transform by comparing it to the transform's documentation (which should document any expected configuration values). If there is no documentation for the expected configuration values, then raise a new issue.
  • An implementation error, in which case raise a new issue.

Investigate problems with death

The solution may depend upon the component (input, transform, merge, output) that is failing and the nature of the error message. The cause may be an implementation error, in which case raise a new issue.

Check MongoDB

  • Log in as a super-user by using sudo su - or su.
  • Check MongoDB is running OK.
    $ /sbin/service mongod status
    mongod (pid 4357) is running...
    
  • If MongoDB is not running then start it up:
    $ /sbin/service mongod start
    
  • If MongoDB is running, check that MAUS is configured with the correct MongoDB connection information:
    • Check src/common_py/ConfigurationDefaults.py:
      mongodb_host = "localhost" 
      mongodb_port = 27017 
      
    • If MongoDB is running on a different host, set mongodb_host to the full host name of the host on which MongoDB is running.
    • If the settings are fine but the current host is different from the MongoDB host, check that the MongoDB host allows connections on port 27017. If it does not, open the port or ask someone who knows how to do this.

For more on MongoDB see the general guides.

Updated by Jackson, Mike over 9 years ago · 50 revisions