Project

General

Profile

Bug #1698

OnRec investigation

Added by Rogers, Chris over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Online reconstruction
Target version:
-
Start date:
20 June 2015
Due date:
% Done:

100%

Estimated time:
Workflow:
New Issue

Description

Report that onrec is broken cf:

https://micewww.pp.rl.ac.uk/elog/MICE+Log/3195


Files

maus_bug.tar.gz (157 Bytes) maus_bug.tar.gz Rogers, Chris, 24 June 2015 10:31
#1

Updated by Rogers, Chris over 7 years ago

I got the online recon working in "test" mode against 07050. Looks like stuff is coming through okay, so not sure why it wasn't working against data (target is on the blink so I have not been able to test against live data yet) - are the DAQ environment variables okay?

Edit to add: I meant I was running test_analyse_data_online.py and modified the script to use run 07050 (Step IV run), previously it was running "usual" Step I summary. There is a list of ints which define the runs that are executed. I was switching between 07050 and 07065.

#2

Updated by Rogers, Chris over 7 years ago

Hum, I tried on 07065 and for some reason I am now getting an error in input-transform like

---------- RUN 0 ----------
Configuring Celery nodes and birthing transforms...
Clearing Globals
Traceback (most recent call last):
  File "/home/mice/MAUS/.maus_dev/bin/online/analyze_data_online_input_transform.py", line 69, in <module>
    run()
  File "/home/mice/MAUS/.maus_dev/bin/online/analyze_data_online_input_transform.py", line 66, in run
    MAUS.Go(my_input, my_map, reducer, output_worker, data_cards) 
  File "/home/mice/MAUS/.maus_dev/src/common_py/Go.py", line 132, in __init__
    self.get_job_footer())
  File "/home/mice/MAUS/.maus_dev/src/common_py/framework/input_transform.py", line 286, in execute
    self.process_event(spill)
  File "/home/mice/MAUS/.maus_dev/src/common_py/framework/input_transform.py", line 318, in process_event
    self.start_of_run(current_run_number)
  File "/home/mice/MAUS/.maus_dev/src/common_py/framework/input_transform.py", line 221, in start_of_run
    self.config_doc, self.config_id, self.run_number)
  File "/home/mice/MAUS/.maus_dev/src/common_py/framework/utilities.py", line 258, in birth_celery
    num_nodes = CeleryUtilities.ping_celery_nodes()
  File "/home/mice/MAUS/.maus_dev/src/common_py/framework/utilities.py", line 223, in ping_celery_nodes
    raise NoCeleryNodeException()
framework.utilities.NoCeleryNodeException: No Celery nodes are available
#3

Updated by Rogers, Chris over 7 years ago

Nb: celeryd.log looks just fine; time stamps indicate that maus-input-transform.log was the first to die.

$> ps -eF | grep celery

reveals no lurking celery process...

#4

Updated by Rogers, Chris over 7 years ago

I just went through the 07050 -> 07065 loop; 07065 consistently fails, 07050 consistently works. I guess at this point that 07065 is making an error in one of the celery workers? I don't know... let's do some more digging. Welcome to the darkness.

#5

Updated by Rogers, Chris over 7 years ago

I was checking the output for the reconstruction job, I see line like:

Spills input: 136 Processed: 142 Failed 42

Note these counters are generated by input_transform.py on line 148. Looking in celeryd.log

[2015-06-21 00:29:00,754: INFO/MainProcess] Task mauscelery.maustasks.MausGenericTransformTask[d4559496-8f98-454e-b26e-f5dfca614e7f] succeeded in 0.0125689506531s: '{"daq_event_type":"end_of_burst","errors":{},...
[2015-06-21 00:29:00,755: INFO/MainProcess] Got task from broker: mauscelery.maustasks.MausGenericTransformTask[ea0cd993-48b0-4831-8401-4c99e121a797]
[2015-06-21 00:29:00,772: INFO/PoolWorker-28] None[ea0cd993-48b0-4831-8401-4c99e121a797]: Task invoked by miceonrec02.micenet.rl.ac.uk-6916
[2015-06-21 00:29:00,784: INFO/MainProcess] Task mauscelery.maustasks.MausGenericTransformTask[ea0cd993-48b0-4831-8401-4c99e121a797] succeeded in 0.0126881599426s: '{"daq_event_type":"start_of_burst","errors":{...
[2015-06-21 00:29:03,282: ERROR/MainProcess] Task mauscelery.maustasks.MausGenericTransformTask[f4da04c9-71b0-4977-b4d5-eb740645e422] raised exception: TimeLimitExceeded(10,)
Traceback (most recent call last):
  File "/home/mice/MAUS/.maus_dev/third_party/install/lib/python2.7/site-packages/celery-2.5.5-py2.7.egg/celery/concurrency/processes/pool.py", line 370, in _on_hard_timeout
    raise TimeLimitExceeded(hard_timeout)
TimeLimitExceeded: 10
[2015-06-21 00:29:03,283: ERROR/MainProcess] Hard time limit (10s) exceeded for mauscelery.maustasks.MausGenericTransformTask[f4da04c9-71b0-4977-b4d5-eb740645e422]
[2015-06-21 00:29:03,335: INFO/PoolWorker-41] child process calling self.run()
[2015-06-21 00:29:03,336: INFO/PoolWorker-41] Setting MAUS ErrorHandler to raise exceptions
Warning - failed to open module /home/mice/MAUS/.maus_dev/src/legacy/FILES/Models/Modules/Box - I assume it is not required

Looks like the workers are only allowed to exist for 10 seconds, before they have to reinitialise (i.e. restart the entire subprocess)! Urk!

#6

Updated by Rogers, Chris over 7 years ago

Ta-da!

Spills input: 165 Processed: 168 Failed 0

So I changed a few things:

  • On a tip from tracker folks, I tried to catch explicitly "tracker failed to unpack errors". I am not sure if this did anything
    --- src/input/InputCppDAQData/InputCppDAQData.cc    2015-05-29 12:22:21 +0000
    +++ src/input/InputCppDAQData/InputCppDAQData.cc    2015-06-20 23:09:06 +0000
    @@ -140,7 +140,11 @@
    -        _vLSBFragmentProc_cpp->fill_daq_data();
    +        try {
    +            _vLSBFragmentProc_cpp->fill_daq_data();
    +        } catch(...) {
    +            std::cerr << "TRACKER UNPACKING ERROR" << std::endl;
    +        }
    
  • I increased the timeout on the celery task to avoid the timeout errors
    --- src/common_py/mauscelery/tasks.py    2014-06-19 15:24:48 +0000
    +++ src/common_py/mauscelery/tasks.py    2015-06-20 23:46:51 +0000
    @@ -34,7 +34,7 @@
     # (see https://github.com/celery/celery/issues/802) but appears to require
     # celery > 3.0 or so. I believe decorators are resolved when they are first
     # encountered by the parser (but others may know more) - Rogers
    -@task(name="mauscelery.maustasks.MausGenericTransformTask", time_limit=10)
    +@task(name="mauscelery.maustasks.MausGenericTransformTask", time_limit=100)
    

    I also changed this, not sure if it does anything
    --- src/common_py/mauscelery/celeryconfig.py    2013-09-26 14:24:36 +0000
    +++ src/common_py/mauscelery/celeryconfig.py    2015-06-20 23:44:15 +0000
    @@ -33,7 +33,7 @@
     # Can only fetch the result of a task once as it is sent as a message.
     CELERY_RESULT_BACKEND = "amqp" 
     # Results expire in 30 seconds.
    -CELERY_TASK_RESULT_EXPIRES = 30
    +CELERY_TASK_RESULT_EXPIRES = 60
     #CELERY_AMQP_TASK_RESULT_EXPIRES = 30 deprecated
    
#7

Updated by Rogers, Chris over 7 years ago

I note that run 7065 now runs in "test online" without crashing.

#8

Updated by Rogers, Chris over 7 years ago

Logs attached. I note that this is the same bug as before

#10

Updated by Rogers, Chris over 7 years ago

Durga noted that there was a crash in the ckov reducer. I looked at run 07089 in online mimic, saw the same crash in maus-input-transform.log.

#11

Updated by Rajaram, Durga over 7 years ago

The Ckov error was a red herring. It turned out it was coming from empty spills, leading to 0 particle triggers in a spill, in turn causing the DAQ to say it was a bad event.
I'm testing a fix. Will push to MLCR & trunk after I validate.

#12

Updated by Rajaram, Durga over 7 years ago

Pushed the fix to the MAUS trunk rev.874
It is also on onrec02/03 MAUS/.maus_dev

#13

Updated by Rogers, Chris over 7 years ago

Few additional things:

  1. looks like now with DS on, MAUS is lagging. Probably it is useful to have the ability to sample the data. I implemented this on onrec02 with datacard daq_sampling_period
  2. I pushed onrec02 dev branch to lp:~maus-mlcr/maus/control-room-3

Now maus-input-transform.log looks like:

...
Spill 31 1
Spill 31 1
Spill 32 2
Spill 32 2
Spill 32 2
Spill 33 3
Spill 33 3
Spill 33 3
Spill 34 4
Spill 34 4
Spill 34 4
Spill 35 5
Spill 35 5
Spill 35 5
Spill 36 6
Spill 36 6
Spill 36 6
Spill 37 7
Spill 37 7
Spill 37 7
Spill 38 8
Spill 38 8
Spill 38 8
Spill 39 9
Spill 39 9
Spill 39 9
Spill 40 0
1 spills left in buffer
Polling celery queue
    Celery task 3925e8e2-a741-442f-88c4-faf3dcd378b8 status is PENDING
    Celery task 578c8b96-edf8-4e67-820e-feefd414fa55 status is PENDING
    Celery task 8c7a47d6-9974-4c0e-b066-d85e19b20ecd status is SUCCESS
Spills input: 9 Processed: 15 Failed 1
2 tasks remain in queue
Submitting task 1af821f0-1298-4c0b-81b4-a58874141a50
Spill 40 0
1 spills left in buffer
Polling celery queue
    Celery task 3925e8e2-a741-442f-88c4-faf3dcd378b8 status is PENDING
    Celery task 578c8b96-edf8-4e67-820e-feefd414fa55 status is PENDING
    Celery task 1af821f0-1298-4c0b-81b4-a58874141a50 status is PENDING
3 tasks remain in queue
Submitting task ff13f332-4906-4332-8db3-8c106e90c89e
Spill 40 0
1 spills left in buffer
Polling celery queue
    Celery task 3925e8e2-a741-442f-88c4-faf3dcd378b8 status is FAILURE
Spills input: 11 Processed: 16 Failed 2
    Celery task 578c8b96-edf8-4e67-820e-feefd414fa55 status is PENDING
    Celery task 1af821f0-1298-4c0b-81b4-a58874141a50 status is SUCCESS
Spills input: 11 Processed: 17 Failed 2
    Celery task ff13f332-4906-4332-8db3-8c106e90c89e status is PENDING
2 tasks remain in queue
Submitting task eec5519f-7ebf-47a2-aaac-44e9929dba00
Spill 41 1
Spill 41 1
Spill 41 1
Spill 42 2
...

where the Spill X Y refers to Spill number followed by a counter to the next spill that MAUS will attempt to reconstruct.

#14

Updated by Rogers, Chris over 7 years ago

Nb: note the failures are due to timeouts (this is run 7149, and timeout limit is 100 s)

#15

Updated by Rajaram, Durga over 7 years ago

For higher rate runs [ 7155(reference run), 7156 (300 MeV muons) ~ 200 triggers/spill ], sampling /10 still showed a lag.
In addition there was one error from MongoDB about the doc size exceeding the limit (16 MB!) which crashed the online reco.
Restarted it and didn't see another error for the remainder of the high-rate runs.

For the runs with ~20-30 triggers/spill, sampling /10 keeps the online reco in sync with data.
There was a plot refresh rate of /60 which I upped to /5. This in conjunction with the sampling /10 means histogram displays are getting updated every 20 spills.

#16

Updated by Rogers, Chris over 7 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

Looks like it has been running smoothly, with thanks to Durga.

Also available in: Atom PDF