Feature #1717

Offline reconstruction job in MLCR

Added by Rogers, Chris about 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Category: Online reconstruction
Target version:
Start date: 21 July 2015
Due date:
% Done: 100%
Estimated time:
Workflow: New Issue

Description

The plan is to run an "offline" reconstruction job in the MLCR for the physics shifter.


Files

reco_07149_trunk.log (12.7 KB), Rajaram, Durga, 22 July 2015 22:37
reco_07149_offline_recon_tracker.log (167 KB), Rajaram, Durga, 22 July 2015 22:37
maus-input-transform.log (310 KB), Rogers, Chris, 22 July 2015 23:04
reco_maus_scifi_recon_devel.log (12.8 KB), Rajaram, Durga, 22 July 2015 23:40
reco_maus_scifi_recon_devel_Onrec03.log (20.5 KB), Rajaram, Durga, 23 July 2015 00:56

Related issues

Related to MAUS - Feature #1312: Online reconstruction requires new API (Open; Rogers, Chris; 15 July 2013)

#1

Updated by Rogers, Chris about 6 years ago

On onrec03 - I checked out what I believe is the best code,

cd MAUS
bzr checkout bzr+ssh://bazaar.launchpad.net/~christopher-hunt08/maus/maus_scifi_recon_devel .offline_recon_tracker

I am now going through the install procedure. Note

  • we are taking data tonight, it would be great to have this set up by this evening.
  • I would like to move some code from lp:~chris-rogers/maus/1312b across as well; this has the InputPySocket and OutputPySocket stuff required to get mongodb out of the online reconstruction. I have been having trouble testing the online reconstruction, which has blocked this a bit (and I didn't manage to find the time to work around my problems)...
#2

Updated by Rogers, Chris about 6 years ago

I got a test fail. I am ignoring it:

======================================================================
FAIL: Check that PIDVarC pid
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mice/MAUS/.offline_recon_tracker/build/test_MapCppGlobalPID.py", line 214, in test_KL_PID
    self.assertTrue('recon_events' in spill_out)
AssertionError: False is not true
-------------------- >> begin captured stdout << ---------------------
{u'spill_number': 0, u'errors': {u'MapCppGlobalPID': u"<class 'ErrorHandler.CppError'>: In branch recon_events\nIn branch kl_event\nIn branch kl_cell_hits\nIn branch kl\nIn branch err_x\nMissing required branch err_x converting json->cpp at ValueItem::_SetCppChild"}, u'daq_event_type': u'', u'maus_event_type': u'Spill', u'run_number': 0}

--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: Check that process fills global events from detector data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mice/MAUS/.offline_recon_tracker/build/test_MapCppGlobalReconImport.py", line 98, in test_fill_Global_Event
    self.assertTrue('recon_events' in spill_out)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 587 tests in 346.813s

FAILED (SKIP=1, failures=2)
Traceback (most recent call last):
  File "/home/mice/MAUS/.offline_recon_tracker/src/common_py/docstore/root_document_store/_socket_manager.py", line 274, in _message_handler
    message = ROOT.TMessage()
AttributeError: FAIL Failed unit tests. Fatal error - aborting
#3

Updated by Dobbs, Adam about 6 years ago

Yes, they are known, Chris is working on the fix. Once he has got that sorted I will merge his code into the trunk and make a release.

#4

Updated by Rogers, Chris about 6 years ago

Okay, I think everything is installed and working. I added a small script to do the offline recon, but using InputCppDAQOnlineData; this looks to be working okay, but I didn't have any run data to test with.

#5

Updated by Rogers, Chris about 6 years ago

Nb: the online recon is not working yet, there is a small hiccup in some of the code... maybe... but the offline recon does appear to be working... but not very well tested... I will keep working on it.

#6

Updated by Rajaram, Durga about 6 years ago

What is not working with online recon?
I can look as well if it'll help

#7

Updated by Rogers, Chris about 6 years ago

I am using /home/mice/MAUS/.offline_recon_tracker on onrec03 and pushing to bzr+ssh://bazaar.launchpad.net/~mice-lcr/maus/offrec/

#8

Updated by Rogers, Chris about 6 years ago

I made a small offline job in bin/utilities/analyze_data_fast_turnaround/

#9

Updated by Rogers, Chris about 6 years ago

Tip from Durga, add

                    '-SciFiDigitizationNPECut=2.0',
                    '-SciFiPRHelicalOn=False',

Seems to be required to make recon work in a reasonable time!

#10

Updated by Rajaram, Durga about 6 years ago

These are the cards that were given by the tracker group -- and they are in the CDB BatchIteration# = 2

simulation_geometry_filename = "Stage4.dat" 
reconstruction_geometry_filename = "Stage4.dat" 
SciFiDigitizationNPECut = 2.0
SciFiPRHelicalOn = False

I have placed these cards in

onrec03: MAUS/.offline_recon_tracker/bin/utilities/analyze_data_fast_turnaround/reco.cards

With field on, I imagine the SciFiPRHelicalOn flag would need to be removed. I suggested to Adam and Chris H. at today's MAUS meeting that they set the flag on/off by querying the CDB and figuring out if the magnets were on.
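
As a rough illustration of that suggestion (not code from any of the branches in this issue), the flag could be derived from the magnet state once it has been read back. How the currents are fetched from the CDB is not shown; the `magnet_currents` mapping, the magnet names and the 1 A threshold are all hypothetical.

```python
# Sketch only: `magnet_currents` stands in for whatever the CDB query returns.
# The magnet names and the 1 A threshold are hypothetical.
def tracker_cards(magnet_currents, threshold_amps=1.0):
    """Return the tracker datacards, enabling helical PR only if a magnet is powered."""
    field_on = any(abs(current) > threshold_amps
                   for current in magnet_currents.values())
    return {
        'SciFiDigitizationNPECut': 2.0,  # value given by the tracker group
        'SciFiPRHelicalOn': field_on,    # False for field-off running
    }


print(tracker_cards({'magnet_a': 0.0, 'magnet_b': 0.0}))
```

The returned dictionary could then be written into a cards file such as reco.cards, or turned into command-line flags.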

#11

Updated by Rajaram, Durga about 6 years ago

The MAUS/.offline_recon_tracker branch seems to be ~5x slower than the trunk -- I'm not sure why.

I ran 100 events of 07149 from both branches -- logs attached.
The .maus_trunk branch took 2 minutes real time.
The .offline_recon_tracker branch took 11 minutes.

Haven't looked to see why that branch is that much slower.

FYI I updated MAUS/.maus_trunk and rebuilt

#12

Updated by Rogers, Chris about 6 years ago

Humm, maybe I screwed something up...

#13

Updated by Rogers, Chris about 6 years ago

I put the log of recon in here. I note that the reconstruction is the dominant time, and tracker recon is the dominant part of that. There is some other stuff, which could either be MAUS gumph or unpacking. Nb: this is the new online recon running. I think Chris Hunt was looking at pushing raw ADC into the data structure, maybe he kept that in his branch (source for this) but left it out of the main recon?

#14

Updated by Rajaram, Durga about 6 years ago

I believe you're right. It does come from the tracker recon.
For one, the output file size is ~60% larger (25 MB/100 events in the trunk, 40 MB/100 events in the offline_recon_tracker branch).

I haven't looked into the guts of the output to see if for instance he finds more spacepoints now, etc.

If the reconstruction in Chris Hunt's branch is doing the 'right thing' and hence taking more time, it might be something we have to live with,
but would be good to know if it's by design.

#15

Updated by Rajaram, Durga about 6 years ago

Hmm...I'm a little (more) puzzled now...

I checked out lp:~christopher-hunt08/maus/maus_scifi_recon_devel and built against the trunk's third party and
ran bin/analyze_data_offline against 07149, 100 events.

It took ~90s and produced a 25 MB file consistent with the trunk.
This was on my machine.

I'm also currently building on

onrec03:MAUS/.maus_scifi_recon_devel

#16

Updated by Rogers, Chris about 6 years ago

Maybe it's just something stupid like two people trying to run at the same time on the same machine (even against the same file?).

I tried building a onrec03:MAUS/.maus_online_reco_mods, explicitly attempting to exclude Recon modifications but I didn't see any improvement. I do have online stuff in the code I am running, so I still don't rule out an issue there, although log file hints at Recon/SciFi being the issue. At least, the InputCppDAQOnline is set up to spit out a DAQ event every 0.5 seconds. For the record, I have reconstructed 210 DAQ events in 1009 seconds.
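
To put the numbers quoted so far on a common footing, the short sketch below converts each timing into seconds per event and compares it with the 0.5 s DAQ event period. The figures are the ones given in this thread; "events" may not mean the same thing in each measurement, so the comparison is indicative only.

```python
# Timings quoted in this thread: (label, number of events, wall-clock seconds).
TIMINGS = [
    ('.maus_trunk, run 07149', 100, 2 * 60),
    ('.offline_recon_tracker, run 07149', 100, 11 * 60),
    ('online input, 210 DAQ events', 210, 1009),
]

DAQ_EVENT_PERIOD = 0.5  # seconds between DAQ events from the online input


def seconds_per_event(n_events, wall_seconds):
    """Average wall-clock time spent per event."""
    return float(wall_seconds) / n_events


for label, n_events, wall_seconds in TIMINGS:
    rate = seconds_per_event(n_events, wall_seconds)
    print('%-36s %.2f s/event (%.1fx the DAQ period)'
          % (label, rate, rate / DAQ_EVENT_PERIOD))
```

Any entry well above one DAQ period per event means the reconstruction cannot keep up with live data at that rate.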

#17

Updated by Rogers, Chris about 6 years ago

Just for information, I have left the new online recon running on onrec03 (.maus_online_reco_mods, which takes .maus_trunk as third party), with the production online recon running on onrec02. I think I will call it a night, and check log files in the morning. I might make it in for the end of the run tomorrow...

#18

Updated by Rajaram, Durga about 6 years ago

Rogers, Chris wrote:

Maybe it's just something stupid like two people trying to run at the same time on the same machine (even against the same file?).

I tried building a onrec03:MAUS/.maus_online_reco_mods, explicitly attempting to exclude Recon modifications but I didn't see any improvement. I do have online stuff in the code I am running, so I still don't rule out an issue there, although log file hints at Recon/SciFi being the issue. At least, the InputCppDAQOnline is set up to spit out a DAQ event every 0.5 seconds. For the record, I have reconstructed 210 DAQ events in 1009 seconds.

I ran 200 events from the .maus_scifi_recon_devel branch [ pulled fresh from lp ] -- 200 events, 170s, again consistent with the trunk.
So, there does seem to be some other difference between this/trunk and the online_reco_mods branch, though I cannot imagine how any of the docstore stuff would affect it.
Scratching my head over whether I'm doing something stupid in the comparison.

#19

Updated by Rajaram, Durga about 6 years ago

If we believe that Chris Hunt's branch

lp:~christopher-hunt08/maus/maus_scifi_recon_devel
contains the latest & greatest tracker reconstruction code, I will attempt to run
bin/utilities/analyze_data_fast_turnaround/analyze_data_fast_turnaround.py
from this branch
when data trickles in [ no sign of it yet ]

#20

Updated by Rajaram, Durga about 6 years ago

Hm...just ran 200 events from the scifi_recon_devel branch and tried loading it up from my trunk install. Able to browse & plot but the ROOT classdefs are different which might make physics shifter/analysis iffy (?)

#21

Updated by Rajaram, Durga about 6 years ago

Ok, I found the reason for the speed and output filesize differences between online_reco_mods & maus_trunk

[mice@miceonrec03 .maus_online_reco_mods]$ diff src/common_py/ConfigurationDefaults.py ../.maus_trunk/src/common_py/ConfigurationDefaults.py|grep SciFi
< SciFiClustExcept = 10000 # exception is thrown # Changed to accomodate bug in tracker reconstruction, Chris Heidt/Rogers, 21-06
> SciFiClustExcept = 100 # exception is thrown
[mice@miceonrec03 .maus_online_reco_mods]$

100 is also the parameter value in Chris Hunt's scifi_recon_devel branch, so I believe that's what it should be.

Have now modified

onrec03:.maus_online_reco_mods/src/common_py/ConfigurationDefaults.py
to reflect this.

The speeds & output are now consistent between the online, scifi_devel and trunk branches
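
A check like the diff above can also be scripted so that it is easy to repeat across branches. The sketch below is a naive illustration, not part of MAUS: it assumes one `NAME = value` assignment per line, ignores trailing `#` comments, and the helper names are hypothetical. The two sample lines are the ones from the diff output.

```python
import re

# Naive "NAME = value  # comment" matcher; values containing '#' are not handled.
ASSIGNMENT = re.compile(r'^(\w+)\s*=\s*(.*?)\s*(?:#.*)?$')


def parse_defaults(text):
    """Map each assigned name to its value (as a string), one assignment per line."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    return values


def differing_keys(left, right, prefix=''):
    """List (name, left value, right value) for names whose values differ."""
    names = sorted(set(left) | set(right))
    return [(name, left.get(name), right.get(name)) for name in names
            if name.startswith(prefix) and left.get(name) != right.get(name)]


online_mods = "SciFiClustExcept = 10000 # exception is thrown # Changed to accomodate bug\n"
trunk = "SciFiClustExcept = 100 # exception is thrown\n"
print(differing_keys(parse_defaults(online_mods), parse_defaults(trunk), 'SciFi'))
```

In practice the two strings would be read from each branch's src/common_py/ConfigurationDefaults.py.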

#22

Updated by Rajaram, Durga about 6 years ago

Tried running

python bin/utilities/analyze_data_fast_turnaround/analyze_data_fast_turnaround.py -configuration_file bin/utilities/analyze_data_fast_turnaround/config.py

Runs for ~100 events and then throws an exception about too many data references.
Same story from both .maus_online_reco_mods & .maus_trunk

Digging...

#23

Updated by Rogers, Chris about 6 years ago

This means that you have a memory leak. There is a datacard which you can vary to fix it (I will head in now and look up the name when I get there) or you can recompile and change the #ifdef in src/common_cpp/DataStructure/Data.cc

#24

Updated by Rajaram, Durga about 6 years ago

It is strange.

After the run ended, I copied over all the 7273.00[0-6] files to onrec03:devel/data
and ran bin/analyze_data_offline.
It ran through all 6 chunks without any data reference count complaint.
Why does it complain when running InputCppDAQOnlineData? Hmm.

#25

Updated by Rajaram, Durga about 6 years ago

There is also a problem with the tracker readout.
There is an error from the VLSB on every event, which causes an unpacking exception, which in turn means the entire spill gets discarded.

This is the reason the shifters reported empty onrec plots.

To "recover" the non-tracker data (TOF etc.), one has to reconstruct with Enable_VLSB_Unpacking=False

I have made an entry in the logbook
https://micewww.pp.rl.ac.uk/elog/MICE+Log/3347
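
A datacard for this could be as small as the fragment below. It is a sketch: only the Enable_VLSB_Unpacking setting comes from this note, and it assumes that MAUS datacards are plain Python assignments, like the cards quoted in note #10.

```python
# Hypothetical datacard fragment: skip tracker (VLSB) unpacking so that the
# VLSB error no longer causes the whole spill to be discarded.
Enable_VLSB_Unpacking = False
```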

#26

Updated by Rajaram, Durga about 6 years ago

I wanted to run the fast turnaround to at least see data from other detectors. To this end, I created a config

bin/utilities/analyze_data_fast_turnaround/config_notracker.py
to not unpack the tracker.
Running
[mice@miceonrec03 .maus_online_reco_mods]$ nohup python bin/utilities/analyze_data_fast_turnaround/analyze_data_fast_turnaround.py -configuration_file bin/utilities/analyze_data_fast_turnaround/config_notracker.py >& analyze_data_fast_turnaround_notracker.log &
[1] 4576

No RefCount errors so far [ ~1230 events & going ].

The output goes in maus_output_7274.root & the log file is as in the command listed above.

Could it be that it was happening before with the tracker errors because when an unpacking error happens, the error-trapping is not deleting a data ref when it should?

In single_thread.py I check for bad-data-input and if bad input, I disregard a run# change [ because InputCppDAQ does not set the spill or run# when an error happens ] but couldn't see a reason to delete data ref. I maybe missed something..

#27

Updated by Rajaram, Durga about 6 years ago

On the bright side, the fast turnaround reconstruction kept up with real data [ though with the tracker unpacking off ]

#28

Updated by Rajaram, Durga about 6 years ago

I have been running offline reconstruction on onrec03. Have been running against the trunk

MAUS/.maus_trunk

No errors, and reconstruction has been keeping up with data, ending when the run ends.

Because there is some field, I decided to turn on the SciFiPRHelicalOn flag:

SciFiPRHelicalOn=True

The card is in
MAUS/.maus_trunk/bin/utilities/analyze_data_fast_turnaround/config_helical.py

Requires babysitting right now to re-start reconstruction after a run ends. Need to get PVs from EPICS to indicate end_run/start_run and such. Or poll & wait for the monitor

#29

Updated by Rajaram, Durga about 6 years ago

Rajaram, Durga wrote:

Requires babysitting right now to re-start reconstruction after a run ends. Need to get PVs from EPICS to indicate end_run/start_run and such. Or poll & wait for the monitor

I take that back. The monitor does catch end of run and stops and starts writing to a new xxxx_run#.root file when a new run starts.

#30

Updated by Rajaram, Durga about 6 years ago

So, finally after ~17,000 events, it died with a 'too many data refs' error.

#31

Updated by Rajaram, Durga about 6 years ago

Rajaram, Durga wrote:

So, finally after ~17,000 events, it died with a 'too many data refs' error.

To clarify, run 7289 had ended, and 7290 had automatically started reconstructing, and after a combined 17,000 events, it croaked.

#32

Updated by Rajaram, Durga about 6 years ago

I had started reconstructing 7289 maybe 30 minutes after the run had started. The question is did it catch up and reconstruct everything, or did it start/stop midway. I'll reconstruct that run again by hand to verify.

#33

Updated by Rajaram, Durga about 6 years ago

I re-ran 7289 by hand to pick up the spills I had missed [ because I started running the reconstruction 'late' ]

#34

Updated by Rajaram, Durga about 6 years ago

The reconstructed output from Runs 7285 (RefRun), 7286, 7287, 7288, 7289, 7290 are available on

onrec03:MAUS/.maus_trunk/output/

They were reconstructed with the trunk [ rev. 897 ], with SciFiPRHelicalOn=True, using the config listed in http://micewww.pp.rl.ac.uk/issues/1717#note-28

Though there's nothing special about running the reconstruction here, would it be possible to burden the physics shifter [ or the tracker group ] to look at these output files to make sure that the MLCR reconstruction works OK?

Need to think how [ automatic compaction/cron rsync? ] & where to [ datamover? IC web? micewww? ] we want to move reconstructed data...

#35

Updated by Rogers, Chris about 6 years ago

Need to think how [ automatic compaction/cron rsync? ] & where to [

datamover?

Okay, but not easily accessible from big bad world

IC web?

Note this is a GRID site, so we probably buy all of the GRID issues if we try to do this.

micewww?

Not enough space there right now. We discussed running a web proxy over http from heplnm069 in the ppd meeting last week. There is a security advantage in running a web proxy rather than directly mounting heplnm069. There may be a disadvantage in terms of maintenance, I am not quite sure. This is what I would shoot for... maybe we don't want to permanently store recon data here though, I don't think we have enough space for that. Probably we can manage, say, last two centuries (or cron job eats oldest run when the storage runs low).

httpd is already running on heplnm069 and serving yum repos I think.
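
The "cron job eats oldest run" idea could look something like the sketch below. It is only an illustration under stated assumptions: the size budget, the directory layout (one flat directory of output files) and the function name are hypothetical, and age is judged by file modification time rather than run number.

```python
import os


def prune_oldest(directory, max_total_bytes):
    """Delete the oldest files in `directory` until their total size fits the budget.

    Returns the list of paths that were removed, oldest first.
    """
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((os.path.getmtime(path), path, os.path.getsize(path)))
    entries.sort()  # oldest modification time first

    total = sum(size for _, _, size in entries)
    removed = []
    for _, path, size in entries:
        if total <= max_total_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A cron entry would then call `prune_oldest` on the directory being served, with a budget chosen to leave headroom for the next run's output.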

#36

Updated by Rogers, Chris about 6 years ago

  • Status changed from Open to Closed
  • Assignee changed from Rogers, Chris to Rajaram, Durga
  • % Done changed from 0 to 100

I think this is now closed? For the record, Durga did it in the end...
