Feature #839

start of run, start of job, end of ...

Added by Tunnell, Christopher almost 10 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Detector Integration
Target version:
Start date:
21 December 2011
Due date:
% Done:

100%

Estimated time:
Workflow:
New Issue

Description

Give some thought. Data vs. MC.


Files

MAUS_Globals_handling.odt (17.6 KB) Rogers, Chris, 22 December 2011 18:21
MAUS_Globals_handling.odt (19.2 KB) Rogers, Chris, 03 January 2012 11:53
Globals_Handling.odt (122 KB) Rogers, Chris, 18 January 2012 14:10
#1

Updated by Tunnell, Christopher almost 10 years ago

  • Category set to common_py
  • Assignee set to Tunnell, Christopher
#3

Updated by Rogers, Chris almost 10 years ago

Can I just give you a partial job spec just to throw some ideas around... it comes at it from a completely different angle so might be interesting (see attachment)

#4

Updated by Rogers, Chris over 9 years ago

Throw the job spec back in with a few modifications. Would appreciate comments if possible...

#5

Updated by Tunnell, Christopher over 9 years ago

Comments Re v2 (line numbers help, but meh):

pg. 3

  • In the DAQ, there are these calibration events we all know about. May be worth mentioning that we are ignoring them?
  • Just for n00bies: might be worth saying that a target dip corresponds to ~100 muons going through our imaginary cooling channel. Just to be utterly clear.
  • "Data taken from the same run may be executed in different processes" I don't follow. Are we parallelizing at any other level than a spill? You mean process in the distributed computing sense? I think it would be much more clear to say we have spills, runs, and jobs. We parallelize at the spill level.
  • I'd say for the job: the job is normally just the script you run in the bin directory. It's like the program. When the job is done, everything is done.
  • Idea of why you want to analyze many runs: variations of a quantity as a function of time. Calibrations. etc.
  • MICERun? Haven't seen this yet. And are they actually pointers in the C++ sense? Anyway, the use of the term 'global' may not be appropriate? There are many scopes and the global scope is the job scope. The run scope is MICERun? Is this just a reference to G4MICE legacy stuff?
  • I would avoid as much as possible talking about the way we do things that we regret: datacards. Hopefully we fix this soon and this is the specification that applies after things are fixed? But it's a good motivation for present day!
  • Need a graph. Use dot or something?

Okay, MAUSRun I imagine is immutable? Same for the job? Otherwise, there can be issues with parallelization... I have a bit of trouble seeing how things turn into both Python and C++ code: will the process interface just be accepting a MAUSSpill, MAUSJob, MAUSRun object? And MAUSSpill is the only one that changes?

pg. 4:

  • "The MAUSRun initialisation and destruction will be handled by the MAUSProcess." Not Go? Is Go like MAUSJob?
  • Datetime sounds good. I'd make it super duper easy to convert that to seconds though by, maybe, an example. It's a tough trade-off: storing seconds means that people can subtract times for their calibration or whatever easily but storing a date that is human readable is useful to understand physically what the data means. Sounds good.
  • Graph of the task breakdown and the dependencies please. It was really confusing to know what you mean by task breakdown until you read the second one referencing the first.
  • bzr branch info? revno and branch name?
  • bzr: version string might not make sense. How do we ensure people don't hack a versioned copy and keep that string the same (this leads into some build system questions I want to chat with everybody about sometime)
  • Configuration datacards in json format (string) Hell yes!!!!! Various experiments don't implement this until way too late and it's so crucial to knowing what you actually ran!!
  • EndOfRunObject: maybe a reason why the run ended? Crash? User hit the stop button?

Design requirement - Important aside: I guarantee you that we'll cut the first few minutes and last 5 minutes of any run. It's a wise thing. Normally things start in weird orders so you want to cut the start of the run. But most importantly, it takes shifters about 5 minutes to notice something screwy happened and they needed to stop the run. If you cut the end of the run, then you are sure you avoid these issues. We should make sure this is possible.
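
A minimal sketch of what such a cut could look like, assuming each spill carries a timestamp in seconds. The function name `select_spills`, its arguments and the example times are hypothetical, not part of MAUS:

```python
def select_spills(spill_times_s, cut_start_s, cut_end_s):
    """Keep spills at least cut_start_s after the first spill and
    at least cut_end_s before the last spill of the run."""
    if not spill_times_s:
        return []
    t_min = min(spill_times_s) + cut_start_s
    t_max = max(spill_times_s) - cut_end_s
    return [t for t in spill_times_s if t_min <= t <= t_max]


# Hypothetical run: one spill per minute for 20 minutes.
spill_times = [60.0 * i for i in range(21)]

# Drop the first 2 minutes and the last 5 minutes.
kept = select_spills(spill_times, cut_start_s=120.0, cut_end_s=300.0)
```

Note that the end cut needs the time of the last spill, which is only known once the run has ended, so this kind of selection has to be possible after the fact.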

  • StartOfRun: what about magnet currents? Or number of actuations?

pg. 5

  • EndOfRun: I think of this more as a signal. It tells the reducers to spit out a different type than normal of output. Maybe the reducer has some field that tells the output that "these are end of run plots, so save them differently?".
  • "The RunIO has the job of handling any run information in the master node." - I viewed the RunIO object as being copied (since it's immutable) to all the worker nodes. I don't understand the master node comment.
  • MAUSProcess: as an aside, it always puzzled me (mea culpa?) how we could get geant4 stuff to talk to python through MAUS. You can drive geant4 from python, but there isn't anything like JSON or TStreamers for geant4 type information as far as I can tell.

pg. 6:

  • Geant4 isn't very modular: we may just have to only have one instance of MAUS per run. As far as I can tell, every other experiment does this. Grr. Ideally we just rebirth or something.
  • The MC being negative run numbers is pretty clever. It makes me wonder if people can trick themselves by checking if run_number > 0. We should be able to run a MC_vs_data_checker that makes the run_number := abs(run_number). My vague recollections of SNO, telling the difference between MC and data was never really an issue. But let's strive to make it so we really can't tell the difference! But please please make a test that flips the sign of data and MC run numbers :)
  • MAUSRun: the calibration info is key. We haven't done enough with that yet. Good idea.
  • Big objection:
    At the moment I don't envisage implementing python bindings for the rest of the MAUSRun. This makes accessing any of this data from python code not possible, making any reasonable reconstruction in python impossible. If folks want to access e.g. calibration from MAUSRun they need to do the python bindings themselves.

    Python is the only language that can do the configDB at the moment in MAUS. Calibrations etc. get spit out as a python dictionary at the moment. We are much closer to doing reconstruction in python than C++. I would also argue that we should do reconstruction first in python correctly then deal with doing x10 the work to implement things faster etc. in C++.

I'm tired. Page 8 and 9 are a different topic. I'll peek at them at some point.

Overall, nice job! It's nice to see somebody thinking about how to tie it all together. There are some criticisms, but only to make a better final product: you have a more clear vision of it than I do.

#6

Updated by Rogers, Chris over 9 years ago

  • In the DAQ, there are these calibration events we all know about.
    May be worth mentioning that we are ignoring them?

At the moment. I think we unpack like a normal physics event but then
leave them in the data structure but pass them over until someone tells
us what to do with them.

  • Just for n00bies: might be worth saying that a target dip
    corresponds to ~100 muons going through our imaginary cooling
    channel. Just to be utterly clear.

Good point.

  • "Data taken from the same run may be executed in different
    processes" I don't follow. Are we parallelizing at any other level
    than a spill? You mean process in the distributed computing sense?
    I think it would be much more clear to say we have spills, runs, and
    jobs. We parallelize at the spill level.
  • I'd say for the job: the job is normally just the script you run in
    the bin directory. It's like the program. When the job is done,
    everything is done.
  • Idea of why you want to analyze many runs: variations of a quantity
    as a function of time. Calibrations. etc.

Sure, also just multiple runs with the same configuration and we want to
do "just use a big data set".

  • MICERun? Haven't seen this yet. And are they actually pointers in
    the C++ sense? Anyway, the use of the term 'global' may not be
    appropriate? There are many scopes and the global scope is the job
    scope. The run scope is MICERun? Is this just a reference to G4MICE
    legacy stuff?

src/legacy/Interface/MICERun.hh Yes they are C++ pointers. Yes it is
legacy G4MICE stuff. Yes the scope is really global (at least static
member functions).

  • I would avoid as much as possible talking about the way we do
    things that we regret: datacards. Hopefully we fix this soon and
    this is the specification that applies after things are fixed? But
    it's a good motivation for present day!

Sorry, I meant data cards as in Configuration.py. Configuration is a
vague term that refers to many things that I find confusing (it also
refers to e.g. stuff we pull off config db for example). I think most
people know what data cards are.

  • Need a graph. Use dot or something?

Okay.

Okay, MAUSRun I imagine is immutable? Same for the job? Otherwise,
there can be issues with parallelization... I have a bit of trouble
seeing how things turn into both Python and C++ code: will the
process interface just be accepting a MAUSSpill, MAUSJob, MAUSRun
object? And MAUSSpill is the only one that changes?

Here's the whole subtlety - MAUSRun changes spill to spill - if the only
communication we allow to workers is through the spill, then the only
way the worker can know if the run changed is to read in the spill.
Unless we stop all workers and reinitialise on a new run (some
advantages to that approach although it's not the one outlined here).

pg. 4:

  • "The MAUSRun initialisation and destruction will be handled by the
    MAUSProcess." Not Go? Is Go like MAUSJob?

I think Go.py lives on the master node only - initialisation and
destruction needs to be done on the child nodes.

  • Datetime sounds good. I'd make it super duper easy to convert that
    to seconds though by, maybe, an example. It's a tough trade-off:
    storing seconds means that people can subtract times for their
    calibration or whatever easily but storing a date that is human
    readable is useful to understand physically what the data means.
    Sounds good.

Have a look at python datetime module - that's where I got this from.
They have a timedelta class that does what you say (like normal!). Just
we want to be able to work in C++ also.
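
For the seconds conversion, a minimal sketch using only the standard-library datetime module. The two timestamps are made-up values, not real run data:

```python
from datetime import datetime, timezone

# Hypothetical start and end of a run, stored as timezone-aware datetimes.
run_start = datetime(2011, 12, 21, 10, 0, 0, tzinfo=timezone.utc)
run_end = datetime(2011, 12, 21, 10, 30, 15, tzinfo=timezone.utc)

# Subtracting two datetimes gives a timedelta; total_seconds() is a float.
run_length_s = (run_end - run_start).total_seconds()

# Seconds since the Unix epoch, e.g. for passing to C++ code as a plain number.
run_start_epoch_s = run_start.timestamp()

# And back again to a human-readable string.
readable = datetime.fromtimestamp(run_start_epoch_s, tz=timezone.utc).isoformat()
```

So the human-readable form can be what is stored, with seconds only one call away when someone needs to subtract times.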

  • Graph of the task breakdown and the dependencies please. It was
    really confusing to know what you mean by task breakdown until you
    read the second one referencing the first.

Okay.

  • bzr branch info? revno and branch name?
  • bzr: version string might not make sense. How do we ensure people
    don't hack a versioned copy and keep that string the same (this leads
    into some build system questions I want to chat with everybody about
    sometime)

Try

.bzr/branch/branch.conf
.bzr/branch/last-revision
ConfigurationDefaults maus_version field

I don't think there is any way to check for changes without calling bzr status, which I would rather not do. Maybe that is the only way (but should at least keep bzr optional, so we record a flag to say whether that failed or not)
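
A rough sketch of reading the revision without calling bzr, recording a flag if it fails. It assumes that .bzr/branch/last-revision holds the revno and the revision id separated by whitespace; that layout, the function name `read_bzr_info` and the dictionary keys are assumptions for illustration only:

```python
import os


def read_bzr_info(branch_root):
    """Return revision info for a bzr branch without invoking bzr itself."""
    info = {"revno": None, "revision_id": None, "bzr_info_ok": False}
    path = os.path.join(branch_root, ".bzr", "branch", "last-revision")
    try:
        with open(path) as fin:
            fields = fin.read().split()
        # Assumed layout: "<revno> <revision-id>".
        revno, revision_id = int(fields[0]), fields[1]
        info.update(revno=revno, revision_id=revision_id, bzr_info_ok=True)
    except (OSError, IndexError, ValueError):
        # Not a bzr checkout, or an unexpected file layout: flag stays False.
        pass
    return info
```

This does not detect local modifications to a versioned copy; it only records which revision the working tree claims to be at.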

  • Configuration datacards in json format (string) Hell yes!!!!!
    Various experiments don't implement this until way too late and it's
    so crucial to knowing what you actually ran!!

Right.

  • EndOfRunObject: maybe a reason why the run ended? Crash? User hit
    the stop button?

(or ran out of input data). Good idea

Design requirement - Important aside: I guarantee you that we'll cut
the first few minutes and last 5 minutes of any run. It's a wise
thing. Normally things start in weird orders so you want to cut the
start of the run. But most importantly, it takes shifters about 5
minutes to notice something screwy happened and they needed to stop
the run. If you cut the end of the run, then you are sure you avoid
these issues. We should make sure this is possible.

  • StartOfRun: what about magnet currents? Or number of actuations?

This is config db stuff I think, I don't think we want to duplicate it. But we only know number of actuations at the end of run, hence EndOfRun data.

  • EndOfRun: I think of this more as a signal. It tells the reducers
    to spit out a different type than normal of output. Maybe the
    reducer has some field that tells the output that "these are end of
    run plots, so save them differently?".

I just think we will find some run summary info we want to stick in the
output - for example number of particles processed, random reducer output, ... ... not sure what yet, just a feeling (and if not, meh, we have an empty data structure, no harm done)

  • "The RunIO has the job of handling any run information in the
    master node." - I viewed the RunIO object as being copied (since it's
    immutable) to all the worker nodes. I don't understand the master
    node comment.

Probably I just don't understand how communication with workers happens
in Celery. I was trying to have all of the communication with workers go
through the spill documents. But the other email thread with Mike Jackson indicates that you can send Python classes direct...

We want to be able to call all of the calibration, configuration things

  • MAUSProcess: as an aside, it always puzzled me (mea culpa?) how we
    could get geant4 stuff to talk to python through MAUS. You can drive
    geant4 from python, but there isn't anything like JSON or TStreamers
    for geant4 type information as far as I can tell.

Python points at the memory location of the C++ binary in memory and executes it. That's in the python/C API (but handled by SWIG).

pg. 6:

  • Geant4 isn't very modular: we may just have to only have one
    instance of MAUS per run. As far as I can tell, every other
    experiment does this. Grr. Ideally we just rebirth or something.

Right, we probably have to kill and reinitialise workers. Not sure if that's possible. For now we do as you say, only one MAUS per run for MC.

  • The MC being negative run numbers is pretty clever. It makes me
    wonder if people can trick themselves by checking if run_number > 0.
    We should be able to run a MC_vs_data_checker that makes the
    run_number := abs(run_number). My vague recollections of SNO,
    telling the difference between MC and data was never really an issue.
    But let's strive to make it so we really can't tell the difference!
    But please please make a test that flips the sign of data and
    MC run numbers :)

Good point.
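
A sketch of what such a sign-flip test might look like. The spill is represented here as a plain dict with hypothetical keys "run_number" and "hits", and `count_hits` is only a stand-in for real reconstruction code:

```python
def flip_run_number(spill):
    """Return a copy of the spill with the sign of run_number flipped."""
    flipped = dict(spill)
    flipped["run_number"] = -spill["run_number"]
    return flipped


def count_hits(spill):
    """Stand-in for code that should not care whether input is MC or data."""
    return len(spill["hits"])


# Hypothetical data spill (positive run number) and its "MC" twin.
data_spill = {"run_number": 3541, "hits": [1.2, 3.4, 5.6]}
mc_spill = flip_run_number(data_spill)

# The result must be identical for either sign of the run number.
assert count_hits(data_spill) == count_hits(mc_spill)
```

Any code that branches on run_number > 0 would show up as a difference between the two results.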

  • MAUSRun: the calibration info is key. We haven't
    done enough with that yet. Good idea.

Thanks.

  • Big objection:
    At the moment I don't envisage implementing python bindings for the
    rest of the MAUSRun. This makes accessing any of this data from python
    code not possible, making any reasonable reconstruction in python
    impossible. If folks want to access e.g. calibration from MAUSRun
    they need to do the python bindings themselves.

Well, the TOF code is all C++ and Tracker group are writing everything in C++. EMR and KL I predict will be all C++. That just leaves Ckov... I thought it might be a bit controversial but not too bad. If you're really unhappy we can think again.

I'm tired. Page 8 and 9 are a different topic. I'll peek at them
at some point.

I just wanted to make it in the specification that developer will write tests and all that stuff. It's more cut and paste "general information" though you are welcome to comment.

#7

Updated by Tunnell, Christopher over 9 years ago

I don't care either way if people use C++ (I just think prototyping in Python is a much better path to follow first...). My main comment was that currently the ConfigDB interface is in Python. We can convert this to JSON easy enough for C++ people. But the interface to calibrations etc. is Python.
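
To illustrate the JSON conversion, a small sketch with the standard json module. The calibration dictionary and its keys are invented for the example; the real ConfigDB output will look different:

```python
import json

# Hypothetical calibration as it might come back from the ConfigDB as a dict.
calibration = {"detector": "TOF1", "run_number": 3541, "t0_ns": [0.12, -0.05, 0.31]}

# Serialise to a string that C++ code could parse with a JSON library.
calibration_json = json.dumps(calibration, sort_keys=True)

# Round trip to check that nothing was lost on the way.
assert json.loads(calibration_json) == calibration
```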

#8

Updated by Rogers, Chris over 9 years ago

  • Category changed from common_py to Detector Integration

I guess we need to bridge the gap from python to C++ somewhere. The question you rightly ask is where. I think that the members of the MAUSRun are largely C++ objects where they are defined, and will be called by C++ code where they are not. So member data is:

Current RunNumber
Calibrations, cabling and associated data for each detector system
Field configuration (BTFieldConstructor)
MiceModules configuration (geometry)

Calibrations and cabling have been defined only for Tof, where we have C++ code that just stores everything on disk (probably because devs don't know how to call the python config db stuff). Tracker group only plan to write code in C++. Don't know about KL, Ckov... EMR is Yordan, I imagine he will do the same thing as he did for TOF. So my feeling is that the call should be done in C++, rather than Python.

I/we may need to write a config db C++ call which returns a Json::Value (which stores the python dict).

--

Aside:
Humm, thinking some more...

The way I think this will work is that e.g. TOF group hand us a TofCalibration::Update() function which we call from the MAUSRun object. Thinking about it we probably need to be a bit more sophisticated in how we register the calibrations etc. than I specified so far. Say we have something like a MAUSPerRunData abstraction that holds data that needs to be stored per run, each with an Update() function. Then MAUSRun iterates over a list/vector of these objects and does the relevant updating. Specialisations then do stuff like updating MiceModules, fields, etc. I should add the MAUSPerRunData abstraction to the specification (okay?)
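
Roughly what I have in mind, sketched in Python for brevity (the real thing would presumably be C++). The class and method names here (`PerRunData`, `Run`, `register`, `set_run_number`) are placeholders, not a proposed API:

```python
class PerRunData:
    """Base class: data that must be refreshed whenever the run changes."""

    def update(self, run_number):
        raise NotImplementedError


class TofCalibration(PerRunData):
    """Hypothetical specialisation supplied by a detector group."""

    def __init__(self):
        self.run_number = None

    def update(self, run_number):
        # A real implementation would load the calibration for this run.
        self.run_number = run_number


class Run:
    """Holds the current run number and the registered per-run data."""

    def __init__(self):
        self.run_number = None
        self._per_run_data = []

    def register(self, item):
        self._per_run_data.append(item)

    def set_run_number(self, run_number):
        if run_number == self.run_number:
            return  # same run: nothing to update
        self.run_number = run_number
        for item in self._per_run_data:
            item.update(run_number)


run = Run()
tof = TofCalibration()
run.register(tof)
run.set_run_number(3541)
```

Each detector group then only has to supply its own specialisation and register it.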

#9

Updated by Rogers, Chris over 9 years ago

  • File Globals_Handling.odt added

Another version of specification...

#10

Updated by Rogers, Chris over 9 years ago

Help if I upload the correct doc...

#11

Updated by Rogers, Chris over 9 years ago

  • File deleted (Globals_Handling.odt)
#12

Updated by Rogers, Chris over 9 years ago

  • Assignee changed from Tunnell, Christopher to Rogers, Chris
#13

Updated by Rogers, Chris about 9 years ago

Status as of release 0.3.1

The majority of the code has been implemented. I changed the implementation slightly from the specification.

Globals

Globals C++ class provides accessors to pointers to objects that are required in more than one MAUS module. I have in mind here things like field maps, geometries, calibration and cabling, error handlers, etc. GlobalsManager was also added providing routines to mutate the Globals. The reason for this split implementation was that I didn't want low level things, that want to access low level objects (like ErrorHandler, etc) to need to know about high level things like FieldMaps or Geant4. So the Globals can exist using forward declaration, but in order to allocate memory we need to know about the actual implementation. Initialisation and Deletion is handled in the GlobalsManager. At the moment I have left the legacy MICERun in the code base, but this should replace it.

I haven't put the Python ErrorHandler in yet. Not sure if we need it because python doesn't have the problems with initialisation order that C++ has (initialisation order is strictly in the order modules are imported)

JobHeader/JobFooter

Nothing done on JobHeader or JobFooter - need something here.

RegisterRunActions

Nothing done here

RunActionManager

RunActionManager provides a hook to update the globals. This holds a list of RunActions each of which has a StartOfRun and EndOfRun function that is called by the RunActionManager. At the start of each run, MAUS then should do the following, in order. The hooks for developers to contribute RunActions have been written, but the call has not been added to MAUS.

I considered here putting the RunActionManager in the Module's Birth() and Death() but decided that either (a) we call StartOfRun()/EndOfRun() in every module, leading possibly to unnecessary duplication of the call (b) We invent special modules, which is difficult to implement for the Reduce-Output. Now it turns out that Mike Jackson already calls a StartOfRun() and EndOfRun() in the workers called from top level, so I plan to call the RunActionManager there - potential problem is I need to access run number which doesn't come with that call. Potential other problem is that I need to return the RunHeader/RunFooter (one from each node), which I haven't done yet or figured out how to do.

Single Threaded

  1. Call RunActionManager::StartOfRun (Not implemented)
  2. Birth Inputter
  3. Birth Mapper
  4. Birth Reducer
  5. Birth Outputter
  6. Process
  7. Death Outputter
  8. Death Reducer
  9. Death Mapper
  10. Death Inputter
  11. Call RunActionManager::EndOfRun (Not implemented)

Input-Transform

  1. Call RunActionManager::StartOfRun (Not implemented)
  2. Birth Inputter
  3. Birth Mapper
  4. Process
  5. Death Mapper
  6. Death Inputter
  7. Call RunActionManager::EndOfRun (Not implemented)

Reduce-Output

  1. Call RunActionManager::StartOfRun (Not implemented)
  2. Birth Reducer
  3. Birth Outputter
  4. Process
  5. Death Outputter
  6. Death Reducer
  7. Call RunActionManager::EndOfRun (Not implemented)
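
The single-threaded ordering above can be sketched as follows. `LoggingModule` and `run_single_threaded` are hypothetical stand-ins used only to show the call order; they are not the actual Go.py code:

```python
class LoggingModule:
    """Stand-in for a MAUS module that records its birth and death calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def birth(self):
        self.log.append("Birth " + self.name)

    def death(self):
        self.log.append("Death " + self.name)


def run_single_threaded(modules, start_of_run, end_of_run, process):
    """Call the hooks in the order listed for single-threaded mode."""
    start_of_run()
    for module in modules:            # inputter, mapper, reducer, outputter
        module.birth()
    process()
    for module in reversed(modules):  # deaths in reverse order of births
        module.death()
    end_of_run()


log = []
modules = [LoggingModule(name, log)
           for name in ("Inputter", "Mapper", "Reducer", "Outputter")]
run_single_threaded(
    modules,
    start_of_run=lambda: log.append("StartOfRun"),
    end_of_run=lambda: log.append("EndOfRun"),
    process=lambda: log.append("Process"),
)
```

The Input-Transform and Reduce-Output cases follow the same pattern with a shorter module list.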

The RunActionManager::StartOfRun/EndOfRun create a RunHeader/RunFooter object that are sent to the Outputter, one for each object. The RunHeader and RunFooter at the moment only contain the run number. However, they should contain at least all transient information about the run (i.e. information that might change if I reran the same job with the same code version at a different time). Foreseen are: cabling unique ID, calibration unique ID, geometry unique ID, time stamp in the run header; memory usage, timestamp, error summary in the run footer.

JsonCppConverters have not been written for the RunHeader/RunFooter object.

Python bindings

A new area, src/py_cpp/ has been created to put python bindings to MAUS features (typically objects referenced by Globals). Each file in this area is built by scons into a shared object file and placed into build/maus_cpp/*.so, linked against libMausCpp.so. These should be importable as Python libraries. Three files have been added:

  1. maus_cpp.globals contains routine to initialise the globals
  2. maus_cpp.field contains routine to access the field map
  3. maus_cpp.run_action_manager contains routine to call the run action manager start of run, end of run. This should return the RunHeader and RunFooter but at the moment does not - I couldn't figure out how to use ROOT to convert to a PyObject* (TPython.h looks like it is helpful, but for some reason symbols were missing at linker stage - maybe missing implementation?)

Summary of Modifications to Specification

The MAUSRun was renamed to RunActionManager. All accessors to data were moved into Globals.
The MAUSPerRunData was renamed to RunActionBase.
The MAUSProcess object was renamed to Globals and GlobalsManager, and implementation split as discussed above
The StartOfRun, StartOfJob, EndOfRun, EndOfJob were renamed to RunHeader, JobHeader, RunFooter, JobFooter

Summary of Remaining Actions

  1. Globals
    1. Remove legacy MICERun; deal with fall out
  2. JobHeader/JobFooter
    1. need python functions for initialisation (probably in Go.py)
    2. additional function for writing into the output stream (probably need to modify the API for Outputters to handle different event types)
    3. Implement relevant bindings in the data structure and Json/Cpp conversion
  3. StartOfRun/EndOfRun
    1. The RunHeader/RunFooter need to be returned from the StartOfRun/EndOfRun
    2. additional function for writing into the output stream (probably need to modify the API for Outputters to handle different event types)
    3. Implement additional Json/Cpp conversion
  4. Ping devs to consider adding a python version for RunActionBase
#14

Updated by Rogers, Chris about 9 years ago

1. Removed legacy MICERun from the non-legacy (and some legacy) code in r795. If I find time I may dip into the legacy code and remove it altogether.

#15

Updated by Rogers, Chris about 9 years ago

Added generalised function to OutputCppRoot to write out any object (a) inheriting from MAUSEvent and (b) having an appropriate Converter. Implemented and checked outputter for JobHeader.

Now need to:

  • Implement JobFooter, RunHeader, RunFooter
  • Implement interface for everything to Go (partial implementation of the JobHeader is in place)
  • Implement serialisation for InputCppRoot and OutputCppRoot (at the moment trees are read/written in a non-serial way in ROOT)
  • Implement in InputPyJson and OutputPyJson
  • Document

The question I wondered about - do we make specific functions like "read/write_spill", "read/write_header"? Or do we make general function like "read_event"? Do we always read and write serially - i.e.

for event in events: 
  read_event()

or do we read and write explicitly header and footer i.e.

read_job_header()
for run in runs:
  read_run_header()
  for event in events: 
    read_spill()
  read_run_footer()
read_job_footer()

Json naturally leans towards the former approach; root (where different types are stored in separate places) naturally pulls towards the latter approach. Worth sleeping on, but probably want our two data formats handled in the same way...

#16

Updated by Rogers, Chris about 9 years ago

So looks like it is all working now in single_threaded mode. I can read and write

  • JobHeader
  • RunHeader
  • Spill
  • RunFooter
  • JobFooter

and perform conversions between C++ and JSON representation. In the end I decided to serialise, because it is easier (coding) and faster (execution time) to serialise a parallel (ROOT-style) input than to parallelise a serial (JSON-style) input. Generation of job header and job footer is done by Go (I decided it didn't warrant a separate class).
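
As an illustration of the serialising direction, a sketch that flattens separately stored headers, spills and footers into one event stream. The function `serialise`, the dict keys and the event type labels are assumptions for the example, not the actual InputCppRoot implementation:

```python
def serialise(job_header, runs, job_footer):
    """Yield (event_type, payload) pairs as a single serial stream.

    runs is a list of dicts with hypothetical keys
    'header', 'spills' and 'footer'.
    """
    yield ("JobHeader", job_header)
    for run in runs:
        yield ("RunHeader", run["header"])
        for spill in run["spills"]:
            yield ("Spill", spill)
        yield ("RunFooter", run["footer"])
    yield ("JobFooter", job_footer)


# Hypothetical input where each event type is stored in its own place.
runs = [
    {"header": {"run_number": 1}, "spills": ["s1", "s2"], "footer": {"run_number": 1}},
    {"header": {"run_number": 2}, "spills": ["s3"], "footer": {"run_number": 2}},
]
event_types = [kind for kind, _ in serialise({}, runs, {})]
```

Going the other way would mean buffering and sorting a serial stream by type before anything could be written, which is the harder direction.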

Note I had to do the inputter and outputter birth, which is per job, separately to the mapper, reducer birth, which is per run. This is a slight change to what was written above.

Final clean up now:

  • I've only implemented I/O in single threaded mode, need to implement appropriate run loops in multithreaded also
  • Need to add an option to not append headers and footers - e.g. for when we just want to make a direct copy of an input to the output for format changes etc. Normally Go does the job of appending headers and footers automagically.

Something I'm brewing on - at the moment JobFooter is only accessible by Go.py, but this makes it a bit useless (if no one can put data in, what's the point). Is there a better way? JobHeader is only accessible by Go.py but that's okay; RunHeader and RunFooter are accessible from the StartOfRunAction and EndOfRunAction - they need to be available to birth and death methods also somehow (some e.g. calibration initialisation goes on in there)...

#17

Updated by Rogers, Chris almost 9 years ago

So I'm getting ready to merge. The only thing I haven't got is access to StartOfRun and EndOfRun, StartOfJob and EndOfJob from user code except the StartOfRunAction and EndOfRunAction... I think this is something I will need... so better have a think about it.

Also look at documentation...

#18

Updated by Rogers, Chris almost 9 years ago

Check - the StartOfRunActions should occur before module birth() so that modules can update references to e.g. field maps etc. I guess that means EndOfRunAction should be after module death()

#19

Updated by Rogers, Chris almost 9 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

So:

  • Documentation is done.
  • Option to not append headers and footers added (header_and_footer_mode)
  • Multithreaded mode is okay but I came into contact with #1146 - I never resolved it, but json output is okay so I logged it as a bug in OutputCppRoot. We can't use OutputCppRoot in multithreaded mode for the moment. Also #1149.
  • Would like to make accessor for run header information to the map and reduce objects. The correct way I think is to hand the run header at birth(..), but it will be a change to the API that might be a bit awkward.

I think this stuff will have to go in at a later date though.

#20

Updated by Rogers, Chris almost 9 years ago

Committed to trunk as r817

#21

Updated by Rogers, Chris almost 9 years ago

  • Target version changed from Future MAUS release to MAUS-v0.4.0
