Data Structure

As I mentioned at the last MAUS phone call, I am in the process
of reviewing the MAUS data structure. I wanted to share a few thoughts
and understand what folks view as priorities.

I uploaded a few slides summarising
  • the existing data structure (slide 1);
  • how that might look in the end (slide 2);
  • a proposed data structure that one might consider (slides 3 and 4).

In the slides, a double arrow denotes a many-to-one relationship (I hope I got the direction right).

Currently we slice data into spills and then slice by "reconstruction
step". Typically our detectors have fairly generic reconstruction
steps: first we extract data from the DAQ, then turn that into digits
(PMT hits), then associate digits into groups (clusters or slab hits),
then make space points and finally tracks.
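
As a rough illustration of this slicing, one spill in the existing layout could be pictured as a mapping from reconstruction step to a flat list of entries. This is only a sketch: the key names (`daq_data`, `digits`, `slab_hits`, `space_points`) and the fields inside each entry are assumptions made for the example, not the actual MAUS schema.

```python
# Hypothetical layout of one spill in the existing structure.
# Branch and field names are illustrative, not the real MAUS schema.
spill = {
    "spill_number": 1,
    "daq_data": [{"channel": 3, "adc": 212}, {"channel": 4, "adc": 198}],
    "digits": [{"channel": 3, "charge": 12.5}, {"channel": 4, "charge": 11.0}],
    "slab_hits": [{"slab": 1, "time": 25.3}],
    "space_points": [{"station": 0, "x": 1.2, "y": -0.4}],
}

# Reconstruction steps in the order the text lists them
# (tracks would be appended once a tracks branch exists).
RECON_STEPS = ["daq_data", "digits", "slab_hits", "space_points"]


def step_sizes(spill):
    """Return the number of entries stored for each reconstruction step."""
    return {step: len(spill.get(step, [])) for step in RECON_STEPS}


print(step_sizes(spill))
```

Note that nothing in this layout says which entries in one branch belong with which entries in another.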

There is also the concept of a run. This is an implicit part of the data
structure, and is actually represented by a particular set of files on disk
(we don't store the whole run in memory; it would be too big).

Presumably the end point of the existing structure (slide 2) would be
pretty much the same, but with an extra "tracks" branch (once we have
tracker or global reconstruction). Tracker "clusters" would go in the
slab_hits branch I guess, or perhaps we make a new branch.

The problems with the existing data structure:

  • On a management level, the detector subgroups might object to having
    quite a rigid data structure that mixes their data with everyone else's data
  • It's pretty tough to figure out, for each particle event, which
    digits/slab_hits/clusters/etc pertain to that particle. I think
    that after the first step in reconstruction has occurred, we really
    want to work on a particle-by-particle basis. Under the current data structure
    the first step in almost any reconstruction or analysis is to search the data
    structure and construct a particle event. So we have set up a data structure
    that we have to fight in order to do any reconstruction - this is madness.
  • It's pretty tough to figure out, for each e.g. slab hit, which
    digits were used to reconstruct that slab hit. One might, for
    example, want to study how many channels contribute to the clusters in
    the tracker (to look for cross-talk or noise).
  • We have no real way to handle per-run calibration data in the
    existing data structure.
  • We have nowhere to store run metadata in the existing data structure
    (configuration, timestamps, code version, etc).

The plus side with the existing data structure:

  • I guess for global reconstruction one probably wants to run e.g. a Kalman
    filter over all the space points. In the existing data structure, it's
    pretty trivial to extract all the space points to do this.
  • We have some code already prepared that works in this structure (TOF
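
The trade-off in the two lists above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the branch names and the per-entry `trigger` field that ties an entry to a particle event are hypothetical, chosen only to show why extracting all space points is trivial while building a particle event needs a search over every branch.

```python
from collections import defaultdict

# Hypothetical spill in the existing layout. The "trigger" field that ties
# an entry to a particle event is an assumption made for this sketch.
spill = {
    "digits": [
        {"trigger": 0, "channel": 3},
        {"trigger": 1, "channel": 7},
        {"trigger": 0, "channel": 4},
    ],
    "slab_hits": [{"trigger": 0, "slab": 1}, {"trigger": 1, "slab": 5}],
    "space_points": [{"trigger": 0, "z": 0.0}, {"trigger": 1, "z": 0.0}],
}


def all_space_points(spill):
    """Plus side: every space point in the spill is a single lookup away."""
    return spill["space_points"]


def group_by_particle(spill):
    """Problem: every branch must be scanned to rebuild each particle event."""
    events = defaultdict(lambda: defaultdict(list))
    for branch, entries in spill.items():
        for entry in entries:
            events[entry["trigger"]][branch].append(entry)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {trigger: dict(branches) for trigger, branches in events.items()}


events = group_by_particle(spill)
print(len(all_space_points(spill)), sorted(events))
```

Any reconstruction or analysis code working per particle would need something like `group_by_particle` before it could start.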

So I propose a new data structure in slides 3 and 4 that might address some of these
issues. Here we split by recon_particle_event (defined by the set of triggers
that caused the event) and then by detector. The Monte Carlo data is still
kept in a separate branch, as this is really very separate from the
reconstruction tree. We should probably make a dedicated phone call sometime,
or perhaps discuss at the next MAUS phone call, but I wanted to throw this idea
around and let people think on it.

Note explicitly that the only difference between slides 3 and 4 is that the mc
branch in the spill is missing for real data.
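
One way the proposed layout might look in code is sketched below. Only `recon_particle_event`, the split by detector and the separate mc branch come from the proposal; the class names, field names, the `tof1` detector key and the idea of storing digit indices on each slab hit (to record the many-to-one link from digits to a slab hit) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SlabHit:
    slab: int
    # Indices into the owning DetectorEvent.digits list, so a hit records
    # which digits it was built from (many digits to one slab hit).
    digit_indices: List[int] = field(default_factory=list)


@dataclass
class DetectorEvent:
    digits: List[dict] = field(default_factory=list)
    slab_hits: List[SlabHit] = field(default_factory=list)
    space_points: List[dict] = field(default_factory=list)


@dataclass
class ReconParticleEvent:
    triggers: List[int]  # the set of triggers that define this event
    detectors: Dict[str, DetectorEvent] = field(default_factory=dict)


@dataclass
class Spill:
    recon_events: List[ReconParticleEvent] = field(default_factory=list)
    mc_events: Optional[List[dict]] = None  # left as None for real data


def digits_for_hit(detector_event, hit):
    """Return the digits that were used to build one slab hit."""
    return [detector_event.digits[i] for i in hit.digit_indices]


def all_space_points(spill):
    """Global reconstruction now needs a loop over events and detectors."""
    return [
        point
        for event in spill.recon_events
        for detector in event.detectors.values()
        for point in detector.space_points
    ]


tof1 = DetectorEvent(
    digits=[{"channel": 3}, {"channel": 4}, {"channel": 9}],
    slab_hits=[SlabHit(slab=1, digit_indices=[0, 1])],
    space_points=[{"z": 0.0}],
)
spill = Spill(recon_events=[ReconParticleEvent(triggers=[0], detectors={"tof1": tof1})])
print(digits_for_hit(tof1, tof1.slab_hits[0]))
```

In this sketch the per-particle grouping comes for free, while collecting every space point in the spill costs a short nested loop, which is the reverse of the existing structure.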

  • Calibration data structure - clarify
  • Reduced data structure - clarify

Updated by Rogers, Chris about 9 years ago · 5 revisions