Persistency Discussion

(Rogers)

So we have decided to use the Json data format for moving data around. This seems to be okay. An example of the sort of thing we might do:

{
  "mc":{
    "tracks":[],
    "mc_hits":[
      {
        "position":{"x":0.0, "y":0.0, "z":-5000.0},
        "unit_momentum":{"x":0.0, "y":0.0, "z":1.0},
        "energy":226.0,
        "particle_id":-13,
        "energy_deposited":5.5,
        "generator":"tracker",
        "channel_id":{"tracker":1, "station":1, "channel":23}
      },
      {
        "position":{"x":0.0, "y":0.0, "z":-5000.0},
        "unit_momentum":{"x":0.0, "y":0.0, "z":1.0},
        "energy":226.0,
        "particle_id":-13,
        "energy_deposited":0.0,
        "generator":"virtual",
        "channel_id":{"station":1}
      }
    ]
  },
  "errors":{
   "MapPyChrisRogers":["Chris Rogers says hello"], 
   "MapPySomething":["SomeError"]
  }
}

Comments:

  • This data is basically Python code with subtle differences (e.g. booleans have a lower-case first letter, as in "true" not "True").
  • Persistency is read into Python using the json module.
  • Persistency is read into C++ using the Json library. Additional error handling is provided by src/common/Interface/JsonWrapper.hh (r375), though it might be easier just to do it yourself.
  • If we want to use another format (ROOT is the obvious one), we make a converter Map.
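
As a minimal sketch of the first two comments, this is roughly what reading Json into Python with the standard json module looks like. The fragment and the "good" key are made up for illustration; they are not part of any agreed spill format.

```python
import json

# Hypothetical fragment in the style of the example above; "good" is an
# invented key used only to show the boolean spelling.
text = '{"mc": {"mc_hits": [{"energy": 226.0, "particle_id": -13, "good": true}]}}'

spill = json.loads(text)
hit = spill["mc"]["mc_hits"][0]

# Json "true" becomes Python True; numbers become float or int.
assert hit["good"] is True
assert isinstance(hit["energy"], float)
assert isinstance(hit["particle_id"], int)

# Writing back out restores the Json spelling of the boolean.
round_trip = json.dumps({"good": hit["good"]})
```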

Criticisms:

  • There does not seem to be a mechanism for handling pointers/references. If someone wants to persist pointers we will have to work something out (probably convert to an int). Comment from Tunnell: pointers are part of the path to the dark-side. Don't give in! They will lead to headaches when, as happens in all experiments, people start wanting ntuples (or TNtuples).
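
If we did go down the "convert to an int" route, one possible sketch is to store the position of the referenced object in its list instead of the reference itself. Everything here (the "track" and "track_index" keys, the helper names) is hypothetical and only illustrates the idea; it is not an agreed design.

```python
import json

# Hypothetical in-memory objects: each hit holds a reference to its track.
tracks = [{"track_id": 7}, {"track_id": 9}]
hits = [{"energy": 226.0, "track": tracks[1]},
        {"energy": 210.0, "track": tracks[0]}]

def hits_to_json(hits, tracks):
    """Replace each object reference by the index of the track in `tracks`."""
    index_of = {id(track): i for i, track in enumerate(tracks)}
    return [{"energy": h["energy"], "track_index": index_of[id(h["track"])]}
            for h in hits]

def hits_from_json(hit_data, tracks):
    """Rebuild the references from the stored integer indices."""
    return [{"energy": h["energy"], "track": tracks[h["track_index"]]}
            for h in hit_data]

text = json.dumps({"tracks": tracks, "mc_hits": hits_to_json(hits, tracks)})
loaded = json.loads(text)
restored = hits_from_json(loaded["mc_hits"], loaded["tracks"])
```

The index is only meaningful together with the list it points into, which is part of why this gets painful once people want flat ntuples.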

Schemas

Json handles file IO, but it does not - and should not - tell us anything about the data structure. However, we want to define a data structure. For the example above, we have at the top level an "mc" branch and an "errors" branch. The "mc" branch consists of a "tracks" branch and an "mc_hits" branch. The "mc_hits" branch is a list of hits. Each hit has a "position", "unit_momentum", "energy", etc. Each "position" and "unit_momentum" is a three-vector of "x", "y", "z".

What happens if two different developers decide that the data structure should look different? E.g. someone thinks hits belong in the track, someone else thinks they deserve their own branch... and then they try to make their code talk to each other and ... KABOOM!

To avoid this situation we invent a schema that defines a commonly agreed format of the Json data structure. We all use the schema to make sure our code can talk to the others.
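
To make the idea concrete, here is a small stand-alone sketch of checking data against a schema. It is not the validator we actually use; the check function is written from scratch for illustration, handles only a handful of types, and assumes every listed property is required.

```python
def check(data, schema, path="spill"):
    """Return a list of error strings; an empty list means data matches."""
    types = {"object": dict, "array": list, "number": (int, float),
             "string": str, "integer": int}
    if not isinstance(data, types[schema["type"]]):
        return ["%s: expected %s" % (path, schema["type"])]
    errors = []
    if schema["type"] == "object":
        # Assumption: every property named in the schema must be present.
        for name, sub in schema.get("properties", {}).items():
            if name not in data:
                errors.append("%s: missing '%s'" % (path, name))
            else:
                errors += check(data[name], sub, path + "." + name)
    elif schema["type"] == "array" and "items" in schema:
        for i, item in enumerate(data):
            errors += check(item, schema["items"], "%s[%d]" % (path, i))
    return errors

three_vector = {"type": "object", "properties": {
    "x": {"type": "number"}, "y": {"type": "number"}, "z": {"type": "number"}}}
hit_schema = {"type": "object", "properties": {
    "position": three_vector, "unit_momentum": three_vector,
    "energy": {"type": "number"}}}
spill_schema = {"type": "object", "properties": {
    "mc": {"type": "object", "properties": {
        "tracks": {"type": "array"},
        "mc_hits": {"type": "array", "items": hit_schema}}}}}

good = {"mc": {"tracks": [], "mc_hits": [
    {"position": {"x": 0.0, "y": 0.0, "z": -5000.0},
     "unit_momentum": {"x": 0.0, "y": 0.0, "z": 1.0},
     "energy": 226.0}]}}
# A developer who put the hits inside a track instead of in "mc_hits".
bad = {"mc": {"tracks": [{"hits": []}]}}

good_errors = check(good, spill_schema)
bad_errors = check(bad, spill_schema)
```

The second developer's layout is rejected with a message naming the missing branch, which is the KABOOM caught early.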

Json Schema

At time of writing (Spring 2011) the concept of Json schemas is about 18 months old - i.e. it's all fairly fresh. There exists a common definition for what a Json schema should look like and some rather raw code for actually verifying the schemas. validate.py doesn't seem to have any active maintainer; it's something we might want to take ownership of.

Validating MAUS Schema

We use the validator.py code to do the validation of our Json schema (MapPyValidateSpill.py). This calls a somewhat hacked version of validate.py and a schema defined in src/core/SpillSchema.py. The SpillSchema is itself verified in test_MapPyValidateSpill.py, using src/core/SchemaSchema.py (a schema-validating schema).

I started out using validator.py for the Json validation, but it is unsupported, so I moved to something called validictory (thanks to Tunnell). Within about 5 minutes I found a bug, which makes me nervous, but at least the code is supported so someone else might fix it.

Further Thought About Schemas

Implicitly here, we are asking developers to define their schema. At the moment, if they want to persist their data type, they have to open "SpillSchema.py" and edit it. One might be able to automagically generate a schema by defining a function

static std::string GetSpillSchema();

on each of the persistable objects. Then, for example, to define the schema for the stuff above, we do:

// Defined outside the class body, so "static" appears only in the declaration.
std::string MC::GetSpillSchema() {
  // Each sub-schema is assumed to be a "name":{...} property fragment.
  std::string eh_schema = ErrorHandler::GetSpillSchema();
  std::string tracks_schema = Tracks::GetSpillSchema();
  return "{\"type\":\"object\", \"properties\":{" + eh_schema + "," + tracks_schema + "} }";
}

Then we define the schema next to the code that contains the data.

  • How do we do abstraction/inheritance stuff?
  • How do we do python/C++ interface?
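
On the Python side, the same composition could be sketched with dicts instead of strings, which avoids the quote escaping entirely. The class and method names simply mirror the C++ above and the branch contents are assumed; none of this exists in the code base.

```python
import json

class ErrorHandler:
    @staticmethod
    def get_spill_schema():
        # Property fragment for the "errors" branch (contents assumed).
        return {"errors": {"type": "object"}}

class Tracks:
    @staticmethod
    def get_spill_schema():
        # Property fragment for the "tracks" branch (contents assumed).
        return {"tracks": {"type": "array"}}

class MC:
    @staticmethod
    def get_spill_schema():
        # Merge the fragments supplied by the classes that own the data.
        properties = {}
        properties.update(ErrorHandler.get_spill_schema())
        properties.update(Tracks.get_spill_schema())
        return {"type": "object", "properties": properties}

# Serialise once at the end, so the result is always well-formed Json.
schema_text = json.dumps(MC.get_spill_schema(), sort_keys=True)
```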

An Alternative

So the mentality for the thing discussed above is that we give developers a lot of freedom and we check everything at run time. If I want to get hits from my tracker, I do something like (C++)

Json::Value hit = spill["mc"]["hits"]["scifi"][0];
if (hit.type() != Json::objectValue)
  throw(Squeal(blah blah));

or whatever, depending on the schema. The alternative, that we explicitly aren't doing, is that we check everything at compile time - i.e. we do something like (C++)
Json::Value hit = spill->mc->hits->scifi[0];

Now, because we are really calling C++ data and functions, we get compile-time checking. If I make a typo, the thing won't compile. This makes Rogers nervous. It's basically the untyped mentality (Python) vs the typed mentality (C++). It puts a lot of pressure on management to ensure developers "toe the line" - and who has time for that?

An Orthogonal Alternative

Basically this is the same as XML, but with different formatting. Why not use XML?

XML equivalent:

<mc>
    <hit>
      <position x="0.0" y="0.0" z="-5000.0"/>
      <unit_momentum x="0.0" y="0.0" z="1.0"/>
      <energy value="226.0"/>
      <particle_id value="-13"/>
      <energy_deposited value="5.5"/>
      <generator value="tracker"/>
      <channel_id tracker="1" station="1" channel="23"/>
    </hit>
    <hit>
      <position x="0.0" y="0.0" z="-5000.0"/>
      <unit_momentum x="0.0" y="0.0" z="1.0"/>
      <energy value="226.0"/>
      <particle_id value="-13"/>
      <energy_deposited value="0.0"/>
      <generator value="virtual"/>
      <channel_id station="1"/>
    </hit>
</mc>
Json advantages
  • Less verbose
  • pythonic
  • Tunnell says xml parsers are a pain to use. I have no experience so can't comment.
XML advantages
  • More mature tool set
    • Automatic converters from xml schema <-> C++ classes
    • (immature) ROOT thing if we really want
    • Automatic inline validation in text editors
  • GDML => already going down XML route...

The consensus seems to be that Json is a bit better for pushing data and XML is a bit better for pushing "documents", but it is not clear what that really means.
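
For a feel of the difference in practice, here is a rough sketch that reads a cut-down version of the XML hit with Python's standard xml.etree.ElementTree and builds the corresponding Json layout. The helper hit_to_dict and the choice of fields are illustrative only. Note that every XML attribute arrives as a string and has to be converted by hand, whereas the json module hands back numbers directly.

```python
import json
import xml.etree.ElementTree as ET

# Cut-down version of the XML equivalent above.
xml_text = """
<mc>
  <hit>
    <position x="0.0" y="0.0" z="-5000.0"/>
    <energy value="226.0"/>
    <particle_id value="-13"/>
    <generator value="tracker"/>
  </hit>
</mc>
""".strip()

def hit_to_dict(hit):
    """Convert one <hit> element into the dict layout of the Json example."""
    position = hit.find("position")
    return {
        "position": {axis: float(position.get(axis)) for axis in "xyz"},
        "energy": float(hit.find("energy").get("value")),
        "particle_id": int(hit.find("particle_id").get("value")),
        "generator": hit.find("generator").get("value"),
    }

root = ET.fromstring(xml_text)
spill = {"mc": {"mc_hits": [hit_to_dict(h) for h in root.findall("hit")]}}
json_text = json.dumps(spill)
```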

TODO

  • Pointer persistency
  • Data cards schema
  • ROOT converter
  • Schema generator?

Updated by Tunnell, Christopher over 10 years ago · 24 revisions