PC050613 new online tests

New online tests were added to the testing framework to check that we can run the online reconstruction. Several bugs were resolved. The aim is to make a test at integration level to check that we can correctly integrate:

  • celery: multiprocessing on several different computers in parallel
  • input-map: run on multiple celery nodes and write reconstructed data to mongodb
  • mongodb: transient data store for reconstructed MICE data
  • reduce-output: read from mongodb and make a bunch of png files
  • maus-app: web viewer for viewing output plots

This addresses issues

celery was reverted to version 2.5.5 from 3.x; 3.x appeared to run much more slowly and the cause was not identified
  • As a corollary, the easy_install bash script third_party/bash/40python_extras.bash was tidied up; it is now controlled by a command line switch.
  • Unfortunately easy_install has no way of uninstalling or down-versioning packages, and the bash script has no good version control. So I just deleted all the packages (including celery 3.x) pulled down from the third_party tarball on micewww. This means the development branch relies on the easy_install servers again until the next time I do a release.
analyze_data_online had some improvements. This is the code that provides the launcher application for the various reconstruction processes
  • We now wait for reducers to finish processing data before killing the process
  • We check for celeryd processes and kill them before running
  • I fixed a bug in the mongodb monitoring function (it looks like the output format of validate has changed from the online version; I at least catch the resultant exception)
  • Arguments on the command line get passed to MAUS processes (both input-transform and merge-reduce)
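
The "wait for reducers to finish before killing" behaviour can be illustrated with a minimal sketch. This is not the analyze_data_online code; it assumes Python 3 and the standard subprocess module, and the names stop_after_finish, timeout_s and demo are hypothetical.

```python
import subprocess
import sys


def stop_after_finish(proc, timeout_s=30.0):
    """Give a child process time to finish, then kill it if it has not.

    Returns the exit code of the process.
    """
    try:
        # Let the process drain its remaining input and exit on its own.
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Still running after the grace period: stop it forcibly.
        proc.kill()
        return proc.wait()


def demo():
    # Stand-in for a reducer: a child Python process that exits by itself.
    child = subprocess.Popen([sys.executable, "-c", "print('reducer done')"])
    return stop_after_finish(child, timeout_s=30.0)
```

A process that exits within the grace period returns its own exit code; one that does not is killed, so the launcher never hangs indefinitely on a stuck reducer.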
I changed the way we handle some of the details in pymongo and the general online process handling
  • I tried harder to purge mongodb at the beginning of a run
    • Previously we dropped (deleted) the "collection" (equivalent to e.g. an SQL table) at the beginning of the run. This did not appear to clear the database adequately, leading to errors and possible cross talk between runs and even between separate executions of the reconstruction (the DB can persist between jobs).
    • Now we drop (delete) the entire "database" at the beginning of each run.
    • I never tested multiple consecutive runs, so I worry that the end of run may not come through properly. Needs a test here.
  • I changed the way we handle KeyboardInterrupt in the merge_output; now merge_output will continue reading off the database until it has read all documents in the database. This corresponds to processing the last run.
  • I catch pymongo errors; they are printed to stderr but we keep going. pymongo errors can occur e.g. because we started a new run and dropped the database while merge_output is still trying to reconstruct the old run. merge_output will hit the exception from pymongo and handle it properly (but print to stderr).
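
The interrupt and error handling described above can be sketched as follows. This is an illustration only, not the merge_output code: drain, fetch_next, handle, db_errors and max_consecutive_errors are hypothetical names, and the database is replaced by a plain callable so the sketch runs without mongodb. With pymongo one would presumably pass pymongo.errors.PyMongoError as db_errors, and the whole-database purge corresponds to MongoClient.drop_database. The cap on consecutive errors is my own addition so the loop cannot spin forever.

```python
import sys


def drain(fetch_next, handle, db_errors=(Exception,), max_consecutive_errors=10):
    """Read documents until fetch_next() returns None.

    A KeyboardInterrupt does not abort the loop; it only records that a stop
    was requested, so the remaining documents are still processed.
    Errors listed in db_errors are printed to stderr and reading continues.
    Returns (number_handled, stop_requested).
    """
    handled = 0
    stop_requested = False
    consecutive_errors = 0
    while True:
        try:
            doc = fetch_next()
            if doc is None:
                break  # nothing left in the store
            handle(doc)
            handled += 1
            consecutive_errors = 0
        except KeyboardInterrupt:
            # Remember the request but keep reading what is left.
            stop_requested = True
        except db_errors as err:
            sys.stderr.write("database error, continuing: %s\n" % err)
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                break  # assumption: give up rather than loop forever
    return handled, stop_requested
```

Note that a document whose handling is itself interrupted is not retried in this sketch.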
I added/modified tests to run the online reconstruction and parallel mode in general
  • tests/integration/test_distributed_processing/ will check that online reconstruction is installed properly and fail if there is a problem. This is only run if specified; all the other tests should skip if online is not installed properly.
  • tests/integration/test_distributed_processing/ will run an MC job in multiprocessing mode and check that the MC job returns the correct number of spills and that the run numbers are correct. Here we are not looking to validate the MC; we only want to check that there are no duplicate or missing events, that the start and end of run are set up properly, etc.
  • tests/integration/test_analyze_data_online/ will run and check that the reconstructed output is the same as some reference data
    • maus-app is now a third_party package (installed at third_party/install/lib/maus-apps-<version>). This is the thing that makes a web browsable output from online stuff. For now I just want to run without crashing.
    • InputCppDAQOnline was modified to enable setting offline file mimic from datacards
    • A third_party script was added that downloads a data file for online reconstruction. I randomly chose run 04235.
    • ReducePyCkov and ReducePyTOF were modified to output a big root file containing all of the reducer histograms as well as the regular PNG files
    • A reference set of histograms was added; online code attempts to reconstruct 04235.000 and compare reconstructed output with reference output (KS test). Test fails if the reconstructed output is wrong.
  • passing in online mode on heplnm071, failing in offline mode on heplnm070
  • I would like to modify InputCppDAQOnline further to string together multiple reconstruction runs; this can test the end of run and start of run procedure, and we can also use it to load test the online reconstruction.
  • Need to provide updated documentation for this code once it is ready for deployment (need to make a release cycle first).
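
The comparison of reconstructed histograms against the reference set can be sketched as below. This is a pure-Python stand-in, not the test's actual implementation: it computes only the Kolmogorov-Smirnov distance between two histograms with identical binning, not a KS probability, and the names ks_distance, histograms_match and the tolerance value are hypothetical.

```python
def ks_distance(hist_a, hist_b):
    """Largest gap between the normalised cumulative sums of two histograms.

    Both arguments are sequences of bin contents with identical binning.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a = float(sum(hist_a))
    total_b = float(sum(hist_b))
    if total_a <= 0.0 or total_b <= 0.0:
        raise ValueError("histograms must not be empty")
    cum_a = 0.0
    cum_b = 0.0
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a / total_a
        cum_b += b / total_b
        distance = max(distance, abs(cum_a - cum_b))
    return distance


def histograms_match(reference, reconstructed, tolerance=0.05):
    """True if the KS distance is within the (assumed) tolerance."""
    return ks_distance(reference, reconstructed) <= tolerance
```

Because the histograms are normalised before comparison, the check is sensitive to shape only, not to the total number of entries.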

Updated by Rogers, Chris almost 11 years ago · 15 revisions