h1. Online Reconstruction Overview

"Manchego" is the online reconstruction component of MAUS. Manchego stands for "Manchego is Analysis No-nonsense Controlroom Helping Existentially Guided Onlineness".

The "Overview talk":https://micewww.pp.rl.ac.uk/attachments/816/20120209-MAUS-SSI-Status.ppt (09/02/12) contains a summary of the main concepts and architecture.

h2. Parallel transformation of spills

Each spill is transformed, by which we mean the data within the spill is analysed in various ways and derived data is added to the spill. Each type of analysis is carried out by a "transformer". Each spill is independent, so spills can be transformed in parallel.

Celery, a distributed asynchronous task queue for Python, is used to support parallel transformation of spills. Celery uses RabbitMQ as a message broker to handle communication between clients and Celery worker nodes.

A Celery worker executes one or more sub-processes, each of which executes tasks (in the case of MAUS, the transforms). By default Celery spawns N sub-processes, where N is the number of CPUs on the host, but N can also be set explicitly by the user.

Multiple Celery workers can be deployed, each with one or more sub-processes. Celery therefore supports highly-parallelisable applications.

MAUS configures Celery as follows:

* The framework uses reflection to get the names of the transforms the user wants to use, e.g. @MapPyGroup(MapPyBeamMaker, MapCppSimulation, MapCppTrackerDigitization)@. This is termed a "transform specification".
* A Celery broadcast is invoked, passing the transform specification, the MAUS configuration and a configuration ID (e.g. the client's process ID). A sketch of such a broadcast is given after this list.
* Celery broadcasts are received by all Celery workers registered with the RabbitMQ message broker.
* On receipt of the broadcast, each Celery worker:
** Checks that the client's MAUS version is the same as the worker's. If not, an error is returned to the client.
** Forces the transform specification down to each sub-process.
** Waits for the sub-processes to confirm receipt.
** If all sub-processes update correctly then a success message is returned to the client.
** If any sub-process fails to update then a failure message, with details, is returned to the client.
** Each Celery sub-process:
*** Invokes @death@ on the existing transforms, to allow clean-up to be done.
*** Updates its configuration to be the one received.
*** Creates new transforms as specified in the transform specification.
*** Invokes @birth@ on these with the new configuration.
*** Confirms with the Celery worker that the update has been done.
** Celery workers and sub-processes catch any exceptions they can, to prevent the sub-processes or, more seriously, the Celery worker itself crashing in an unexpected way.
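
For illustration, here is a minimal sketch of how such a broadcast might be invoked from the client side. This is not the actual MAUS code: the command name @birth@, the argument keys and the app set-up are assumptions.

<pre>
# Hypothetical client-side broadcast; the command name and argument keys
# are illustrative, not the real MAUS names.
import os
from celery import Celery

app = Celery(broker="amqp://guest@localhost//")

maus_configuration = "{}"  # stand-in for the real MAUS configuration document

# Broadcast the transform specification to all workers registered with
# RabbitMQ; reply=True collects an acknowledgement from each worker so
# success and failure messages are visible to the client.
replies = app.control.broadcast(
    "birth",
    arguments={
        "transform": "MapPyGroup(MapPyBeamMaker, MapCppSimulation)",
        "configuration": maus_configuration,
        "config_id": os.getpid(),  # e.g. the client's process ID
    },
    reply=True,
    timeout=60)

for reply in replies:
    print(reply)  # one success/failure message per worker
</pre>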

MAUS uses Celery as follows:

* A Celery client-side proxy is used to submit the spill to Celery. It returns an object which can be used to poll the status of the "job".
* The client-side proxy forwards the spill to RabbitMQ.
* RabbitMQ forwards this to a Celery worker.
* The Celery worker picks a sub-process.
* The sub-process executes the current transform on the spill.
* The result spill is returned to the Celery worker, and from there back to RabbitMQ.
* The MAUS framework regularly polls the status of the transform job until its status is successful, in which case the result spill is available, or failed, in which case the error is recorded but execution continues. A sketch of this submit-and-poll loop is given below.
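
As a sketch, submitting a spill and polling its status might look like the following, assuming a task named @execute_transform@ is registered on the workers (the real MAUS task name and result backend may differ):

<pre>
# Hypothetical submit-and-poll loop; task name and backend are assumptions.
import time
from celery import Celery

app = Celery(broker="amqp://guest@localhost//", backend="rpc://")

spill = '{"spill_num": 1}'  # stand-in for a real spill document

# The client-side proxy: send_task returns an AsyncResult that can be
# used to poll the status of the "job".
async_result = app.send_task("execute_transform", args=[spill])

while not async_result.ready():  # poll until success or failure
    time.sleep(0.1)

if async_result.successful():
    result_spill = async_result.result  # the transformed spill is available
else:
    print(async_result.traceback)  # record the error, carry on regardless
</pre>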

h2. Document-oriented database

Once spills have been transformed, they are stored in MongoDB, a document-oriented database. The database represents the interface between the input-transform and merge-output phases of a spill processing workflow.

Use of a database allows many merge-output clients to use the same data.

The MAUS framework is given the name of a collection of spills and reads these in order of the dates and times they were added to the database. It passes each spill to a merger and then takes the output of the merger and passes it to an outputter.

h2. Histogram mergers

Histogram mergers aggregate spill data and update histograms. Super-classes are provided for the supported graph packages:

* Matplotlib - @ReducePyMatplotlibHistogram@
* PyROOT - @ReducePyROOTHistogram@

Examples include:

* @ReducePyHistogramTDCADCCounts@
* @ReducePyTOFPlot@ (Durga)

Mergers do not display the histograms themselves.

Configuration options:

* Image type, e.g. EPS, PNG, JPG, ...
* Refresh rate, e.g. output every spill, or every N spills.
* Auto-numbering of the image tag.

Each merger outputs a JSON document containing:

* Base64-encoded image data.
* An image tag, used for a file name.
* Meta-data, e.g. an English description.
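
As an illustrative (not definitive) sketch, a matplotlib-based merger might package a histogram into such a JSON document as follows; the field names used here are assumptions rather than the exact MAUS schema:

<pre>
# Sketch of packaging a histogram as base64-encoded image data in JSON.
# Field names ("image", "tag", "image_type", "data") are assumptions.
import base64
import io
import json

import matplotlib
matplotlib.use("Agg")  # headless back-end: mergers do not display histograms
import matplotlib.pyplot as plt

def histogram_document(values, tag, description):
    """Render a histogram and return a JSON document holding it as
    base64-encoded PNG data."""
    figure = plt.figure()
    plt.hist(values)
    buffer = io.BytesIO()
    figure.savefig(buffer, format="png")
    plt.close(figure)
    return json.dumps({"image": {
        "content": description,  # meta-data, e.g. English description
        "tag": tag,              # used by the outputter for a file name
        "image_type": "png",
        "data": base64.b64encode(buffer.getvalue()).decode("ascii"),
    }})
</pre>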

h2. Output images

@OutputPyImage@ extracts the base64-encoded image data from these JSON documents and saves it as an image file, e.g. EPS, PNG, JPG, ...

Configuration options:

* Filename prefix.
* Directory.
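
A rough counterpart sketch of what such an outputter does with a merger's document; again, the document layout and option names are assumptions:

<pre>
# Sketch of extracting base64-encoded image data and saving an image file.
import base64
import json
import os

def save_image(json_document, directory, prefix):
    """Decode the image data in json_document and write it to
    directory/<prefix><tag>.<image_type>."""
    image = json.loads(json_document)["image"]
    filename = "%s%s.%s" % (prefix, image["tag"], image["image_type"])
    with open(os.path.join(directory, filename), "wb") as image_file:
        image_file.write(base64.b64decode(image["data"]))
</pre>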

h2. Django web front-end

The web front-end uses Django, a Python web framework:

* Pages refresh every 5 seconds.
* The Django test web server is currently used.
* Images are served up from a directory.
* The "API" between the online reconstruction framework and the web front-end is just this directory.
* The web front-end can be run anywhere, so long as the images are made available "somehow".
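
A minimal sketch of serving the image directory with Django; the URL pattern and directory are assumptions, and the view shown is Django's stock static-file view rather than anything MAUS-specific:

<pre>
# Hypothetical urls.py serving histogram images straight from a directory.
from django.urls import re_path
from django.views.static import serve

IMAGE_DIR = "/var/maus/images"  # wherever the outputter writes its images

urlpatterns = [
    re_path(r"^images/(?P<path>.*)$", serve, {"document_root": IMAGE_DIR}),
]
</pre>
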
h2. Run numbers

Run numbers are assumed to be as follows:

* -N : Monte Carlo simulation of run N
* 0 : pure Monte Carlo simulation
* +N : run N
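
For example, a hypothetical helper interpreting this convention:

<pre>
# Illustrative only; not part of MAUS.
def describe_run(run_number):
    if run_number < 0:
        return "Monte Carlo simulation of run %d" % -run_number
    elif run_number == 0:
        return "pure Monte Carlo simulation"
    else:
        return "run %d" % run_number
</pre>
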
h2. Transforming spills from an input stream (Input-Transform)

This is the algorithm used to transform spills from an input stream:

<pre>
CLEAR document store
run_number = None
WHILE an input spill is available
  GET next spill
  IF spill does not have a run number
    # Assume pure MC
    spill_run_number = 0
  ELSE
    spill_run_number = run number of spill
  IF (spill_run_number != run_number)
    # We've changed run.
    IF spill is NOT a start_of_run spill
      WARN user of missing start_of_run spill
    WAIT for current Celery tasks to complete
      WRITE result spills to document store
    run_number = spill_run_number
    CONFIGURE Celery by DEATHing current transforms and BIRTHing new transforms
  TRANSFORM spill using Celery
  WRITE result spill to document store
DEATH Celery worker transforms
</pre>

If there is no initial @start_of_run@ spill (or no @spill_num@ in the spills) in the input stream (as can occur when using @simple_histogram_example.py@ or @simulate_mice.py@) then @spill_run_number@ will be 0, @run_number@ will be @None@ and a Celery configuration will be done before the first spill needs to be transformed.

Spills are inserted into the document store in the order of their return from Celery workers. This may not be in sync with the order in which they were originally read from the input stream.

h2. Merging spills and passing results to an output stream (Merge-Output)

This is the algorithm used to merge spills and pass the results to an output stream:

<pre>
run_number = None
end_of_run = None
is_birthed = FALSE
last_time = 01/01/1970
WHILE TRUE
  READ spills added since last_time from document store
  FOR EACH spill read
    last_time = date and time at which spill was added
    IF spill IS "end_of_run"
      end_of_run = spill
    IF spill_run_number != run_number
      # We've changed run.
      IF is_birthed
        IF end_of_run == None
          end_of_run = {"daq_event_type":"end_of_run", "run_num":run_number}
        SEND end_of_run to merger
        DEATH merger and outputter
      BIRTH merger and outputter
      run_number = spill_run_number
      end_of_run = None
      is_birthed = TRUE
    MERGE and OUTPUT spill
SEND end_of_run to merger
DEATH merger and outputter
</pre>

The Input-Transform policy of waiting for the processing of spills from a run to complete before starting to process spills from a new run means that all spills from run N-1 are guaranteed to have an earlier time stamp than spills from run N.

@is_birthed@ is used to ensure that there is no BIRTH-DEATH-BIRTH redundancy on receipt of the first spill.

h2. Document store

Spills are stored in documents in a collection in the document store.

Documents are of the form @{"_id":ID, "date":DATE, "doc":SPILL}@ where:

* ID: index of this document in the chain of those successfully transformed. It has no significance beyond being unique within one execution of the Input-Transform loop above. It is not equal to the @spill_num@. (Python @string@)
* DATE: date and time, to the millisecond, noting when the document was added. (Python @timestamp@)
* DOC: the spill document. (Python @string@ holding a valid JSON document)
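
For illustration, a minimal pymongo sketch of writing and reading such documents; the host, database and collection names are placeholders, not MAUS defaults:

<pre>
# Sketch of the document format described above; names are placeholders.
import datetime
import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["mausdb"]["hostname_1234"]  # e.g. HOSTNAME_PID

# Input-Transform side: write a transformed spill.
collection.insert_one({
    "_id": "1",                          # unique within this execution
    "date": datetime.datetime.utcnow(),  # when the document was added
    "doc": '{"spill_num": 1}',           # spill as a JSON string
})

# Merge-Output side: read everything added since last_time, oldest first.
last_time = datetime.datetime(1970, 1, 1)
query = collection.find({"date": {"$gt": last_time}})
for document in query.sort("date", pymongo.ASCENDING):
    spill = document["doc"]
</pre>
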
h3. Collection names

For Input-Transform,

* If configuration parameter @doc_collection_name@ is @None@, @""@, or @auto@ then @HOSTNAME_PID@, where @HOSTNAME@ is the machine name and @PID@ the process ID, is used.
* Otherwise the value of @doc_collection_name@ is used.
* @doc_collection_name@ has default value @spills@.

For Merge-Output,

* If configuration parameter @doc_collection_name@ is @None@, @""@, or undefined then an error is raised.
* Otherwise the value of @doc_collection_name@ is used.
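
The @auto@ rule above amounts to something like the following sketch (the helper name is hypothetical):

<pre>
# Sketch of the HOSTNAME_PID collection-name rule; helper name is made up.
import os
import socket

def collection_name(doc_collection_name):
    if doc_collection_name in (None, "", "auto"):
        return "%s_%s" % (socket.gethostname(), os.getpid())
    return doc_collection_name
</pre>
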
h2. Miscellaneous

* Currently Celery timeouts are not used; transforming a spill takes as long as it takes.
* Celery's option to retry tasks on failure is not used: if the transformation of a spill fails the first time, it cannot be expected to succeed on a retry.
* If memory leaks arise, e.g. from C++ code, look at Celery rate limits, which allow a limit to be defined on the time taken by, or number of, tasks before a sub-process is killed and respawned. Soft rate limits would allow @death@ to be run on the transforms first.