Bug #1595

Error Reconstructing MICE Data

Added by Nugent, John over 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
High
Category:
Data Structure
Target version:
-
Start date:
03 December 2014
Due date:
% Done:

100%

Estimated time:
Workflow:
New Issue

Description

Using MAUS 0.9.1, I attempt to reconstruct MICE data, and MAUS returns the following error for some events.

Traceback (most recent call last):
File "/data/neutrino05/jnugent/develG4beamline/src/common_py/ErrorHandler.py", line 159, in HandleCppException
raise CppError(error_message)
ErrorHandler.CppError: In branch recon_events
In branch part_event_number
Missing required branch part_event_number converting json->cpp at ValueItem::_SetCppChild

I am using the script analyze_data_offline.py with the standard MAUS data card and running over data from run 3407.


Files

maus_output.root (1.7 MB) maus_output.root ROOT file, 50 events from 3407.009 Rajaram, Durga, 11 December 2014 03:10
kl_cell_3407_009_x50.png (11.6 KB) kl_cell_3407_009_x50.png KL Cell hits plot, 50 events from 3407.009 Rajaram, Durga, 11 December 2014 03:10
3407_009_log.txt (8.95 KB) 3407_009_log.txt Log file from running on 50 events from 3407.009 Rajaram, Durga, 11 December 2014 03:10
json_3407_009_1.txt (9.38 MB) json_3407_009_1.txt run 3407.009, physics event 1, showing KL data problem Rajaram, Durga, 12 December 2014 01:17
#1

Updated by Karadzhov, Yordan over 7 years ago

  • Category changed from RealData to Data Structure
  • Assignee changed from Nugent, John to Rajaram, Durga
  • Priority changed from Normal to High

I don't understand this error message. I tried turning OFF all the maps and the output and running only with the input, but I still see this error. I also looked in the DAQ data of the spills in which this error appears and I can't see anything unusual.

There is also a second problem which is even more serious. When I try running over all the data files of this run (run 3407 has 30 data files), I find that the reconstruction progresses until event 4421. Then something happens and the process stops without even giving an error message. This happens somewhere in the middle of file 03407.025. But when I tried processing that particular file on its own, everything was OK and the whole file was processed.

#2

Updated by Rajaram, Durga over 7 years ago

Looking into it....so far have run through 001 without any errors.
John or Yordan -- can you add the run log (if you have it) to the issue tracker? Or point me to the subrun file which threw the error?

#3

Updated by Karadzhov, Yordan over 7 years ago

Correction to my previous post, regarding the first problem (the error message).
With all the maps OFF, the problem is gone. Actually, it looks like the problem is a conflict between the two KL maps and MapCppEMRPlaneHits. If I change the order in which the maps are appended to MapPyGroup and put the EMR map before the two KL maps, then everything is OK.

BTW the first file that gives this problem is 03407.009

#4

Updated by Dobbs, Adam over 7 years ago

Thanks for your efforts, all. It would be great if we can resolve this as quickly as possible, as I'm told it is holding up the pion paper, and we will need to get the fix into a release before it can be used for a publication.

#5

Updated by Nugent, John over 7 years ago

Hi Durga,

I don't have a log from running, but I notice this error on every file I try to run over.

Yordan, I applied the fix you suggested by running without the EMR mapper and over each individual sub-file of the dataset; however, I still see the same error with run 3407. I have now also run over 3506 and see a third error:

Traceback (most recent call last):
File "/data/neutrino05/jnugent/develG4beamline/src/common_py/ErrorHandler.py", line 159, in HandleCppException
raise CppError(error_message)
ErrorHandler.CppError: Failed to read next event at InputCppDAQData::_emit_cpp()

Is this an error, or is this the normal way for the job to exit at the end of the input file?

Cheers,
John

#6

Updated by Rajaram, Durga over 7 years ago

Nugent, John wrote:

Hi Durga,

I don't have a log from running, but I notice this error on every file I try to run over.

  • I see the error from some subfiles, not all. 3407.000 and 3407.001 do not give me any errors.
    Tried with both a fresh pull of the trunk and with the frozen MAUS-v0.9.1.
  • I do see the error with 3407.009, as Yordan pointed out.

Yordan, I applied the fix you suggested by running without the EMR mapper and over each individual sub-file of the dataset; however, I still see the same error with run 3407.

  • This one is a bit stranger. I tried Yordan's fix of moving the EMR mapper to before the KL mappers.
    I get the error -- but then it goes away if the outputter is changed to OutputPyJSON.
    But I don't think it's the EMR -- 3407 obviously doesn't have EMR data as far as I know.
    It could be a red herring, but I was able to narrow it down to some interplay with the MapCppKLDigits mapper.
    • With the default analyze_data_offline, turning off the KL and EMR, the error goes away.
    • Turn MapCppKLDigits on and it pops up. But I don't yet see why it causes that, nor why only for some (sub)runs.
  • Despite the errors, the job goes through and there is an output ROOT file -- have you looked at that to see if it's still sensible?
    • I see events and digits in there, and the event/spill count is sensible to me but haven't gone through to analyze it.
    • See attached ROOT file and KL cell hits plot
    • This is from 50 events of 3407.009, log attached
      • bin/analyze_data_offline.py -daq_data_path ~/data -daq_data_file 03407.009 --Number_of_DAQ_Events=50

        I have now also run over 3506 and see a third error:

Traceback (most recent call last):
File "/data/neutrino05/jnugent/develG4beamline/src/common_py/ErrorHandler.py", line 159, in HandleCppException
raise CppError(error_message)
ErrorHandler.CppError: Failed to read next event at InputCppDAQData::_emit_cpp()

Is this an error, or is this the normal way for the job to exit at the end of the input file?

  • That is benign: it comes from the emitter being confused by an empty end of data.

#7

Updated by Rajaram, Durga over 7 years ago

The problem comes from the DAQ data/unpacking.

1. In some spills there are more V1724 KL events than there are V1290 trigger or trigger_request events.
2. As a result, the "extra" KL events are not properly associated with a recon_event.
  • The recon_event array is initialized based on the number of particle triggers (V1290), so trying to stuff the extra KL events in beyond that results in those events being improperly set, e.g. with no part_event_number. That is what the error message ultimately shows when converting to ROOT, and why it doesn't show up if you write JSON output.

3. Yordan -- I don't know why this is, whether it's in the data or from some unpacking glitch -- can you look? I also don't know if the timing of these extra V1724 KL hits is meaningful -- i.e. are they really associated with the trigger? You might be able to make more sense of it.

4. I have attached a JSON output (human-readable) from one physics event [ the first physics_event in 03407.009 ] that shows the problem:
  • the part_event_number in the V1290 trigger data goes up to 101
  • the part_event_number in the V1724 KL data goes up to 104 (also note that KL.V1724 event 102 is empty)
  • The json was generated by:
    bin/analyze_data_offline.py -daq_data_path ~/data -daq_data_file 03407.009 --Number_of_DAQ_Events=5

    with the output module set to OutputPyJSON() in analyze_data_offline.py
5. Since the data is what it is, I can think of three solutions:
  • a) If Yordan finds that it's in the unpacking, fix it there.
  • b) If it's not the unpacker, modify the KLDigit mapper to ignore any KL daq-data beyond the trigger array size.
  • c) Live with the error message and continue using the extra KL events.

Since I don't know whether I can have any confidence in hits that came in beyond the trigger, my preference would be solution a) or b).

Reassigning to Yordan so he can take a look.
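
The mismatch in points 1 and 2 can be illustrated with a short, self-contained sketch. This is not MAUS code: the list below is a simplified stand-in for the recon_events array, the counts are taken from the 03407.009 example in point 4 (trigger part_event_number up to 101, KL up to 104), and the guard corresponds to option b), ignoring KL daq-data beyond the trigger array size.

```python
# Sketch of the event-count mismatch described above (not MAUS code).
# recon_events is sized from the number of V1290 particle triggers;
# any extra V1724 KL events have no slot, hence no part_event_number.

n_triggers = 102          # V1290 part_event_number runs 0..101
n_kl_events = 105         # V1724 KL part_event_number runs 0..104

# One recon_event slot per particle trigger, as in the event builder.
recon_events = [{"part_event_number": i, "kl": None} for i in range(n_triggers)]

# Option b): ignore any KL daq-data beyond the trigger array size.
orphan_kl = []
for kl_event in range(n_kl_events):
    if kl_event < len(recon_events):
        recon_events[kl_event]["kl"] = kl_event
    else:
        orphan_kl.append(kl_event)  # would otherwise lack part_event_number

print(len(recon_events), orphan_kl)  # 102 [102, 103, 104]
```

The three orphan events are the ones that, in the real event builder, end up without a part_event_number and so trigger the json->cpp conversion error when writing ROOT output.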

#8

Updated by Karadzhov, Yordan over 7 years ago

I can confirm that the problem is in the data and that it makes the corrupted data useless (in principle it can be recovered, but this will be tricky). I am working on an update of the unpacking which will be able to detect the problem at an earlier stage.
I have also tested runs 3506, 3507 and 3509, and I can confirm that 3506 and 3509 are OK. Run 3507 starts having the same problem in file 03507.014. If you send me the list of all runs used in the analysis, I can test them all.

#9

Updated by Nugent, John over 7 years ago

Several runs were used in this analysis, including 3509, 3253, 3426, 3250, 3261 & 3454. For the time being I will use the runs without the error; presumably the fix will be in the next MAUS release?

Thanks for looking into this Durga, Yordan.

#10

Updated by Rajaram, Durga over 7 years ago

Yordan -- is it just the files/spills with the error which are corrupt [ to be thrown away ]?
Or do you think there's corruption in the entire run in which an error comes up?
Can one use, e.g., 3407.000-008, where there are no errors?

Secondly, is it something that you can catch at the event building stage? In other words, what is the way to make sure we don't take corrupted data when we are running next year?

#11

Updated by Karadzhov, Yordan over 7 years ago

The problem is in the event builder. The data starts being corrupted after the first error.

#12

Updated by Rogers, Chris over 7 years ago

Yordan, could this affect any other data? Ones not used in John's analysis?

#13

Updated by Nugent, John over 7 years ago

Durga, I would like all of the runs which I have mentioned to be reprocessed in order to complete the PID paper. Given the number of runs and the required CPU time, I cannot complete this myself on the Glasgow batch system. I understand that the data processing is a Grid job and uses our resources there. Do you know what the procedure is to get this reprocessing done, or who to talk to? Is Janusz in charge of this task?

Cheers,
John

#14

Updated by Karadzhov, Yordan over 7 years ago

I have tested all the runs from the list and I found the following:

3250 - OK
3253 - OK
3261 - OK
3407 - OK up to file 03407.007
3426 - OK up to file 03426.016
3454 - OK
3506 - OK
3507 - OK up to file 03507.013
3509 - OK

I will try making a data recovery program in order to save the broken runs.

#15

Updated by Rajaram, Durga over 7 years ago

Thanks, Yordan.

John - Janusz handles the GRID (re)processing. Is your intention to reprocess, and exclude the bad files? Do you need to have it done with the current MAUS - 0.9.2?

#16

Updated by Nugent, John over 7 years ago

Including all the files would be ideal; however, reprocessing what we can will have to do, given the time constraints. Yes, the processing needs to be done with the latest MAUS version. I'll send a message to Janusz and see what he says.

Thanks

#17

Updated by Nebrensky, Henry over 7 years ago

Batch reprocessing is by definition done on all files within a given Step. Technical issues aside, covering only a subset is likely to cause confusion in the future when some other user finds their runs missing.

Is the request that Janusz install 0.9.2 on the Grid and re-process Step1 with it?
(Probably should be in #1408 or a new ticket)

#18

Updated by Nugent, John over 7 years ago

I have already emailed Janusz and asked him to re-process the data with MAUS v 0.9.1 as that is the version currently installed on the Grid. For the pion contamination paper only a subset of runs are required so getting those back is the priority.

#19

Updated by Nebrensky, Henry over 7 years ago

But reprocessing with 0.9.1 was done back in October - I can see the output for 3507 and 3509 at least.

Any run that crashed 0.9.1 in October will still crash it in December. What I read from the ticket above is that it needs a MAUS release that at least detects the corruption and stops cleanly, and that doesn't exist yet.

#20

Updated by Nugent, John over 7 years ago

Yordan, do you have an estimate of how far we are from having a fix for this bug in the MAUS trunk?

#21

Updated by Dobbs, Adam over 7 years ago

Yordan, could you give me an update on the present status of the fix, please? It should be done as soon as possible, as it is now delaying the PID paper. Thanks.

#22

Updated by Karadzhov, Yordan over 7 years ago

The data recovery program is ready. To get the code, do

bzr branch lp:unpacking-mice

Then follow the instructions in the README.txt file. When you have built the code, you should have an executable

unpacking-mice/bin/mdrebuild

To use the recovery program, do

./mdrebuild -d /path/to/the/data -f XXX

where XXX is the run number.

The recovered data will be recorded in files named XXX.9YY (XXX is again the run number and YY is the number of the file).
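
For the broken runs identified in note #14 (3407, 3426 and 3507), the recovery could be driven by a small loop. This is a sketch only: the data path is a placeholder, the run list is taken from note #14, and the snippet merely builds and prints the mdrebuild command lines so they can be inspected before being run (e.g. via a shell or subprocess).

```python
# Sketch: build the mdrebuild command line for each broken run from
# note #14. The data directory is a placeholder -- point it at the
# real raw-data location before use.
data_dir = "/path/to/the/data"
broken_runs = [3407, 3426, 3507]

commands = ["./mdrebuild -d %s -f %d" % (data_dir, run) for run in broken_runs]
for cmd in commands:
    print(cmd)
```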

#23

Updated by Dobbs, Adam over 7 years ago

Thanks Yordan. Let me summarise where I think we are:

  • We have corrupt data
  • We have a recovery programme in bzr branch lp:unpacking-mice
  • We do not have a way for MAUS to detect a problem run and deal with it automatically

Is the plan then to merge the recovery programme into the trunk and perform a release, and then replace the corrupt data with the cleaned up version from the recovery programme?

Durga, Chris, John is this solution satisfactory for your needs? Once I have done the release who would create the cleaned up data on the Grid? Henry?

#24

Updated by Rogers, Chris over 7 years ago

I think the physics group needs more information about what went wrong. I need a document detailing: what was the nature of the error; how was the fix implemented; how was the fix validated; and what steps are being taken to ensure this error will not be repeated. Durga, can you coordinate this? Bad data which got through all the various software and computing checks is really scary for me. Why was this not picked up in the MLCR? Why was it not picked up during batch reconstruction?

#25

Updated by Dobbs, Adam over 7 years ago

Correction: I notice Yordan's repair programme is not a part of MAUS, so we can skip the merging into a release part, and just use the programme directly.

#26

Updated by Dobbs, Adam about 7 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

I think this has now been fixed for a while (as of MAUS 0.9.5), so I am closing the issue.

#27

Updated by Nebrensky, Henry about 7 years ago

Sorry - I didn't see the question earlier...

It might not be that simple - the last batch reprocessing of Step 1 data was done last October with MAUS 0.9.1, which fails on the corrupted data.

So there is no complete set of fully-reconstructed Step 1 data available.

If MICE wants that data visible, then either
  • the data recovery needs to be embedded within MAUS so that the data is corrected automagically, or
  • the data recovery needs to be available as a third-party package in a MAUS release, and the Grid reconstruction script tweaked so as to correct the data locally before running the usual reconstruction over it,

and someone then has to ask Janusz to re-run the Batch Reprocessing with the newer MAUS over Step 1.

Possibly one of the first two options has already been done, but certainly the last bit hasn't!

(This has since been reopened as #1702)
