Bug #1677

execute_against_data return code 2 on Grid

Added by Nebrensky, Henry over 6 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: bin(aries)
Target version:
Start date: 06 May 2015
Due date:
% Done: 100%
Estimated time:
Workflow: New Issue

Description

We need a way to catch jobs failing on the Grid with a return code of 2, and arrange to get their output back to experts for debugging.

One suggestion was a shell-script wrapper around execute_against_data.py that ensures an NNNNN_offline.tar is always built and returned; see below.


Related issues

Related to MAUS - Feature #1656: execute_against_data - md5sum in the output please? (Open; Dobbs, Adam; 26 March 2015)

#1

Updated by Nebrensky, Henry over 6 years ago

Two parts:

  • Need to ensure that a tarball and a return status code always get back to the Grid wrapper
    • either via another shell wrapper or internally within execute_against_data.py
  • The returned status code needs to be stored within MetaDB and thence pulled into CDB, to identify unusable data.
    • this pulling is also needed for the long-awaited status flags, etc.

This ticket is for the first part, which I think MAUS should own. This would automatically make output from failed jobs available, but the runs in question could still only be identified by opening the tarballs.
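
Until the status code reaches MetaDB, finding the failed runs means looking inside each tarball. A minimal sketch of doing that in bulk, assuming (hypothetically) that each tarball carries its return code in a file called return_code.txt; neither that file name nor the function name list_return_codes is defined anywhere in MAUS or on this ticket:

```bash
#!/usr/bin/env bash
# list_return_codes DIR
# Prints "<tarball name> <return code>" for every *_offline.tar in DIR that
# contains a return-code file. Nothing is unpacked to disk.
# ASSUMPTION: the code is stored in a member called return_code.txt
# (hypothetical name), either as "./return_code.txt" or "return_code.txt".
list_return_codes() {
    local dir=$1 tb code
    for tb in "$dir"/*_offline.tar; do
        [ -e "$tb" ] || continue          # glob matched nothing
        # -O sends the extracted member to stdout instead of writing a file.
        code=$(tar -xOf "$tb" ./return_code.txt 2>/dev/null) ||
            code=$(tar -xOf "$tb" return_code.txt 2>/dev/null) ||
            continue                      # no return-code file in this tarball
        printf '%s %s\n' "${tb##*/}" "$code"
    done
}
```

Piping the output through something like `awk '$2 != 0'` would leave only the runs that did not exit cleanly.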

The second part is a Grid task; it has to wait until other CDB work is done and effort becomes available.

> On 07/04/15 16:25, Henry Nebrensky wrote:
> > Hi,
> >
> > I think the correct solution is that MAUS should provide its own shell
> > wrapper, which takes any appropriate action in response to the MAUS
> > return code and then builds the usual 062nn_offline.tar. This I think
> > would also catch some of the cases where MAUS gets killed hard because
> > of a memory leak.
> >
> > i.e. the MAUS wrapper should
> > - receive instructions (e.g. run number) from Janusz' wrapper
> > - run MAUS
> > - catch the return code from MAUS
> > - do any appropriate diagnostics (e.g. disk space?)
> > - do the checksums that have been requested
> > - pack all available output into a tarball with the usual name
> > - terminate, passing on the MAUS return code to Janusz
> >
> > I think this is a MAUS thing, to keep a clear demarcation between Grid
> > and MAUS.
> >
> > It also points up that we need to work out how to get the failure codes
> > back to the physics shifter, and we might want to think about error
> > codes related to Grid running, although e.g. no disk space left would
> > fit under MAUS' 1.
>
> I would be happy if there was a file in the tarball that contained an error code; at least this catches the MAUS failure modes.
>
> I guess the advantage of a bash script is that it would be thinner, so easier to debug/less prone to bugs, i.e. some lines like:
>
> - check MAUS_ROOT_DIR exists
> - python ${MAUS_ROOT_DIR}/path/to/execute_against_data.py
> - append return code to the tar ball
>
> Is that what you had in mind Henry?

That's about it...

... although I would tend to have the shell script do all the tarballing. My reasoning is that if python ${MAUS_ROOT_DIR}/path/to/execute_against_data.py then decides to use 12 GB of memory and is brutally terminated by the batch system, the shell script as a separate process can still collect the wreckage and return something for the developers.
Though I admit this is based on a very out-of-date understanding of the details behind Grid job submission - hence my trying to organise the Grid workshop to flush any experts out of the woodwork!
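
The wrapper discussed above could look roughly like the following. This is only a sketch under stated assumptions, not the MAUS implementation: the function name run_and_pack, the working-directory argument, and the file names return_code.txt, disk_space.txt and md5sums.txt are all hypothetical; only the <run number>_offline.tar naming comes from this ticket. The job runs as a child process, so if it is killed the shell still gets an exit status (128 plus the signal number in bash) and can pack whatever was left behind. It does not help if the batch system kills the wrapper as well.

```bash
#!/usr/bin/env bash
# run_and_pack RUN_NUMBER WORKDIR COMMAND [ARGS...]
# Runs COMMAND inside WORKDIR, then packs everything found in WORKDIR into
# <RUN_NUMBER>_offline.tar in the caller's directory, whether or not the
# command succeeded. Returns the command's own return code.
run_and_pack() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_and_pack RUN_NUMBER WORKDIR COMMAND [ARGS...]" >&2
        return 64
    fi
    local run_number=$1 workdir=$2
    shift 2
    local tarball="${run_number}_offline.tar"
    local rc

    mkdir -p "$workdir" || return 1

    # Run the job as a separate child process and catch its return code.
    ( cd "$workdir" && "$@" )
    rc=$?

    (
        cd "$workdir" || exit 1
        # Store the return code in a file so that it travels with the output.
        # (hypothetical file name)
        printf '%s\n' "$rc" > return_code.txt
        # Basic diagnostics: free disk space where the job ran.
        df -k . > disk_space.txt 2>&1
        # Checksums of everything the job left behind, if md5sum is installed.
        if command -v md5sum >/dev/null 2>&1; then
            find . -type f ! -name md5sums.txt -exec md5sum {} + > md5sums.txt
        fi
    )

    # Pack all available output. The tarball is written outside WORKDIR,
    # so it never tries to include itself.
    tar -cf "$tarball" -C "$workdir" . ||
        echo "warning: could not build $tarball" >&2

    # Pass the job's return code on to the caller (the Grid wrapper).
    return "$rc"
}
```

A wrapper script would end with something like `run_and_pack "$run_number" "$workdir" python "${MAUS_ROOT_DIR}/path/to/execute_against_data.py"` followed by `exit $?`, with the run number and working directory supplied by the Grid wrapper.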

Thanks

Henry

#2

Updated by Dobbs, Adam over 4 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

I think everyone is now happy with this.
