Bug #1190

MongoDB timeout

Added by Rogers, Chris almost 9 years ago. Updated over 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Online reconstruction
Target version:
Start date: 13 December 2012
Due date:
% Done: 100%
Estimated time:
Workflow: New Issue

Description

I ran the online recon overnight and received the following error:

Ending job
Clearing Globals
Traceback (most recent call last):
  File "/home/mice/MAUS/.maus_control-room/bin/online/reconstruct_daq_tof_reducer.py", line 71, in <module>
    run()
  File "/home/mice/MAUS/.maus_control-room/bin/online/reconstruct_daq_tof_reducer.py", line 68, in run
    MAUS.Go(my_input, my_map, reducer, output_worker, data_cards) 
  File "/home/mice/MAUS/.maus_control-room/src/common_py/Go.py", line 131, in __init__
    self.get_job_footer())
  File "/home/mice/MAUS/.maus_control-room/src/common_py/framework/merge_output.py", line 281, in execute
    raise DocumentStoreException(exc)
docstore.DocumentStore.DocumentStoreException: Exception when using document store: cursor id '7857742081767663573' not valid at server

From:
http://api.mongodb.org/python/current/faq.html

What does OperationFailure cursor id not valid at server mean?

Cursors in MongoDB can timeout on the server if they’ve been open for a long time without any operations being performed on them. This can lead to an OperationFailure exception being raised when attempting to iterate the cursor.
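
A minimal sketch of the failure mode described in the FAQ, using an in-memory stand-in rather than a real MongoDB server. FakeServer, CursorNotValid, read_all and the idle timeout value are hypothetical names invented for illustration; they are not part of pymongo or MAUS. The idea is that a cursor left idle for longer than the server timeout fails on the next read, and the client can recover by opening a new cursor from the first document it has not yet seen.

```python
class CursorNotValid(Exception):
    """Hypothetical stand-in for the 'cursor id not valid at server' failure."""


class FakeServer:
    """In-memory stand-in for a server that drops idle cursors (illustration only)."""

    def __init__(self, docs, idle_timeout):
        self.docs = docs
        self.idle_timeout = idle_timeout
        self.now = 0.0          # simulated clock, advanced by the caller
        self._cursors = {}      # cursor id -> [next position, last-used time]
        self._next_id = 0

    def open_cursor(self, start=0):
        self._next_id += 1
        self._cursors[self._next_id] = [start, self.now]
        return self._next_id

    def next_doc(self, cursor_id):
        state = self._cursors.get(cursor_id)
        if state is None or self.now - state[1] > self.idle_timeout:
            self._cursors.pop(cursor_id, None)
            raise CursorNotValid("cursor id '%d' not valid at server" % cursor_id)
        position = state[0]
        if position >= len(self.docs):
            return None  # no more documents
        self._cursors[cursor_id] = [position + 1, self.now]
        return self.docs[position]


def read_all(server, idle_gaps):
    """Read every document, reopening the cursor if the server has dropped it.

    idle_gaps[i] is the simulated time the client sits idle before read i.
    """
    seen = []
    reopened = 0
    cursor = server.open_cursor()
    index = 0
    while True:
        if index < len(idle_gaps):
            server.now += idle_gaps[index]
        index += 1
        try:
            doc = server.next_doc(cursor)
        except CursorNotValid:
            # Resume from the first document not yet seen.
            cursor = server.open_cursor(start=len(seen))
            reopened += 1
            doc = server.next_doc(cursor)
        if doc is None:
            return seen, reopened
        seen.append(doc)


# The client idles for 60 time units before the second read (timeout is 10).
docs, reopened = read_all(FakeServer(["a", "b", "c"], idle_timeout=10), [1, 60, 1])
```

Here the long idle gap invalidates the first cursor, and the reader carries on with a fresh one instead of aborting.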

This may be related to a crash in the DAQ that happened during the same run: possibly the DAQ failed at some point, and MAUS then sat waiting for spills and eventually gave up. However, MAUS should be able to run indefinitely without receiving data.

#1

Updated by Rogers, Chris almost 9 years ago

Email from Yagmur:

It threw a fit around 18:30,

    FATAL!!! from V1290(GEO1): Trigger mismatch(nEvts 0!=3224).

I started it up again. I first tried using run control (which Pierrick had restarted remotely) but it froze during the initial configuration dialog. 

Looking at the time stamps on the log file, it looks like MAUS stopped around 19:09. Annoyingly, I accidentally overwrote the full log file (sorry).

#2

Updated by Rogers, Chris over 8 years ago

  • Assignee changed from Richards, Alexander to Rogers, Chris

Looks like there is a general instability in MongoDB that is uncovered by integration tests. Current workaround is to catch the error thrown by MongoDB and then continue processing...

#3

Updated by Rogers, Chris over 8 years ago

Looks like there are occasional problems in reading the database during the merge_output cycle.

I edited MongoDBDocumentStore to raise a DocumentStoreError if a get or get_since call fails. I changed the call structure in merge_output to skip over a DocumentStoreError and attempt to continue iteration, i.e. if the get_since() call fails, merge_output will ignore the failure and wait for new data rather than crashing.
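
The catch-and-continue behaviour described above can be sketched roughly as follows. This is not the MAUS code: FlakyDocStore, merge_loop, max_polls and poll_delay are hypothetical names for illustration, the DocumentStoreError class here is a local stand-in, and the loop is bounded only so that the example terminates (the real job is meant to keep polling).

```python
import time


class DocumentStoreError(Exception):
    """Local stand-in for the docstore exception named in the comment above."""


class FlakyDocStore:
    """Hypothetical in-memory doc store whose get_since() fails on chosen calls."""

    def __init__(self, docs, fail_on_calls):
        self.docs = docs                      # list of (timestamp, payload)
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def get_since(self, timestamp):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise DocumentStoreError("cursor id not valid at server")
        return [doc for doc in self.docs if doc[0] > timestamp]


def merge_loop(store, max_polls, poll_delay=0.0):
    """Poll the store; a failed read is skipped rather than ending the job."""
    last_time = -1
    merged, failures = [], 0
    for _ in range(max_polls):
        try:
            new_docs = store.get_since(last_time)
        except DocumentStoreError:
            failures += 1
            time.sleep(poll_delay)   # wait for new data, then try again
            continue
        for timestamp, payload in new_docs:
            merged.append(payload)
            last_time = max(last_time, timestamp)
        time.sleep(poll_delay)
    return merged, failures


# The first read fails; the second picks up both documents.
store = FlakyDocStore([(0, "spill 0"), (1, "spill 1")], fail_on_calls=[1])
merged, failures = merge_loop(store, max_polls=3)
```

Because last_time is only advanced after a successful read, a failed get_since() loses nothing: the next successful call returns everything newer than the last document merged.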

In the same set of changes I also changed the KeyboardInterrupt (ctrl-c) handling so that merge_output will finish processing any data in the docstore before exiting. If there is a backlog this can delay the exit, in which case the user will have to use SIGKILL (e.g. kill -9 <pid> from the command line; a plain kill <pid> sends SIGTERM).
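
A rough sketch of the drain-on-interrupt idea, assuming a simple in-memory queue in place of the docstore. run_until_interrupted and interrupt_after are hypothetical names, and the ctrl-c is simulated by raising KeyboardInterrupt directly so that the example can run unattended.

```python
from collections import deque


def run_until_interrupted(backlog, process, interrupt_after):
    """Process items; on KeyboardInterrupt, finish the backlog before returning.

    interrupt_after simulates the user pressing ctrl-c after that many items.
    """
    processed = []
    interrupted = False
    try:
        while backlog:
            if len(processed) == interrupt_after:
                raise KeyboardInterrupt  # simulated ctrl-c
            processed.append(process(backlog.popleft()))
    except KeyboardInterrupt:
        interrupted = True
        # Drain whatever is still queued. With a large backlog this is the
        # step that can keep the process alive long after ctrl-c.
        while backlog:
            processed.append(process(backlog.popleft()))
    return processed, interrupted


done, interrupted = run_until_interrupted(
    deque([1, 2, 3, 4]), lambda item: item * 10, interrupt_after=2
)
```

In this sketch a second ctrl-c during the drain is not caught and would propagate out of the function; whether merge_output behaves the same way is not stated in this ticket.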

#4

Updated by Rogers, Chris over 8 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

We now catch exceptions arising from the document store. This sort of error can happen, for example, when the doc store has been purged and the current reducer event is no longer valid.

Fixed in r947

#5

Updated by Rogers, Chris over 8 years ago

  • Target version changed from Future MAUS release to MAUS-v0.5.4
