
Bug #1328

Memory leak in MAUS online recon

Added by Rogers, Chris over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Online reconstruction
Target version:
Start date:
02 August 2013
Due date:
% Done:

100%

Estimated time:
Workflow:
New Issue

Description

top shows

15467 mice      15   0 2511m 1.5g  47m R 54.5 40.1  80:04.60 python

ps -eF shows

$> ps -eF | grep 15467
mice     15467 15458 79 642874 1620976 3 12:15 pts/1   01:20:22 python /home/mice/MAUS/.maus_release/bin/online/analyze_data_online_input_transform.py -mongodb_database_name=maus-new -type_of_dataflow=multi_process_input_transform -verbose_level=0 -DAQ_hostname=miceraid5

Looks like we have a leak...
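Rather than eyeballing top by hand, the resident set size of the suspect process can be sampled directly from /proc (Linux only). A minimal sketch, not part of MAUS itself:

```python
import os

def rss_kb(pid):
    """Return the resident set size of a process in kB, read from /proc.

    Sampling this periodically for the recon PID shows a leak as a
    monotonically growing value. Linux-only.
    """
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # VmRSS is reported in kB
    return 0

# Sample our own process as a demonstration
print(rss_kb(os.getpid()))
```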


Files

mem_leak.log (10.9 MB) Rogers, Chris, 02 August 2013 15:14
test_analyze_data_online.tar.gz (86.4 MB) Rogers, Chris, 18 August 2013 07:03

Related issues

Related to MAUS - Bug #1330: Online reconstruction not catching change of run number in histogram production (Closed, Rogers, Chris, 05 August 2013)

#1

Updated by Rogers, Chris over 7 years ago

Running the test script against trunk yields the following leaks from the MAUS namespace (full valgrind log attached):

==28447== Conditional jump or move depends on uninitialised value(s)
==28447==    at 0x10E3FA11: MAUS::VirtualPlaneManager::ConstructVirtualPlanes(BTField const*, MiceModule*) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E393F4: MAUS::MAUSGeant4Manager::MAUSGeant4Manager() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E39530: MAUS::MAUSGeant4Manager::GetInstance() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E6A3DD: MAUS::GlobalsManager::InitialiseGlobals(std::string) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x1077F1B4: MAUS::PyGlobals::Birth(_object*, _object*) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/maus_cpp/globals.so)
==28447==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==28447==    by 0x4F1FCE4: PyEval_EvalCodeEx (ceval.c:3253)
==28447==    by 0x4EA35EB: function_call (funcobject.c:526)
==28447==    by 0x4E7B7F2: PyObject_Call (abstract.c:2529)
==28447==    by 0x4E88F4E: instancemethod_call (classobject.c:2578)
==28447==    by 0x4E7B7F2: PyObject_Call (abstract.c:2529)
==28447==    by 0x4F18666: PyEval_CallObjectWithKeywords (ceval.c:3882)

==29734== 8 bytes in 1 blocks are possibly lost in loss record 8,130 of 68,074
==29734==    at 0x4C28B35: operator new(unsigned long) (vg_replace_malloc.c:261)
==29734==    by 0x14CF1D05: std::vector<G4VProcess*, std::allocator<G4VProcess*> >::_M_insert_aux(__gnu_cxx::__normal_iterator<G4VProcess**, std::vector<G4VProcess*, std::allocator<G4VProcess*> > >, G4VProcess* const&) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14CEF3D8: G4ProcessManager::InsertAt(int, G4VProcess*, int) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14CF0579: G4ProcessManager::SetProcessOrderingToFirst(G4VProcess*, G4ProcessVectorDoItIndex) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14F30A8F: G4VUserPhysicsList::AddTransportation() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F2FA74: G4VModularPhysicsList::ConstructProcess() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x10E353F0: MAUS::MAUSPhysicsList::ConstructProcess() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==29734==    by 0x14F25DC8: G4RunManagerKernel::InitializePhysics() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F1F480: G4RunManager::InitializePhysics() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F1F6BF: G4RunManager::Initialize() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x10E393FD: MAUS::MAUSGeant4Manager::MAUSGeant4Manager() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==29734==    by 0x10E39530: MAUS::MAUSGeant4Manager::GetInstance() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)

Lots of memory leaks in G4 apparently - looks like some sloppiness in the way Rogers runs the Globals... what a moron

#2

Updated by Rogers, Chris over 7 years ago

I have cleaned up the G4 memory usage and this will be merged shortly. However, on second look it appears that the mongodb stuff is not being properly deleted. I have set up a simulation of the data taking; my aim is to simulate online reconstruction for the entire weekend's running, then set this up as a regression test/load test.

#3

Updated by Rogers, Chris over 7 years ago

Going through test/integration/test_analyze_online_reconstruction/test_analyze_online_reconstruction.py using dataset consisting of all data from 4956-4997.

Process flow looks like:

  1. <head node> take input
  2. <head node> hand input to celery workers
  3. <celery nodes> do reconstruction
  4. <head node> poll celery workers for completed reconstructions
  5. <head node> hand completed reconstruction to mongodb

Try:

  • Break program flow at 2 (commenting framework.input_transform.submit_spill_to_celery) removes the memory leak
  • Disable 3 (commenting mappers in analyze_data_online_input_transform.py) leaves the memory leak (expected, as this only stops reconstruction happening on the child process)
  • Break program flow at 4 (commenting while loop in framework.input_transform.poll_celery_tasks) removes the memory leak
  • Break program flow at 5 (commenting self.doc_store.put(...)) leaves the memory leak in place
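For reference, the five-step flow above can be sketched as a minimal submit-and-poll loop (a thread pool stands in for the celery workers; all names here are hypothetical, not the real framework code):

```python
from multiprocessing.pool import ThreadPool

def reconstruct(spill):
    """Stand-in for the worker-side transform step (step 3)."""
    return {"spill": spill, "recon": spill * 2}

def head_node_loop(spills, store):
    """Steps 1, 2, 4 and 5: submit spills to workers, poll for
    completed reconstructions, hand results to the document store."""
    pool = ThreadPool(2)
    tasks = [pool.apply_async(reconstruct, (s,)) for s in spills]  # step 2
    while tasks:  # step 4: poll until every submitted task has come back
        for task in list(tasks):
            if task.ready():
                store.append(task.get())  # step 5: put result in the store
                tasks.remove(task)
    pool.close()
    pool.join()

store = []
head_node_loop(range(4), store)
print(len(store))
```

The point of the experiments above is that the leak only persists when tasks are submitted (step 2) and polled (step 4), which narrows the suspect to the task bookkeeping rather than the result storage.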
#4

Updated by Rogers, Chris over 7 years ago

Nailed this down to line 154 of input_transform.py

                 self.celery_tasks.pop(current)

Why is that leaking? self.celery_tasks is a list. It doesn't leak if I don't pop the list...

#5

Updated by Rogers, Chris over 7 years ago

It looks like the reconstruction task returns two flags, failed() and successful(). If successful() was false and failed() was true, the reconstruction never cleared the task from the queue. For some reason quite a bit of stuff is failing in this manner. So the memory leak can be fixed, but I would like to know what these two flags actually mean...
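In celery, an AsyncResult's successful() is only True once the task state is SUCCESS, and failed() only once it is FAILURE; a task that is still in flight returns False for both, so bookkeeping that only pops on success or failure leaves it in the queue forever. A sketch of the corrected pruning (hypothetical names; the states here model celery.states rather than importing celery):

```python
# A result that is neither SUCCESS nor FAILURE is still in flight
# (e.g. PENDING, STARTED, RETRY) and must stay in the queue.
FINISHED_STATES = ("SUCCESS", "FAILURE")

def prune_tasks(celery_tasks):
    """Drop only tasks that have genuinely finished, keeping the rest.

    celery_tasks is a list of (task_id, state) pairs standing in for
    the real list of celery AsyncResult objects.
    """
    still_running = []
    for task_id, state in celery_tasks:
        if state in FINISHED_STATES:
            continue  # finished one way or the other: safe to pop
        still_running.append((task_id, state))  # keep polling this one
    return still_running

tasks = [("a", "SUCCESS"), ("b", "PENDING"), ("c", "FAILURE")]
print(prune_tasks(tasks))  # only the pending task survives
```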

#6

Updated by Rogers, Chris over 7 years ago

I successfully ran through the whole weekend's worth of data taking. Observations:

  • I checked roughly halfway through the data taking. The processes looked to have leaked a little - they were up from 1.5% of memory usage to maybe 2%.
  • Mongodb was running at 389% of cpu usage! I was running on an Intel 4/8 core machine.
  • Log files report Spills input: 98174 Processed: 98830 Failed: 28277 - bear in mind that these numbers include start_of_burst and end_of_burst, i.e. a lot of jobs are failing

I attach full logs and end of run histos.

#7

Updated by Rogers, Chris over 7 years ago

Okay, so this bug is now fixed. It turns out that if the result is neither success nor failure it is pending. I now handle that case properly, I think. I am once again getting lots of these bloody

 too much data for sort() with no index.

errors from Mongo-sodding-DB.

#8

Updated by Rogers, Chris over 7 years ago

I did the sort (by reconstructed timestamp) in native python rather than using MongoDB calls and now MongoDB looks more stable. I also added tests for most of merge_output.py (but not all; I am running out of time before the upcoming user run). It looks like it is no longer leaking right now...
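The "too much data for sort() with no index" error disappears once the documents are pulled out unsorted and ordered client-side. A minimal sketch of the native-python sort by reconstructed timestamp (the field name is hypothetical, not the real document schema):

```python
def sort_by_timestamp(docs):
    """Order documents by reconstruction timestamp in Python,
    avoiding an unindexed server-side MongoDB sort()."""
    return sorted(docs, key=lambda doc: doc["recon_timestamp"])

# Documents as they might come back from the store, out of order
docs = [
    {"spill": 2, "recon_timestamp": 30.0},
    {"spill": 1, "recon_timestamp": 10.0},
    {"spill": 3, "recon_timestamp": 20.0},
]
print([doc["spill"] for doc in sort_by_timestamp(docs)])  # [1, 3, 2]
```

The trade-off is that all documents must fit in the client's memory; the server-side alternative would be to add an index on the sort key.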

#9

Updated by Rogers, Chris over 7 years ago

This is looking darker and darker. It looks like a bug in the celery stack. I tried updating to a new version of celery and now I am getting a MySQL error (but I never use MySQL anywhere; the DAQ does, though, and surely a new version of celery shouldn't screw that up?).

The plan now is to make my own multiprocessing queue (which won't be networked). celery is just not working out for me.

#10

Updated by Rogers, Chris over 7 years ago

Sod it. I set the application launcher to just watch memory usage and then kill/restart any process that eats up more memory than it should. By default the maximum memory usage is set to 15% of total.

My preference is still to use my own multiprocessing queue, probably using the existing MongoDB. But life is short... and this solution seems to work. Note that it will not be possible to run a load test against this, as the offline/online stuff will be killed.
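The watch-and-restart workaround can be sketched as follows (Linux /proc only; the helper and the restart callback are hypothetical, not the actual MAUS launcher code):

```python
import os

def mem_percent(pid, meminfo="/proc/meminfo"):
    """Resident memory of `pid` as a percentage of total system memory."""
    with open(meminfo) as f:
        total_kb = int(f.readline().split()[1])  # first line is MemTotal
    rss_kb = 0
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])  # reported in kB
    return 100.0 * rss_kb / total_kb

def watchdog(pids, restart, limit=15.0):
    """Kill and restart any watched process above `limit` percent of
    total memory, matching the launcher's default 15% cap."""
    for pid in pids:
        if mem_percent(pid) > limit:
            os.kill(pid, 9)
            restart(pid)

# Demonstrate the measurement on our own process
print(mem_percent(os.getpid()))
```

Calling watchdog periodically from the launcher's main loop bounds the damage a leaking process can do, at the cost of losing whatever work the killed process had in flight.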

#11

Updated by Rogers, Chris over 7 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

Closed in r990

#12

Updated by Rajaram, Durga over 7 years ago

  • Target version changed from Future MAUS release to MAUS-v0.7.1
