Bug #1328
Memory leak in MAUS online recon
100%
Description
top shows
15467 mice 15 0 2511m 1.5g 47m R 54.5 40.1 80:04.60 python
ps -eF shows
$> ps -eF | grep 15467
mice 15467 15458 79 642874 1620976 3 12:15 pts/1 01:20:22 python /home/mice/MAUS/.maus_release/bin/online/analyze_data_online_input_transform.py -mongodb_database_name=maus-new -type_of_dataflow=multi_process_input_transform -verbose_level=0 -DAQ_hostname=miceraid5
Looks like we have a leak...
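For reference, one quick way to confirm and quantify the growth is to sample the process's resident set size over time. A minimal sketch (Linux /proc assumed; the PID is just the one from the top output above, and the 10 s interval is arbitrary):

# Leak-watch sketch (Linux /proc assumed). Samples the resident set size of a
# process so the growth rate of the suspected leak can be quantified.
import sys
import time

def rss_kb(pid):
    """Return VmRSS in kB for the given pid, read from /proc."""
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    pid = int(sys.argv[1])   # e.g. 15467 from the top output above
    while True:
        print("%s  VmRSS = %d kB" % (time.strftime("%H:%M:%S"), rss_kb(pid)))
        time.sleep(10)

A steadily climbing VmRSS confirms a leak rather than a one-off allocation.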
Updated by Rogers, Chris over 10 years ago
- File mem_leak.log added
Running the test script against trunk yields the following leaks from the MAUS namespace (full valgrind log attached):
==28447== Conditional jump or move depends on uninitialised value(s)
==28447==    at 0x10E3FA11: MAUS::VirtualPlaneManager::ConstructVirtualPlanes(BTField const*, MiceModule*) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E393F4: MAUS::MAUSGeant4Manager::MAUSGeant4Manager() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E39530: MAUS::MAUSGeant4Manager::GetInstance() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x10E6A3DD: MAUS::GlobalsManager::InitialiseGlobals(std::string) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==28447==    by 0x1077F1B4: MAUS::PyGlobals::Birth(_object*, _object*) (in /home/cr67/MAUS/maus_online_recon_test_thing/build/maus_cpp/globals.so)
==28447==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==28447==    by 0x4F1FCE4: PyEval_EvalCodeEx (ceval.c:3253)
==28447==    by 0x4EA35EB: function_call (funcobject.c:526)
==28447==    by 0x4E7B7F2: PyObject_Call (abstract.c:2529)
==28447==    by 0x4E88F4E: instancemethod_call (classobject.c:2578)
==28447==    by 0x4E7B7F2: PyObject_Call (abstract.c:2529)
==28447==    by 0x4F18666: PyEval_CallObjectWithKeywords (ceval.c:3882)

==29734== 8 bytes in 1 blocks are possibly lost in loss record 8,130 of 68,074
==29734==    at 0x4C28B35: operator new(unsigned long) (vg_replace_malloc.c:261)
==29734==    by 0x14CF1D05: std::vector<G4VProcess*, std::allocator<G4VProcess*> >::_M_insert_aux(__gnu_cxx::__normal_iterator<G4VProcess**, std::vector<G4VProcess*, std::allocator<G4VProcess*> > >, G4VProcess* const&) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14CEF3D8: G4ProcessManager::InsertAt(int, G4VProcess*, int) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14CF0579: G4ProcessManager::SetProcessOrderingToFirst(G4VProcess*, G4ProcessVectorDoItIndex) (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4procman.so)
==29734==    by 0x14F30A8F: G4VUserPhysicsList::AddTransportation() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F2FA74: G4VModularPhysicsList::ConstructProcess() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x10E353F0: MAUS::MAUSPhysicsList::ConstructProcess() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==29734==    by 0x14F25DC8: G4RunManagerKernel::InitializePhysics() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F1F480: G4RunManager::InitializePhysics() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x14F1F6BF: G4RunManager::Initialize() (in /home/cr67/MAUS/maus_online_recon_test_thing/third_party/build/geant4.9.2.p04/lib/Linux-g++/libG4run.so)
==29734==    by 0x10E393FD: MAUS::MAUSGeant4Manager::MAUSGeant4Manager() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
==29734==    by 0x10E39530: MAUS::MAUSGeant4Manager::GetInstance() (in /home/cr67/MAUS/maus_online_recon_test_thing/build/libMausCpp.so)
Lots of memory leaks in G4 apparently - looks like some sloppiness in the way Rogers runs the Globals... what a moron
Updated by Rogers, Chris over 10 years ago
I have cleaned up G4 memory usage and this will be merged shortly. However, on second look it appears that the MongoDB stuff is not being properly deleted. I have simulated the data taking for the run. My aim is to try to simulate online reconstruction for the entire weekend's running, then set this up as a regression test/load test.
Updated by Rogers, Chris over 10 years ago
Going through test/integration/test_analyze_online_reconstruction/test_analyze_online_reconstruction.py, using a dataset consisting of all data from runs 4956-4997.
Process flow looks like (see the sketch after this list):
1. <head node> take input
2. <head node> hand input to celery workers
3. <celery nodes> do reconstruction
4. <head node> poll celery workers for completed reconstructions
5. <head node> hand completed reconstruction to mongodb
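For orientation, a rough sketch of what the head-node side of this flow amounts to. The method names below echo the description (take_input, submit_spill_to_celery, poll_celery_tasks, doc_store), but the signatures are assumptions for illustration, not the actual MAUS framework API:

# Sketch of the head-node loop only; take_input(), submit_spill_to_celery(),
# poll_celery_tasks() and doc_store.put() signatures are assumed here.
def run_head_node(input_transform, doc_store):
    celery_tasks = []
    for spill in input_transform.take_input():                        # 1. take input
        task = input_transform.submit_spill_to_celery(spill)          # 2. hand spill to a celery worker
        celery_tasks.append(task)                                      # 3. worker reconstructs asynchronously
        for done in input_transform.poll_celery_tasks(celery_tasks):   # 4. poll for finished tasks
            doc_store.put(done)                                        # 5. hand completed reconstruction to mongodb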
Try:
- Break program flow at 2 (commenting framework.input_transform.submit_spill_to_celery) removes the memory leak
- Disable 3 (commenting mappers in analyze_data_online_input_transform.py) leaves the memory leak (expected, this stops reconstruction happening on the child process)
- Break program flow at 4 (commenting the while loop in framework.input_transform.poll_celery_tasks) removes the memory leak
- Break program flow at 5 (commenting self.doc_store.put(...)) leaves the memory leak in place
Updated by Rogers, Chris over 10 years ago
Nailed this down to line 154 of input_transform.py:
self.celery_tasks.pop(current)
Why is that leaking? self.celery_tasks is a list. It doesn't leak if I don't pop the list...
Updated by Rogers, Chris over 10 years ago
It looks like the reconstruction algorithm returns two variables, failed() and successful(). If successful was false and failed was true, the reconstruction never clears the task from the queue. For some reason quite a bit of stuff is failing in this manner. So the memory leak can be fixed, but I would like to know what these two flags actually mean...
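For context, celery's AsyncResult exposes successful() and failed(), which only return true once the task has reached the SUCCESS or FAILURE state respectively; a task that is still pending reports false for both. A minimal sketch of a poll loop that covers all three cases, where the (result, spill_id) queue layout and the doc_store.put() call are assumptions for illustration, not the MAUS source:

# Sketch only: celery_tasks is assumed to be a list of (AsyncResult, spill_id)
# pairs and doc_store.put(...) is assumed from the description above.
def poll_celery_tasks(celery_tasks, doc_store):
    current = 0
    while current < len(celery_tasks):
        result, spill_id = celery_tasks[current]
        if result.successful():               # state == SUCCESS
            doc_store.put(spill_id, result.result)
            celery_tasks.pop(current)         # done, drop it from the queue
        elif result.failed():                 # state == FAILURE
            celery_tasks.pop(current)         # failed, drop it too
        else:                                 # still PENDING/STARTED
            current += 1                      # leave it in the queue and move on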
Updated by Rogers, Chris over 10 years ago
I successfully ran through the whole weekend's worth of data taking. Observations:
- I checked roughly halfway through the data taking. The processes looked to have leaked a little - they were up from 1.5% of memory usage to maybe 2%.
- MongoDB was running at 389% of CPU usage! I was running on an Intel 4/8 core machine.
- Log files report: Spills input: 98174 Processed: 98830 Failed 28277
- bear in mind that this number includes start_of_burst and end_of_burst, i.e. a lot of jobs are failing
I attach full logs and end of run histos.
Updated by Rogers, Chris about 10 years ago
Okay, so this bug is now fixed. It turns out that if the result is not success or fail, it is pending. I now handle that case properly, I think. I am once again getting lots of these bloody "too much data for sort() with no index" errors from Mongo-sodding-DB.
Updated by Rogers, Chris about 10 years ago
I did the sort (by reconstructed timestamp) in native Python rather than using MongoDB calls and now it looks like MongoDB is more stable. I also added tests for most of merge_output.py (but not all; I am running out of time before the upcoming user run). It looks like it is no longer leaking right now...
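A minimal sketch of that kind of change, assuming pymongo; the "spills" collection name and the "rec_timestamp" field are invented for illustration, not the actual MAUS document store schema:

# Sketch only: fetch the documents without a server-side sort() and order them
# in native Python, so MongoDB never needs an index for the sort.
import pymongo

client = pymongo.MongoClient("localhost", 27017)
docs = list(client["maus-new"]["spills"].find())
docs.sort(key=lambda doc: doc["rec_timestamp"])   # sort by reconstructed timestamp client-side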
Updated by Rogers, Chris about 10 years ago
This is looking darker and darker. It looks like a bug in the celery stack. I tried updating to a new version of celery and now I am getting a MySQL error (but I never use MySQL anywhere; the DAQ does, though surely a new version of celery shouldn't screw that up?).
The plan now is to make my own multiprocessing queue (which won't be networked). Celery is just not working out for me.
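A rough sketch of what a non-networked queue could look like, using the standard multiprocessing module; transform_spill is a placeholder for the per-spill reconstruction, not MAUS code:

# Illustrative only: transform_spill stands in for the real per-spill transform.
import multiprocessing

def transform_spill(spill):
    """Placeholder for the per-spill reconstruction step."""
    return spill

def worker(in_queue, out_queue):
    """Pull spills off the input queue, reconstruct, push results back."""
    for spill in iter(in_queue.get, None):    # None is the stop sentinel
        out_queue.put(transform_spill(spill))

def start_workers(n_workers=4):
    in_queue, out_queue = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(in_queue, out_queue))
             for _ in range(n_workers)]
    for proc in procs:
        proc.start()
    return in_queue, out_queue, procs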
Updated by Rogers, Chris about 10 years ago
Sod it. I set the application launcher to just watch memory usage and then kill/restart any process that eats up more memory than it should. By default the maximum memory usage is set to 15% of total.
My preference is still to use my own multiprocessing queue, probably using the existing MongoDB. But life is short... and this solution seems to work. Note that it will not be possible to do a load test against this as the offline/online stuff will be killed.
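A minimal sketch of such a watchdog, assuming the psutil package is available; the restart command and the 10 s polling interval are placeholders, while the 15% threshold matches the default quoted above:

# Watchdog sketch only: restart_cmd and the polling interval are placeholders,
# not the actual MAUS application launcher.
import subprocess
import time

import psutil

MAX_MEM_PERCENT = 15.0   # default maximum memory usage from the comment above

def watch(proc, restart_cmd):
    """Kill and restart proc whenever it exceeds MAX_MEM_PERCENT of total memory."""
    while True:
        if psutil.Process(proc.pid).memory_percent() > MAX_MEM_PERCENT:
            proc.kill()
            proc.wait()
            proc = subprocess.Popen(restart_cmd)
        time.sleep(10)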
Updated by Rogers, Chris about 10 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
Closed in r990
Updated by Rajaram, Durga about 10 years ago
- Target version changed from Future MAUS release to MAUS-v0.7.1