Bug #1489
Still got memory problems
Description
Looks like the fixes done previously were not enough - according to the GRID folks we are still exceeding the memory limit (6 GB) for reconstruction. Better dust off valgrind... I added a valgrind option to tests/integration/test_load/test_load.py
and am running it on the test server http://test.mice.rl.ac.uk/job/MAUS_rogers_load_tests/ (this is in fact running against the trunk).
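For context, the shape of that option is roughly as follows. This is a minimal sketch only, not the actual test_load.py code - the function name and the reconstruction command here are illustrative:

import subprocess

def build_command(run_number, use_valgrind=False):
    # Illustrative sketch - the real option lives in
    # tests/integration/test_load/test_load.py
    cmd = ['python', 'analyze_data_offline.py', '--run-number', str(run_number)]
    if use_valgrind:
        # memcheck reports invalid reads/writes and leaked blocks
        cmd = ['valgrind', '--tool=memcheck', '--log-file=valgrind.log'] + cmd
    return cmd

proc = subprocess.Popen(build_command(4000, use_valgrind=True))
proc.wait()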
Updated by Rajaram, Durga about 9 years ago
- File run_04000_resource_usage.png added
- File run_05679_resource_usage.png added
If I'm interpreting the usage output from the tests on Run 4000 correctly, it looks like the memory usage is now ~1.5 GB:
[{'%mem': '49.8', 'sz': '494163', 'vsz': '1976652', 'pid': 28108, '%cpu': '99.8', 'etime': '03:49:12', 'rsz': '1471184'}]
For the EMR run 5679 it's higher - possibly indicating a leak in the EMR code:
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:40', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:48', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:56', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:52:05', 'rsz': '1921228'}]
On my machine, I saw ~1.3 GB of usage (RSZ) for Run 4000.
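For the units: ps reports rsz and vsz in KiB on Linux, so the conversion to GB is (a quick sanity check, assuming those units):

def kib_to_gb(kib):
    # ps prints rsz/vsz in 1024-byte units on Linux
    return int(kib) * 1024 / 1e9

run_4000 = {'rsz': '1471184', 'vsz': '1976652'}
run_5679 = {'rsz': '1921228', 'vsz': '2431452'}
print(kib_to_gb(run_4000['rsz']))  # ~1.51, the ~1.5 GB quoted above
print(kib_to_gb(run_5679['rsz']))  # ~1.97, noticeably higher for the EMR run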
Updated by Rogers, Chris about 9 years ago
The valgrind load tests were failing because I didn't have valgrind installed on the rogers_sl64 box. I have now installed valgrind and am rerunning with it enabled. Following the comment above, I decided to run valgrind against run 05810 - 11 MB of data, so not zero but not huge, and it should have EMR stuff in there. Let's see if it runs!
Updated by Rogers, Chris about 9 years ago
- File test_load_analyze_data_offline_05810.log added
- File consoleText added
Valgrind log file for run 05810 attached. Note the job ran for 8 hours and was then killed. This was run on the 64-bit rogers_sl64 test server with the command:
./install_build_test.bash ${NODE_THIRD_PARTY}
echo "Run unit test shuffle"
source ${WORKSPACE}/env.sh
echo "Run load tests"
python ${MAUS_ROOT_DIR}/tests/integration/test_load/test_load.py --run-number=05810 --valgrind=1 --max-time=28800 --max-mem-size=320000
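Presumably the --max-time and --max-mem-size limits amount to a watchdog along these lines (a sketch of the idea only, not the actual test_load.py implementation; names and polling interval are made up):

import subprocess
import time

def watch(pid, max_time_s, max_mem_kib):
    # Poll ps and kill the job if it exceeds its time or memory budget
    start = time.time()
    while True:
        try:
            out = subprocess.check_output(['ps', '-o', 'rsz=', '-p', str(pid)])
        except subprocess.CalledProcessError:
            return  # process has exited
        if int(out.decode()) > max_mem_kib or time.time() - start > max_time_s:
            subprocess.call(['kill', str(pid)])
            return
        time.sleep(10)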
Updated by Rogers, Chris about 9 years ago
I did some more digging on this. It looks like PyROOT does not respect the Python reference-counting mechanism. PyROOT is supposed to have an option for object ownership when wrapping real objects, as documented on ROOTTalk:
http://root.cern.ch/phpBB3/viewtopic.php?f=14&t=15782
but as far as I can tell it is not working, i.e. the MAUS::Data objects are not deleted when the reference count of the enclosing object reaches 0. No obvious reason why. I implemented a new python function, maus_cpp.converter.del_data_repr:
del_data_repr(...)
    Free memory owned by MAUS.Data.
    - data_representation: MAUS.Data representation of spill data
    Returns None.
    Raises TypeError if data_representation is of incorrect type.
    Explicit free must be called on MICE data when in the MAUS.Data representation, as the underlying ROOT library does not respect Python reference counting; otherwise a memory leak will ensue. Once freed, accessing the referenced memory will result in a python crash. I'm very sorry about that - I can't see a better way.
Looks like this resolves most of the memory leak. The JsonCppStreamer is also leaking a MICEData; this must be a pre-existing leak. I will clean up and push the changes outlined in this tracker and see how we are doing. It is always dark when using ROOT...
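For anyone hitting this later, the intended call pattern is something like the following. Only maus_cpp.converter.del_data_repr is the real API described above; the surrounding skeleton is illustrative:

import maus_cpp.converter

def process_spill(data):
    # 'data' is a spill in the MAUS.Data representation (how it was
    # obtained from the converter is not shown here)
    try:
        pass  # ... analysis of 'data' goes here ...
    finally:
        # explicit free - ROOT does not respect Python reference counts
        maus_cpp.converter.del_data_repr(data)
        # the wrapper now points at freed memory; touching it again
        # will crash python, so drop all references to it immediately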
Updated by Rogers, Chris about 9 years ago
Note to self:
- Need to fix some stuff in OutputCppRoot
- Need to check multiprocessing routines - nothing was changed there yet
Updated by Rogers, Chris about 9 years ago
I added some delete calls to OutputCppRoot. I think they are in the right place, but it is hard to know - ROOT is not very good at communicating who owns what memory (so probably the ROOT devs do not know either).
All unit tests passed on the test server, but some integration tests failed. I think the data structure is mishandling a delete call. I saw a stack trace like:
===========================================================
#5  0x00000000000003f1 in ?? ()
#6  0x00002b988a2622d1 in MAUS::DAQData::SetTriggerRequestArray () from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#7  0x00002b988a263ace in MAUS::DAQData::~DAQData () from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#8  0x00002b988a457218 in MAUS::Spill::~Spill () from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#9  0x00002b988a44d528 in MAUS::Data::~Data () from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#10 0x00002b988c8d471f in MAUS::PyConverter::py_del_data_repr () from /scratch/jenkins/workspace/MAUS_rogers/build/maus_cpp/converter.so
#11 0x00002b9878392eca in PyEval_EvalFrameEx (f=0x1494f4c0, <etc, python call stack>
Updated by Rogers, Chris about 9 years ago
- File valgrind.log added
I attach a valgrind log. Let me explain.
I put some code in the top-level Data object's constructor and destructor to print out: whether a Data was constructed or deleted; the address of the Data; the total number of Data objects in memory; and the stack trace of the constructor or destructor call. The diff is below, in case that helps.
There is a whole load of stuff like:
==12062== Invalid read of size 8
==12062==    at 0x10D4ED0C: MAUS::DAQData::SetTriggerRequestArray(std::vector<MAUS::TriggerRequest*, std::allocator<MAUS::TriggerRequest*> >) (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F28B: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F788: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D57567: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D576E8: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60ACA: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60B88: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x1A122EBB: MAUS::PyConverter::py_del_data_repr(_object*, _object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062==    by 0x4F1FCE4: PyEval_EvalCodeEx (ceval.c:3253)
==12062==  Address 0x377918b0 is 0 bytes inside a block of size 32 free'd
==12062==    at 0x4C27FF2: operator delete(void*) (vg_replace_malloc.c:387)
==12062==    by 0x10D4ED11: MAUS::DAQData::SetTriggerRequestArray(std::vector<MAUS::TriggerRequest*, std::allocator<MAUS::TriggerRequest*> >) (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F28B: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F788: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D57567: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D576E8: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60ACA: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60B88: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x1A129260: std::string* MAUS::PyObjectWrapper::unwrap<std::string>(_object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x1A129B19: _object* MAUS::PyConverter::py_wrap<std::string>(_object*, _object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062==
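As an aside, a quick way to tally how many distinct sites report like this is something along these lines (a triage sketch, assuming the standard memcheck text format):

import collections

def summarize(path):
    # Tally 'Invalid read/write' reports by their top stack frame
    counts = collections.Counter()
    expect_frame = False
    with open(path) as log:
        for line in log:
            if 'Invalid read' in line or 'Invalid write' in line:
                expect_frame = True  # the next 'at' line names the site
            elif expect_frame and ' at 0x' in line:
                counts[line.split(':', 1)[1].strip()] += 1
                expect_frame = False
    return counts

print(summarize('valgrind.log').most_common(5))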
I think that the MAUS::Data involved is okay (it doesn't look like a double free there) so probably there is some lurking wickedness in the copy constructor for Spill or DAQData. I shall dig tomorrow.
=== modified file 'src/common_cpp/DataStructure/Data.cc'
--- src/common_cpp/DataStructure/Data.cc	2012-09-27 10:59:51 +0000
+++ src/common_cpp/DataStructure/Data.cc	2014-07-24 16:11:03 +0000
@@ -21,15 +21,20 @@
 namespace MAUS {
 
+int Data::_reference_count = 0;
+
 Data::Data() : MAUSEvent<Spill>("Spill"), _spill(NULL) {
+    IncreaseRefCount();
 }
 
 Data::Data(const Data& data) : MAUSEvent<Spill>(), _spill(NULL) {
+    IncreaseRefCount();
     *this = data;
 }
 
 Data::~Data() {
-    if (_spill != NULL) {
+    DecreaseRefCount();
+    if (_spill != NULL) {
         delete _spill;
         _spill = NULL;
     }
@@ -64,5 +69,17 @@
     Data spill;
     return sizeof(spill);
 }
+
+void Data::IncreaseRefCount() {
+    _reference_count++;
+    std::cerr << "CONSTRUCT " << this << " " << _reference_count << std::endl;
+    std::cerr << MAUS::Exception().MakeStackTrace(1) << std::endl;
+}
+
+void Data::DecreaseRefCount() {
+    _reference_count--;
+    std::cerr << "DESTRUCT " << this << " " << _reference_count << std::endl;
+    std::cerr << MAUS::Exception().MakeStackTrace(1) << std::endl;
+}
 }

=== modified file 'src/common_cpp/DataStructure/Data.hh'
--- src/common_cpp/DataStructure/Data.hh	2014-04-22 15:27:03 +0000
+++ src/common_cpp/DataStructure/Data.hh	2014-07-24 14:03:56 +0000
@@ -89,8 +89,11 @@
     void SetEvent(Spill* spill) {SetSpill(spill);}
 
   private:
+    void IncreaseRefCount();
+    void DecreaseRefCount();
     Spill* _spill;
     std::string _event_type;
+    static int _reference_count;
 
     MAUS_VERSIONED_CLASS_DEF(Data)
 };
 }
Updated by Rajaram, Durga about 9 years ago
Thanks for the fixes. But it fails on EMR runs [ log attached: test_load_analyze_data_offline_05679.log ].
Looks like it's in the EMREvent destructor - the pointer's already gone when it tries to delete.
Is there a double free going on with the last event, with both single_thread.py and MapPyGroup.py trying to delete it?
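If both layers really are freeing the same event, one cheap guard would be to make the free idempotent at the Python level - a sketch of the idea only, not what was pushed to the trunk:

import maus_cpp.converter

_freed = set()  # ids of wrappers already freed

def free_once(data):
    # Guard against two callers (e.g. single_thread.py and MapPyGroup.py)
    # both freeing the same event. Caveat: id() values can be reused once
    # a wrapper is itself garbage collected, so this only holds while the
    # wrappers stay alive - an illustration, not production code.
    if id(data) not in _freed:
        maus_cpp.converter.del_data_repr(data)
        _freed.add(id(data))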
Updated by Rajaram, Durga about 9 years ago
Moved the EMREvent deletes from the destructor to a setter and pushed to the trunk.
Also, I realized that run 5810 used in rogers_load_tests is not an EMR run.
MAUS_data_load on the test server, run with 5468 [ EMR run ]:
[{'%mem': '14.9', 'sz': '732637', 'vsz': '2930548', 'pid': 26653, '%cpu': '99.8', 'etime': '08:00:02', 'rsz': '2449036'}]
This is roughly similar to the non-EMR run 5810:
[{'%mem': '14.8', 'sz': '725051', 'vsz': '2900204', 'pid': 21252, '%cpu': '99.9', 'etime': '07:59:57', 'rsz': '2438812'}]
Both of the above were run under valgrind, which probably eats up some memory.
Run 4000, without valgrind, gives:
[{'%mem': '8.5', 'sz': '460679', 'vsz': '1842716', 'pid': 12080, '%cpu': '99.9', 'etime': '06:46:56', 'rsz': '1411780'}]
I've now modified the trunk to use a short EMR run [ 05466.001 ] as the default [ src/input/InputCppDAQData/05466.001 ], instead of the ancient run 2873 that was the default in MAUS, so the standard test suite should now catch failures from all the enabled detectors.
Updated by Rogers, Chris almost 9 years ago
- File memory.log added
I had another look at this, e.g. run 04000 running against the trunk:
[{'%mem': '3.3', 'sz': '260621', 'vsz': '1042484', 'pid': 14158, '%cpu': '97.7', 'etime': '03:13', 'rsz': '548352'}]
... 4 hours later ...
[{'%mem': '4.2', 'sz': '294966', 'vsz': '1179864', 'pid': 14158, '%cpu': '99.8', 'etime': '04:44:17', 'rsz': '687640'}]
There is some 300 MB lost during this reconstruction, but that is not such a serious leak compared to the ~2000 MB program size. I also checked reference counting against Spill by hard-coding a reference counter into the Spill constructor and destructor; results are in memory.log. The number of Spill instances in memory after running 100 spills is 1, so I think it is okay to close this issue. Any more memory leaks will have to be knocked off one by one.
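For the record, the check recorded in memory.log amounts to the following pattern; here it is as a pure-python analogue of the counter hard-coded into Spill (illustration only):

import sys

class Counted(object):
    # count live instances, like the counter patched into Spill above
    _live = 0

    def __init__(self):
        Counted._live += 1
        sys.stderr.write('CONSTRUCT %s %d\n' % (hex(id(self)), Counted._live))

    def __del__(self):
        Counted._live -= 1
        sys.stderr.write('DESTRUCT %s %d\n' % (hex(id(self)), Counted._live))

spill = None
for _ in range(100):
    spill = Counted()  # rebinding drops the reference to the previous spill
print(Counted._live)   # 1 - only the last spill is still alive, as in memory.log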
Updated by Rogers, Chris almost 9 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100