Bug #1489

Still got memory problems

Added by Rogers, Chris over 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: High
Assignee:
Category: Data Structure
Target version:
Start date: 19 June 2014
Due date:
% Done: 100%
Estimated time:
Workflow: New Issue

Description

Looks like the fixes done previously were not enough - we still are exceeding the memory limit (6GB) for reconstruction according to GRID folks. Better dust off valgrind... I added valgrind option to tests/integration/test_load/test_load.py and am running on the test server http://test.mice.rl.ac.uk/job/MAUS_rogers_load_tests/ (this is running against the trunk in fact).
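The `--valgrind` option added to test_load.py might look something like the sketch below: build the reconstruction command, and optionally prefix it with valgrind. This is a hypothetical reconstruction, not the actual test_load.py code; the `build_command` helper and its arguments are illustrative.

```python
def build_command(script, run_number, valgrind=False, log="valgrind.log"):
    """Return the argv list for the load-test process (illustrative)."""
    cmd = ["python", script, "--run-number=%s" % run_number]
    if valgrind:
        # --leak-check=full makes memcheck report each leaked block with a
        # stack trace; --log-file keeps the report out of the job console
        cmd = ["valgrind", "--leak-check=full",
               "--log-file=%s" % log] + cmd
    return cmd

print(build_command("test_load.py", "05810", valgrind=True))
```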


Files

run_04000_resource_usage.png (6.74 KB) Rajaram, Durga, 16 July 2014 04:39
run_05679_resource_usage.png (6.82 KB) Rajaram, Durga, 16 July 2014 04:39
test_load_analyze_data_offline_05810.log (10.9 MB) Rogers, Chris, 17 July 2014 05:25
consoleText (2.2 MB) Rogers, Chris, 17 July 2014 05:25
stripped_05810.log (4.63 MB) stripped out gumph, sorted for leak size (leakiest first) Rogers, Chris, 17 July 2014 06:09
valgrind.log (612 KB) Rogers, Chris, 24 July 2014 17:34
test_load_analyze_data_offline_05679.log (360 KB) Rajaram, Durga, 31 July 2014 04:06
memory.log (38.7 KB) Rogers, Chris, 07 October 2014 06:12
#1

Updated by Rajaram, Durga about 7 years ago

If I'm interpreting the usage output from the tests on Run 4000 correctly, it looks like the memory usage is now ~1.5 GB:

[{'%mem': '49.8', 'sz': '494163', 'vsz': '1976652', 'pid': 28108, '%cpu': '99.8', 'etime': '03:49:12', 'rsz': '1471184'}]

For the EMR run 5679, it's higher -- possibly indicating a leak in the EMR code:

[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:40', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:48', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:51:56', 'rsz': '1921228'}]
[{'%mem': '65.0', 'sz': '607863', 'vsz': '2431452', 'pid': 24692, '%cpu': '99.5', 'etime': '04:52:05', 'rsz': '1921228'}]

On my machine, I saw ~1.3 GB of usage (RSZ) for Run 4000.
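The resource-usage dicts quoted above could be produced by polling ps and zipping the header row with the value row, something like the sketch below. This `parse_ps` helper is illustrative, not MAUS code, and it keeps every field as a string (the real logs show pid as an int).

```python
def parse_ps(output):
    """Zip the ps header row with each value row into a list of dicts."""
    lines = output.strip().splitlines()
    keys = lines[0].split()
    return [dict(zip(keys, line.split())) for line in lines[1:]]

# sample output in the format of `ps -o %mem,sz,vsz,pid,%cpu,etime,rsz`
sample = """%mem sz vsz pid %cpu etime rsz
49.8 494163 1976652 28108 99.8 03:49:12 1471184"""
usage = parse_ps(sample)
print(usage[0]["rsz"])  # 1471184
```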

#2

Updated by Rogers, Chris about 7 years ago

The valgrind load tests were failing because I didn't have valgrind installed on the rogers_sl64 box. I have now installed valgrind and am rerunning with it enabled. Following the comment above, I decided to run valgrind against run 05810 - 11 MB of data, so not zero but not huge, and it should have EMR stuff in there. Let's see if it runs!

#3

Updated by Rogers, Chris about 7 years ago

The valgrind log file for run 05810 is attached. Note the job ran for 8 hours and was then killed. This was run on the 64-bit rogers_sl64 test server with the commands:

./install_build_test.bash ${NODE_THIRD_PARTY}
echo "Run unit test shuffle" 
source ${WORKSPACE}/env.sh
echo "Run load tests" 
python ${MAUS_ROOT_DIR}/tests/integration/test_load/test_load.py --run-number=05810 --valgrind=1 --max-time=28800 --max-mem-size=320000
#5

Updated by Rogers, Chris about 7 years ago

I did some more digging on this. It looks like PyROOT does not respect the Python reference counting mechanism. PyROOT is supposed to have an option for this when wrapping real objects, as documented on ROOTTalk:

http://root.cern.ch/phpBB3/viewtopic.php?f=14&t=15782

but as far as I can tell it is not working, i.e. the MAUS::Data objects are not deleted when the reference count of the enclosing object reaches 0. No obvious reason why. I implemented a new Python function, maus_cpp.converter.del_data_repr:

    del_data_repr(...)
        Free memory owned by MAUS.Data
          - data_representation: MAUS.Data representation of spill data
        Returns None
        Raises TypeError if data_representation is of incorrect type

        Explicit free must be called on MICE data when in MAUS.Data
        representation as underlying ROOT library does not respect
        Python reference counting. Otherwise a memory leak will 
        ensue. Once freed, calling the referenced memory will result
        in a python crash. I'm very sorry about that, I can't see a
        better way.

Looks like this resolves most of the memory leak. The JsonCppStreamer is also leaking a MICEData; this must be a pre-existing leak. I will clean up and push the changes outlined in this tracker and see how we are doing. It is always dark when using ROOT...
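The explicit-free discipline that del_data_repr imposes can be sketched in plain Python: free exactly once, after the last use, on every code path. The `process_spill` helper and `free_func` argument below are illustrative stand-ins for caller code and maus_cpp.converter.del_data_repr.

```python
def process_spill(data_representation, free_func):
    """Analyse one spill, then explicitly free its MAUS.Data representation."""
    try:
        pass  # ... reconstruction / analysis would go here ...
    finally:
        # free exactly once, after the last use; touching the
        # representation after this point would hit freed memory
        free_func(data_representation)

freed = []
process_spill("spill_0", freed.append)
print(freed)  # ['spill_0']
```

A try/finally keeps the free on every path, including exceptions raised mid-analysis.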

#6

Updated by Rogers, Chris about 7 years ago

Note to self:

  • Need to fix some stuff in OutputCppRoot
  • Need to check multiprocessing routines - nothing was changed there yet
#7

Updated by Rogers, Chris about 7 years ago

I added some delete calls to OutputCppRoot. I think they are in the right place, but it is hard to know - ROOT is not very good at communicating who owns what memory (so probably the ROOT devs do not know either).

All unit tests passed on the test server, but some integration tests failed. I think the data structure is mishandling a delete call. I saw a stack trace like:

===========================================================
#5  0x00000000000003f1 in ?? ()
#6  0x00002b988a2622d1 in MAUS::DAQData::SetTriggerRequestArray ()
   from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#7  0x00002b988a263ace in MAUS::DAQData::~DAQData ()
   from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#8  0x00002b988a457218 in MAUS::Spill::~Spill ()
   from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#9  0x00002b988a44d528 in MAUS::Data::~Data ()
   from /scratch/jenkins/workspace/MAUS_rogers/build/libMausCpp.so
#10 0x00002b988c8d471f in MAUS::PyConverter::py_del_data_repr ()
   from /scratch/jenkins/workspace/MAUS_rogers/build/maus_cpp/converter.so
#11 0x00002b9878392eca in PyEval_EvalFrameEx (f=0x1494f4c0, 
<etc, python call stack>
#8

Updated by Rogers, Chris about 7 years ago

I attach a valgrind log. Let me explain.

I put some code in the top-level Data object constructor and destructor to print out: whether a Data was constructed or deleted; the address of the Data; the total number of Data objects in memory; and the stack trace of the construct or delete call. The diff is below, in case that helps.

There is a whole load of stuff like:

==12062== Invalid read of size 8
==12062==    at 0x10D4ED0C: MAUS::DAQData::SetTriggerRequestArray(std::vector<MAUS::TriggerRequest*, std::allocator<MAUS::TriggerRequest*> >) (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F28B: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F788: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D57567: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D576E8: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60ACA: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60B88: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x1A122EBB: MAUS::PyConverter::py_del_data_repr(_object*, _object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062==    by 0x4F1FCE4: PyEval_EvalCodeEx (ceval.c:3253)
==12062==  Address 0x377918b0 is 0 bytes inside a block of size 32 free'd
==12062==    at 0x4C27FF2: operator delete(void*) (vg_replace_malloc.c:387)
==12062==    by 0x10D4ED11: MAUS::DAQData::SetTriggerRequestArray(std::vector<MAUS::TriggerRequest*, std::allocator<MAUS::TriggerRequest*> >) (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F28B: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D4F788: MAUS::DAQData::~DAQData() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D57567: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D576E8: MAUS::Spill::~Spill() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60ACA: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x10D60B88: MAUS::Data::~Data() (in /home/cr67/MAUS/maus_merge/build/libMausCpp.so)
==12062==    by 0x1A129260: std::string* MAUS::PyObjectWrapper::unwrap<std::string>(_object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x1A129B19: _object* MAUS::PyConverter::py_wrap<std::string>(_object*, _object*) (in /home/cr67/MAUS/maus_merge/build/maus_cpp/converter.so)
==12062==    by 0x4F1E51C: PyEval_EvalFrameEx (ceval.c:4013)
==12062==    by 0x4F1EEE9: PyEval_EvalFrameEx (ceval.c:4099)
==12062== 

I think the MAUS::Data involved is okay (it doesn't look like a double free there), so probably there is some lurking wickedness in the copy constructor for Spill or DAQData. I shall dig tomorrow.
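The valgrind "Invalid read ... inside a block ... free'd" pattern above is the classic symptom of two objects sharing pointers after a shallow copy: the first destructor frees the block, and the second then reads it. A toy Python tracker (purely illustrative - the Heap class and addresses below are made up) shows the shape of the failure:

```python
class Heap:
    """Toy allocation tracker that flags reads of freed blocks."""
    def __init__(self):
        self.live = set()
    def alloc(self, addr):
        self.live.add(addr)
    def free(self, addr):
        self.live.discard(addr)
    def read(self, addr):
        if addr not in self.live:
            raise RuntimeError("invalid read at %#x" % addr)

heap = Heap()
heap.alloc(0x377918b0)

original = [0x377918b0]   # e.g. one object's trigger-request pointers
shallow = list(original)  # a shallow copy shares the same addresses

heap.free(shallow[0])       # first destructor frees the block
try:
    heap.read(original[0])  # second destructor reads freed memory
except RuntimeError as e:
    print(e)  # invalid read at 0x377918b0
```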

== modified file 'src/common_cpp/DataStructure/Data.cc'
--- src/common_cpp/DataStructure/Data.cc    2012-09-27 10:59:51 +0000
+++ src/common_cpp/DataStructure/Data.cc    2014-07-24 16:11:03 +0000
@@ -21,15 +21,20 @@

 namespace MAUS {

+int Data::_reference_count = 0;
+
 Data::Data() : MAUSEvent<Spill>("Spill"), _spill(NULL) {
+    IncreaseRefCount();
 }

 Data::Data(const Data& data) : MAUSEvent<Spill>(), _spill(NULL) {
+    IncreaseRefCount();
     *this = data;
 }

 Data::~Data() {
-    if (_spill != NULL) {
+   DecreaseRefCount();
+   if (_spill != NULL) {
         delete _spill;
         _spill = NULL;
     }
@@ -64,5 +69,17 @@
   Data spill;
   return sizeof(spill);
 }
+
+void Data::IncreaseRefCount() {
+    _reference_count++;
+    std::cerr << "CONSTRUCT " << this << " " << _reference_count << std::endl;
+    std::cerr << MAUS::Exception().MakeStackTrace(1) << std::endl;
+}
+
+void Data::DecreaseRefCount() {
+    _reference_count--;
+    std::cerr << "DESTRUCT " << this << " " << _reference_count << std::endl;
+    std::cerr << MAUS::Exception().MakeStackTrace(1) << std::endl;
+}
 }

=== modified file 'src/common_cpp/DataStructure/Data.hh'
--- src/common_cpp/DataStructure/Data.hh    2014-04-22 15:27:03 +0000
+++ src/common_cpp/DataStructure/Data.hh    2014-07-24 14:03:56 +0000
@@ -89,8 +89,11 @@
     void SetEvent(Spill* spill) {SetSpill(spill);}

   private:
+    void IncreaseRefCount();
+    void DecreaseRefCount();
     Spill* _spill;
     std::string _event_type;
+    static int _reference_count;
     MAUS_VERSIONED_CLASS_DEF(Data)
 };
 }
#9

Updated by Rajaram, Durga about 7 years ago

Thanks for the fixes. But it fails on EMR runs [ log attached: test_load_analyze_data_offline_05679.log ].
Looks like it's in the EMREvent destructor - the pointer's already gone when it tries to delete.
Is there a double free going on with the last event, with both single_thread.py and MapPyGroup.py trying to delete it?

#10

Updated by Rajaram, Durga about 7 years ago

Moved the EMREvent deletes from the destructor to a setter and pushed to the trunk.
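The fix described above - moving the delete out of the destructor into a setter that frees the old value before taking ownership of the new one - can be sketched in Python. The `EMREvent` class and `set_planes` method here are illustrative stand-ins, not the actual MAUS API; the `freed` list stands in for C++ deletes.

```python
class EMREvent:
    """Toy sketch: the setter, not the destructor, frees the old array."""
    def __init__(self):
        self._planes = None
        self.freed = []  # records what would be deleted in C++

    def set_planes(self, planes):
        # free the previous array exactly once, here, before replacing it
        if self._planes is not None:
            self.freed.append(self._planes)
        self._planes = planes

ev = EMREvent()
ev.set_planes(["plane0"])
ev.set_planes(["plane1"])  # frees ["plane0"] before taking ownership
print(ev.freed)  # [['plane0']]
```

With the delete in one place, a second deleter (destructor or otherwise) can no longer free the same pointer again.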

Also, I realized that run 5810 used in rogers_load_tests is not an EMR run.

MAUS_data_load on test server run with 5468 [ EMR run ]

[{'%mem': '14.9', 'sz': '732637', 'vsz': '2930548', 'pid': 26653, '%cpu': '99.8', 'etime': '08:00:02', 'rsz': '2449036'}]

~ similar to the non-EMR run 5810:

[{'%mem': '14.8', 'sz': '725051', 'vsz': '2900204', 'pid': 21252, '%cpu': '99.9', 'etime': '07:59:57', 'rsz': '2438812'}]

Both the above were run with valgrind which probably eats up some memory.

Run 4000, without valgrind is:

[{'%mem': '8.5', 'sz': '460679', 'vsz': '1842716', 'pid': 12080, '%cpu': '99.9', 'etime': '06:46:56', 'rsz': '1411780'}]

I've now modified the trunk to use a short EMR run [ 05466.001 ] as the default instead of the ancient 2873 that is the default in MAUS [ src/input/InputCppDAQData/05466.001 ], so the standard test suite should now catch failures from all the enabled detectors.

#11

Updated by Rogers, Chris almost 7 years ago

I had another look at this, e.g. run 04000 running against the trunk:

[{'%mem': '3.3', 'sz': '260621', 'vsz': '1042484', 'pid': 14158, '%cpu': '97.7', 'etime': '03:13', 'rsz': '548352'}]
... 4 hours later ...
[{'%mem': '4.2', 'sz': '294966', 'vsz': '1179864', 'pid': 14158, '%cpu': '99.8', 'etime': '04:44:17', 'rsz': '687640'}]

There is some 300 MB lost during this reconstruction, but that is not such a serious leak compared to the ~2000 MB program size. I also checked reference counting against Spill by hard-coding a reference counter into the Spill constructor and destructor; results are in memory.log - the number of Spill instances after running 100 spills is 1. So I think it is okay to close this issue. Any more memory leaks will have to be knocked off one by one.

#12

Updated by Rogers, Chris almost 7 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100
