Bug #1706

Memory leaks preventing GRID running

Added by Dobbs, Adam about 6 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Code Management
Target version:
Start date:
09 July 2015
Due date:
% Done:

100%

Estimated time:
Workflow:
New Issue

Description

The block on GRID running is now the memory leaks in MAUS. Track progress here.


Files


Related issues

Related to MAUS - Bug #1704: Memory Leak (Closed; Rogers, Chris; 03 July 2015)

#1

Updated by Dobbs, Adam about 6 years ago

  • Target version set to MAUS-v0.9.8

Valgrind summary for run 07157.000, first 6 DAQ events. Command used:

valgrind --leak-check=full --track-origins=yes --log-file=valgrind_run7157_pid%p.txt python bin/analyze_data_offline.py --daq_data_path /vols/fets2/adobbs/data/07157/ --daq_data_file 07157.000  --simulation_geometry_filename "Stage4.dat" --SciFiDigitizationNPECut 2.0 --SciFiPRHelicalOn False --Number_of_DAQ_Events 6

Running with MAUS 0.9.7 gives:

==6057== LEAK SUMMARY:
==6057==    definitely lost: 315,896 bytes in 2,418 blocks
==6057==    indirectly lost: 2,513,710 bytes in 5,016 blocks
==6057==      possibly lost: 15,989,178 bytes in 200,549 blocks
==6057==    still reachable: 197,706,302 bytes in 3,509,092 blocks
==6057==         suppressed: 0 bytes in 0 blocks
==6057== Reachable blocks (those to which a pointer was found) are not shown.
==6057== To see them, rerun with: --leak-check=full --show-reachable=yes
==6057== 
==6057== For counts of detected and suppressed errors, rerun with: -v
==6057== ERROR SUMMARY: 319862 errors from 6108 contexts (suppressed: 11258 from 31)

Compare with Durga's mapcpp:

==5986== LEAK SUMMARY:
==5986==    definitely lost: 300,424 bytes in 1,667 blocks
==5986==    indirectly lost: 3,054,754 bytes in 8,762 blocks
==5986==      possibly lost: 15,772,745 bytes in 200,536 blocks
==5986==    still reachable: 196,675,895 bytes in 3,509,338 blocks
==5986==         suppressed: 0 bytes in 0 blocks
==5986== Reachable blocks (those to which a pointer was found) are not shown.
==5986== To see them, rerun with: --leak-check=full --show-reachable=yes
==5986== 
==5986== For counts of detected and suppressed errors, rerun with: -v
==5986== ERROR SUMMARY: 249101 errors from 5915 contexts (suppressed: 11817 from 31)

The two leak summaries are very similar.

#2

Updated by Dobbs, Adam about 6 years ago

Full logs attached too. Trunk now being examined (contains Chris' fixes)...

#3

Updated by Dobbs, Adam about 6 years ago

... and using the trunk:

==12865== LEAK SUMMARY:
==12865==    definitely lost: 316,040 bytes in 2,407 blocks
==12865==    indirectly lost: 1,987,047 bytes in 4,353 blocks
==12865==      possibly lost: 15,738,687 bytes in 195,328 blocks
==12865==    still reachable: 196,581,186 bytes in 3,508,466 blocks
==12865==         suppressed: 0 bytes in 0 blocks
==12865== Reachable blocks (those to which a pointer was found) are not shown.
==12865== To see them, rerun with: --leak-check=full --show-reachable=yes
==12865== 
==12865== For counts of detected and suppressed errors, rerun with: -v
==12865== ERROR SUMMARY: 318199 errors from 6092 contexts (suppressed: 11093 from 31)

#4

Updated by Dobbs, Adam about 6 years ago

Update: Added debug flags to the build and reran the valgrind job for the trunk (the valgrind output now has line numbers).

#5

Updated by Dobbs, Adam about 6 years ago

Seeing some very unusual valgrind results for SciFi. The copy constructor of SciFiTrack is registering as causing definite memory leaks on these lines:

  _trackpoints.resize(a_track._trackpoints.size());
  for (size_t i = 0; i < a_track._trackpoints.size(); ++i) {
    _trackpoints[i] = new SciFiTrackPoint(*a_track._trackpoints[i]);
  }

where a_track is the original track, i.e. a deep copy of the track points. Yet the destructor seems to take care of this:

SciFiTrack::~SciFiTrack() {
  // Delete track points in this track.
  std::vector<SciFiTrackPoint*>::iterator track_point;
  for (track_point = _trackpoints.begin();
       track_point!= _trackpoints.end(); ++track_point) {
    delete (*track_point);
  }
}

This implies to me that either the destructor is not getting called somewhere, or the trackpoints themselves have an issue.
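One generic way for a deep-copying constructor to show up as the allocation site of a definite loss, even when the destructor deletes every element, is a missing or shallow assignment operator (the "rule of three"). The thread does not say whether SciFiTrack defines operator=, so this is only a sketch of the pattern to check for, not a diagnosis. The class names Point and Track and the helper function below are hypothetical stand-ins, not MAUS code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for SciFiTrackPoint; counts live instances so a leak
// shows up as a non-zero count.
struct Point {
  static int live;
  Point() { ++live; }
  Point(const Point&) { ++live; }
  ~Point() { --live; }
};
int Point::live = 0;

// Hypothetical stand-in for SciFiTrack: owns its points through raw pointers.
class Track {
 public:
  Track() {}
  Track(const Track& other) { copy_from(other); }
  // Without this operator the compiler-generated one copies the raw pointers:
  // the points previously held are never deleted, and the copied pointers end
  // up deleted twice.
  Track& operator=(const Track& other) {
    if (this != &other) {
      clear();
      copy_from(other);
    }
    return *this;
  }
  ~Track() { clear(); }

  void add_point() { _points.push_back(new Point()); }
  std::size_t size() const { return _points.size(); }

 private:
  // Delete every owned point and empty the vector.
  void clear() {
    for (std::size_t i = 0; i < _points.size(); ++i) delete _points[i];
    _points.clear();
  }
  // Deep copy; the caller must have released any existing points first.
  void copy_from(const Track& other) {
    assert(_points.empty());
    _points.resize(other._points.size());
    for (std::size_t i = 0; i < other._points.size(); ++i) {
      _points[i] = new Point(*other._points[i]);
    }
  }
  std::vector<Point*> _points;
};

// Exercises copy construction and assignment, then reports how many Point
// objects are still alive once every Track has gone out of scope.
int leaked_points_after_copy_and_assign() {
  {
    Track a;
    a.add_point();
    a.add_point();
    Track b(a);  // deep copy
    Track c;
    c.add_point();
    c = a;       // frees c's own point, then deep copies
  }
  return Point::live;
}
```

If removing the user-defined operator= from a class like this makes the live count non-zero (or triggers a double delete), the same check is worth making on the real class and on any container that stores tracks by value.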

Indeed valgrind does complain about the trackpoints in a few areas. First the default constructor on the line:

_cluster = new TRef();

which valgrind claims causes a definite loss. Yet the destructor looks fine for this:

SciFiTrackPoint::~SciFiTrackPoint() {
  delete _cluster;
}

Bit baffled on that one. The copy constructor too:

SciFiTrackPoint::SciFiTrackPoint(const SciFiTrackPoint &point) {
  _spill = point.spill();
  _event = point.event();

  _tracker = point.tracker();
  _station = point.station();
  _plane   = point.plane();
  _channel = point.channel();

  _chi2 = point.chi2();

  _pos = point.pos();
  _mom = point.mom();

  _covariance = point._covariance;
  _errors = point._errors;

  _pull              = point.pull();
  _residual          = point.residual();
  _smoothed_residual = point.smoothed_residual();

  _covariance = point.covariance();
  _cluster = new TRef(*point.get_cluster());
}

The last line shows as a possible loss, an indirect loss and still reachable, as do the first error and covariance setting lines (note the covariance is set twice, which is unnecessary).

Finally, the clusters themselves have one valgrind issue in the default constructor:

_digits = new TRefArray();

As before this looks like it should be taken care of by the destructor:

SciFiCluster::~SciFiCluster() {
  delete _digits;
}

Odd, it seems either valgrind is confused by TRefArrays, or we are.

Any thoughts? It is not apparent to me how to fix this.

#6

Updated by Dobbs, Adam about 6 years ago

Added Chris Hunt as a watcher. Chris H, could you take a look too? Thanks...

#7

Updated by Dobbs, Adam about 6 years ago

Plot attached of peak memory used by MAUS trunk (pre-speedup) as a function of the total number of DAQ events specified in the datacard (each data point represents a full input-merge-output-terminate cycle of MAUS).

#8

Updated by Dobbs, Adam about 6 years ago

Repeated the analysis using today's trunk with the MAUS speedup added. Results attached (I believe I unhelpfully changed my unit of memory from MB (base 10) to MiB (base 2) between the two studies... makes about 5% difference). The speedup version looks a lot more predictable and performance seems improved over old MAUS. After a while there is a nice linear relationship, giving ~205 kiB memory loss per DAQ event. This is the figure we need to improve.
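For reference, the size of the MB versus MiB mismatch can be checked with a few lines of arithmetic. This is a minimal sketch; the constant and function names are made up for illustration.

```cpp
#include <cassert>

// Bytes per unit: MB is decimal (10^6 bytes), MiB is binary (2^20 bytes).
const double kBytesPerMB = 1000.0 * 1000.0;
const double kBytesPerMiB = 1024.0 * 1024.0;

// Convert a size in MiB to the equivalent size in MB.
double mib_to_mb(double mib) {
  assert(mib >= 0.0);
  return mib * kBytesPerMiB / kBytesPerMB;
}

// How much larger one MiB is than one MB, in percent.
double mib_vs_mb_percent() {
  return (kBytesPerMiB / kBytesPerMB - 1.0) * 100.0;
}
```

The ratio is 1.048576, so a value quoted in MiB is just under 5% larger when expressed in MB, consistent with the estimate above.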

#9

Updated by Rajaram, Durga about 6 years ago

Reran valgrind and there does seem to be a leak from SciFiTrack as pointed out above, but I don't see anything obvious in the code. BTW there's a TRefArray related issue open, #1663 -- related?

#10

Updated by Dobbs, Adam about 6 years ago

Possibly - it may be that the TRefArrays are not flushing, will try to look into it.

Have calculated how much memory MAUS uses as a function of the number of complete (i.e. approx 250 MiB) DAQ files, assuming the conditions present in run 7157. Results attached. This gives the relationship:

mem_used[MiB] = 466.5[MiB] + 213.8*nfiles

Hence for run 7157, consisting of 28 files, we would need a little under 6.5 GiB of available RAM.
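The estimate can be reproduced directly from the fitted relationship. The sketch below only evaluates the formula quoted above; the function and constant names are made up for illustration.

```cpp
#include <cassert>

// Fit parameters from the relationship quoted above.
const double kBaseMiB = 466.5;     // constant term, MiB
const double kPerFileMiB = 213.8;  // growth per complete DAQ file, MiB

// Memory used (MiB) after processing nfiles complete DAQ files.
double mem_used_mib(int nfiles) {
  assert(nfiles >= 0);
  return kBaseMiB + kPerFileMiB * nfiles;
}

// Convert MiB to GiB (1 GiB = 1024 MiB).
double mib_to_gib(double mib) { return mib / 1024.0; }
```

For 28 files, mem_used_mib(28) evaluates to 466.5 + 213.8 * 28 = 6452.9 MiB, which is about 6.3 GiB. A single file on its own comes to 680.3 MiB.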

This implies to me that if we switch our automated data processing to a single powerful box, and stuff it with say 16 GiB or more of RAM, we should be able to process our data even if we cannot fix any more memory leaks.

#11

Updated by Dobbs, Adam about 6 years ago

Alternatively if we split the processing up on the GRID to operate file by file, we should be able to run successfully on the GRID with the speedup version currently in the trunk (I'll put it in a release soon). Does this sound feasible?

NB: I still want to fix the memory leaks, it's just nice to know we have a plan B.

#12

Updated by Dobbs, Adam almost 5 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

All major leaks fixed, only small ones left, GRID up and running.
