I am currently contracting for WANdisco to help a customer whose merge
is using excessive RAM. The merge will not complete with 4 GB RAM and
will complete with 5 GB RAM available.
The branches involved have subtree mergeinfo on over 3500 files, each
referring to about 350 branches on average, and just over 1 revision
range on average per mergeinfo line. Average path length is under 100 bytes.
This already seems far too much memory for the size of the data set,
and the data set is growing.
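As a rough back-of-envelope (my own estimate, not a measurement), the raw textual size of that mergeinfo follows from the averages above; the bytes-per-revision-range figure is an assumption:

```python
# Back-of-envelope size of the customer's raw mergeinfo text,
# using the averages quoted above. RANGE_BYTES is an assumed
# figure ("NNNNN-NNNNN," plus separators), not a measured one.
FILES_WITH_MERGEINFO = 3500
BRANCHES_PER_FILE = 350       # mergeinfo lines per file, on average
PATH_BYTES = 100              # average branch path length
RANGE_BYTES = 12              # assumed bytes per revision range

bytes_per_line = PATH_BYTES + RANGE_BYTES
total_bytes = FILES_WITH_MERGEINFO * BRANCHES_PER_FILE * bytes_per_line
print(f"{total_bytes / 2**20:.0f} MB of raw mergeinfo text")
```

If that estimate is in the right ballpark (on the order of 100+ MB of property text), then failing to complete in 4 GB implies a blow-up of well over an order of magnitude when this is parsed into in-memory structures.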
Issue #4667 is about reducing the amount of RAM Subversion uses given
this data set. Another way to approach the problem is to reduce the
amount of subtree mergeinfo by changing the workflow practices; that
approach is also being investigated but is not in the scope of this
issue, except to note that the tools "svn-mergeinfo-normalizer" and
"svn-clean-mergeinfo.pl" both also fail to execute in the available RAM.
The reproduction recipe I'm using so far is attached to the issue. It
generates a repository with N=300 (for example) branches, each with a
unique file changed and merged to trunk, such that trunk gets N files
with subtree mergeinfo, each referring to up to N branches (half of N
on average). I can then run test merges, with debugging prints in
them, to view the memory usage:
# this runs a merge from trunk to branch,
# with WC directory 'A' switched to a branch:
svn revert -q -R A/ && \
svn merge -q ^/A A
DBG: merge.c:12587: using 8+3 MB; increase +2 MB
DBG: merge.c:12418: using 8+25 MB; increase +21 MB
DBG: merge.c:12455: using 8+34 MB; increase +9 MB
DBG: merge.c:9378: using 8+37 MB; increase +3 MB
DBG: merge.c:9378: using 8+43 MB; increase +6 MB
I don't know how representative this repro-test is of the customer's use
case, but it provides a starting point.
Monitoring the memory usage (RSS on Linux) of the 'svn' process (see the
issue for code used), I find:
original: baseline 8 MB (after process started) + growth of 75 MB
after r1776742: baseline 8 MB + growth of 50 MB
after r1776788: baseline 8 MB + growth of 43 MB
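For reference, the kind of RSS sampling involved (a minimal Linux-only sketch of my own, not the exact code attached to the issue) just reads VmRSS from /proc:

```python
def rss_mb(pid="self"):
    # Read the resident set size (VmRSS) of a process from /proc
    # on Linux; returns megabytes, or None if the field is absent.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is in kB
    return None
```

Sampling this for the 'svn' PID just after start-up and at the debugging prints gives the baseline-plus-growth figures above.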
Those two commits introduce subpools to discard temporary mergeinfo
after use. There are no doubt more opportunities to tighten the memory
usage with subpools. This approach may help, but it seems unlikely to
deliver the order-of-magnitude (or order-of-complexity) reduction that
will probably be needed.
I would like to try a different approach. We read, parse and store all
the mergeinfo, whereas I believe our merge algorithm is only interested
in the mergeinfo that refers to one of exactly two branches ('source'
and 'target') in a typical merge. The algorithm never searches the
'graph' of merge ancestry beyond those two branches. We should be able
to read, parse and store only the mergeinfo we need.
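To illustrate the idea (a toy sketch of my own over the property's textual form, not Subversion's svn_mergeinfo_t API; the function name is hypothetical), filtering at parse time would keep only the entries for the two branches of interest:

```python
def filter_mergeinfo(mergeinfo_text, keep_paths):
    # Keep only the mergeinfo lines whose branch path is in
    # keep_paths. Each line of an svn:mergeinfo value has the
    # form "/branch/path:revision-ranges".
    kept = []
    for line in mergeinfo_text.splitlines():
        path, _, _ranges = line.rpartition(":")
        if path in keep_paths:
            kept.append(line)
    return "\n".join(kept)
```

With ~350 branches per mergeinfo line, everything referring to the other ~348 branches would then never be parsed into revision-range lists at all.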
Another possible approach could be to store subtree mergeinfo in a
"delta" form relative to a parent path's mergeinfo.
Received on 2017-01-03 15:58:39 CET