Re: #4667, Merge uses large amount of memory

From: Stefan Fuhrmann <stefan2_at_apache.org>
Date: Tue, 3 Jan 2017 21:30:18 +0100

On 03.01.2017 15:58, Julian Foad wrote:
> https://issues.apache.org/jira/browse/SVN-4667
>
> I am currently contracting for WANdisco to help a customer whose merge is using
> excessive RAM. The merge will not complete with 4 GB RAM and will complete with
> 5 GB RAM available.
>
> The branches involved have subtree mergeinfo on over 3500 files, each referring
> to about 350 branches on average, and just over 1 revision range on average per
> mergeinfo line. Average path length is under 100 bytes.

What is the result of 'svn pg "svn:mergeinfo" -R | wc -c'?
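
Until that number is available, a rough estimate can be derived from the
figures quoted above. The sketch below is only a back-of-envelope calculation:
the file count, branch count and path length come from the report, while the
12 bytes per revision range (and the helper name) are assumptions made for
illustration.

```python
# Back-of-envelope estimate of the raw svn:mergeinfo text size.
# files, lines_per_file and path_bytes are taken from the quoted report;
# bytes_per_range is an ASSUMPTION, not a measured value.

def estimate_mergeinfo_bytes(files, lines_per_file, path_bytes,
                             ranges_per_line=1, bytes_per_range=12):
    # Each mergeinfo line has the shape "<path>:<ranges>\n".
    line_bytes = path_bytes + 1 + ranges_per_line * bytes_per_range + 1
    return files * lines_per_file * line_bytes

total = estimate_mergeinfo_bytes(files=3500, lines_per_file=350, path_bytes=100)
print(f"about {total / 1e6:.0f} MB of raw mergeinfo text")
```

Under these assumptions the raw text is on the order of 140 MB, so the
measured output of the command above would show how far the real data set
deviates from that.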

> This seems already far too much memory usage for the size of the data set, and
> the size of the data set is growing.
>
> Issue #4667 is about reducing the amount of RAM Subversion uses given this data
> set. Another way to approach the problem is to reduce the amount of subtree
> mergeinfo by changing the work flow practices; that approach is also being
> investigated but is not in the scope of this issue, except to note that the
> tools "svn-mergeinfo-normalizer" and "svn-clean-mergeinfo.pl" both also fail to
> execute in the available RAM.

You may run svn-mergeinfo-normalizer on arbitrary sub-trees.

A lot of memory will be used to hold that part of the repository
history that is relevant to the branches mentioned in the m/i.
This may easily grow to several GB if there have been tens of
millions of changes.

If the tool manages to read the mergeinfo, it will print m/i
stats before fetching the log. Does it get to this stage?

> The reproduction recipe I'm using so far is attached to the issue. It generates
> a repository with N=300 (for example) branches, each with a unique file changed,
> and merged to trunk such that trunk gets N files with subtree mergeinfo, each
> referring to up to N branches (half of N, on average).
>
> I can then run test merges, with debugging prints in them, to view the memory
> increase:
>
> # this runs a merge from trunk to branch,
> # with WC directory 'A' switched to a branch:
> $ (cd obj-dir/subversion/tests/cmdline/svn-test-work/working_copies/mergeinfo_tests-14/ && \
>    svn revert -q -R A/ && \
>    svn merge -q ^/A A)
> DBG: merge.c:12587: using 8+3 MB; increase +2 MB
> DBG: merge.c:12418: using 8+25 MB; increase +21 MB
> DBG: merge.c:12455: using 8+34 MB; increase +9 MB
> DBG: merge.c:9378: using 8+37 MB; increase +3 MB
> DBG: merge.c:9378: using 8+43 MB; increase +6 MB
>
> I don't know how representative this repro-test is of the customer's use case,
> but it provides a starting point.
>
> Monitoring the memory usage (RSS on Linux) of the 'svn' process (see the issue
> for code used), I find:
>
> original: baseline 8 MB (after process started) + growth of 75 MB
> after r1776742: baseline 8 MB + growth of 50 MB
> after r1776788: baseline 8 MB + growth of 43 MB

I noticed that the w/c context object seems to use a fluctuating
amount of memory, raising the baseline closer to 16 MB. In other
words, your relative savings may actually be larger than these
numbers suggest.

> Those two commits introduce subpools to discard temporary mergeinfo after use.
> There are no doubt more possibilities to tighten the memory usage using
> subpools. This approach might be very useful, but seems unlikely to deliver an
> order-of-magnitude or an order-of-complexity reduction that probably will be
> needed.
>
> I would like to try a different approach. We read, parse and store all the
> mergeinfo, whereas I believe our merge algorithm is only interested in the
> mergeinfo that refers to one of exactly two branches ('source' and 'target') in
> a typical merge. The algorithm never searches the 'graph' of merge ancestry
> beyond those two branches. We should be able to read, parse and store only the
> mergeinfo we need.

That seems to be the path to take. I would have assumed that we only
need the m/i for the source branch, because the target m/i is implied:
it is simply all of the target's own history.
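
To illustrate the idea of keeping only the relevant lines, here is a minimal
sketch in Python. It is not how libsvn_client does it; the function names are
made up for this example, and it assumes the usual property layout of one
"<source path>:<revision ranges>" entry per line.

```python
def parse_mergeinfo(text):
    """Parse svn:mergeinfo property text into {source_path: ranges_string}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A path may itself contain ':', so split on the last one.
        path, _, ranges = line.rpartition(":")
        result[path] = ranges
    return result

def is_within(path, branch):
    """True if path is the branch root or lies below it."""
    branch = branch.rstrip("/")
    return path == branch or path.startswith(branch + "/")

def filter_mergeinfo(text, branches):
    """Keep only entries whose source path lies within one of the branches."""
    return {path: ranges
            for path, ranges in parse_mergeinfo(text).items()
            if any(is_within(path, b) for b in branches)}

prop = ("/branches/b1/foo.c:5-9\n"
        "/branches/b2/foo.c:12\n"
        "/branches/b10/foo.c:20-22*\n")
kept = filter_mergeinfo(prop, ["/branches/b1"])
print(kept)
```

With one source branch of interest, only a single entry per property survives
instead of hundreds. A real implementation would want to skip the irrelevant
lines before building any rangelist structures for them, which this sketch
does not attempt to model.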

> Another possible approach could be to store subtree mergeinfo in a "delta" form
> relative to a parent path's mergeinfo.

I can see two problems here. First, you can only use the new scheme
after all "relevant", i.e. merging, clients have been upgraded. More
importantly, the in-memory data model would need to be something
delta-like. That sounds like a lot of code-churn.
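
To make the second point concrete, here is a small sketch of what a
"delta-like" model might look like. Everything in it is hypothetical: the
function names, the added/removed representation, and the simplification of
using plain sets of revision numbers instead of rangelists. No such format
exists in Subversion.

```python
def inherit(parent, rel_path):
    """Mergeinfo a child at rel_path would inherit from its parent."""
    return {f"{src}/{rel_path}": set(revs) for src, revs in parent.items()}

def make_delta(parent, child, rel_path):
    """Express the child's mergeinfo as (added, removed) vs. the inherited one."""
    base = inherit(parent, rel_path)
    added, removed = {}, {}
    for src in base.keys() | child.keys():
        b = base.get(src, set())
        c = child.get(src, set())
        if c - b:
            added[src] = c - b
        if b - c:
            removed[src] = b - c
    return added, removed

def apply_delta(parent, delta, rel_path):
    """Reconstruct the child's full mergeinfo from the parent plus a delta."""
    added, removed = delta
    result = inherit(parent, rel_path)
    for src, revs in removed.items():
        result[src] -= revs
    for src, revs in added.items():
        result.setdefault(src, set()).update(revs)
    # Drop sources that ended up with no merged revisions.
    return {src: revs for src, revs in result.items() if revs}

parent = {"/branches/b1": {5, 6, 7}, "/branches/b2": {12}}
child = {"/branches/b1/foo.c": {5, 6, 7},
         "/branches/b2/foo.c": {12},
         "/branches/b3/foo.c": {20}}
delta = make_delta(parent, child, "foo.c")
print(delta)
```

The delta is tiny when the subtree mostly mirrors its parent, but every
consumer of the data now has to go through something like apply_delta()
before it can answer a question about the child, which is where the churn
would come from.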

-- Stefan^2.
Received on 2017-01-03 21:30:26 CET
