
Re: mem bug progress: got the cause, fix planned

From: <kfogel_at_collab.net>
Date: 2001-08-24 17:27:26 CEST

I wrote:
> A quick status report on the showstopper "cannot allocate memory" bug:
> we're pretty sure we've got the cause, and a fix is known.

Make that: "we're sure".

This morning I ran one more test to verify the theory. The test was
to turn off deltification (and hence undeltification) and run
`mass-commit' again. This time it ran through all 1000 commits, no
memory problems, an order of magnitude faster than it did before.

Of course, the repository was still 86 megs even after pruning log
files, but hey, what did you expect? :-)

The fix for this particular problem is a rewrite of the commit code
path on the server side, described earlier, which will result in
dramatically more efficient commits independently of any improvements
we make in the undeltification code. However, we still obviously need
to improve undeltification, to retrieve old revisions efficiently, so
that's one of the top five after M3; see issue #414.

Also, note that eleven of the commits fail client-side due to issue
#461, which is a known bug. #461 is on the post-M3 list, but it won't
stop anyone from doing development or anything.

-K

> The fix is a fairly deep reworking of how up-to-date checks are done
> during commits, and may take a few days to implement. Fortunately,
> the benefit will be not merely that this bug goes away, but that
> commits become *much* more efficient than we ever thought they would
> be.
>
> The long story:
>
> (Apologies if this explanation needs more editing; I want to send this
> asap so people know what's going on.)
>
> Greg Stein and Ben and I have been investigating a bug whereby the fs
> runs out of memory when committing from a working copy that is mostly
> far behind HEAD. This is the situation produced by the `mass-commit'
> script, which you may recall was posted here recently. The
> `mass-commit' script imports an entire Subversion source tree into a
> newly-minted repository, thus creating revision 1, and then checks it
> out into a working copy. It then runs through a cycle of 1000
> commits, randomly modifying a few files at various places in the wc
> tree and committing those files (by name) each time. It's only doing
> content changes, by the way, no property mods.
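>
> (In case it helps to picture it, here is a rough reconstruction of
> what `mass-commit' does, as a Python sketch. This is *not* the actual
> script that was posted; the paths, the number of files touched per
> commit, and the exact svn command-line syntax are made up for
> illustration.)
>
>     #!/usr/bin/env python
>     # Hypothetical mass-commit-style driver (illustration only).
>     # Assumes an `svn' client on the PATH and a source tree to import.
>     import os, random, subprocess
>
>     REPOS_URL = "file:///tmp/repos"   # assumed repository location
>     WC_DIR    = "/tmp/wc"             # working copy, checked out once
>
>     def run(*cmd):
>         subprocess.check_call(cmd)
>
>     # Import a whole source tree as revision 1, then check it out.
>     run("svn", "import", "/tmp/svn-src", REPOS_URL, "-m", "initial import")
>     run("svn", "checkout", REPOS_URL, WC_DIR)
>
>     # Gather the versioned files in the working copy.
>     files = [os.path.join(d, name)
>              for d, subdirs, names in os.walk(WC_DIR)
>              if ".svn" not in d
>              for name in names]
>
>     # 1000 commits; each one appends to a few random files and commits
>     # them by name.  The working copy is never updated, so every
>     # directory stays at revision 1.
>     for i in range(1000):
>         targets = random.sample(files, 3)
>         for path in targets:
>             with open(path, "a") as fh:
>                 fh.write("change %d\n" % i)
>         run("svn", "commit", "-m", "change %d" % i, *targets)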
>
> It never updates that working copy, so all the directories remain at
> revision 1, and the files end up at all sorts of revisions, from 1 to
> around 500 (on my box, it's commit 563 that finally runs out of
> memory, but Your Mileage May Vary, of course). This happens over
> *both* ra_local and ra_dav.
>
> We first tried the following experiment:
>
>    Scenario 1                   Scenario 2
>    ------------------------     ---------------------------
>    Try to commit those files    Try to commit those files
>    from a working copy that     from a working copy that
>    had never been updated,      was just checked out from
>    i.e., as the script does     the repository, so the whole
>    it.                          working copy is at HEAD.
>
> As you might expect, Scenario 1 fails with the out-of-memory error as
> it had always done, yet Scenario 2 succeeds instantly. Hmmm.
>
> Instrumentation revealed that the memory runs out after all the files'
> base revisions have been sent to the repository, but before the txn
> has been brought to a mergeable state. That is, the txn is
> constructed, and all the files being committed have been replaced
> using calls to svn_fs_link(), so that the txn now reflects an
> appropriate base state against which svndiff deltas can now be applied
> to reflect the working copy changes. But the changes haven't actually
> been applied yet, so every node revision in the txn still reflects a
> state that actually exists in a committed revision somewhere. And
> merge() hasn't been called yet, either, of course. With pool
> debugging -- thanks to Mike Pilato for the quick lesson -- we saw
> incredible numbers of pools (like 50,000) being created, cleared, and
> destroyed mostly in the svndiff'ing and undeltification code, all
> before txn_body_merge() was ever called.
>
> Remember this state; we'll be coming back to it. :-)
>
> Also, before we go on, note that this problem occurs with both
> ra_local and ra_dav, even though those two build their txns in
> opposite ways: ra_dav builds it based on the head revision, whereas
> ra_local builds it based on the directory revision at the root of the
> change in the working copy. We thought for a long time that, since
> our problem reproduced best with out-of-date working copies, it must
> have something to do with the revision on which you base the txn, but
> noooo...
>
> The problem has to do with undeltification inefficiency compounded
> by the way file basetexts are obtained for svndiff applications.
> The former is a known problem which we are planning to address after
> M3, and luckily, it won't be necessary to solve it to make commits
> work today. It's the latter half that's the real issue. Here's an
> illustration of the problem:
>
> 1. You have a mixed-revision working copy, like the one produced by
> mass-commit. All directories are at rev 1, files are at various
> revisions.
>
> 2. You make many commits. No problem, though things seem to be
> getting a bit slow...
>
> 3. You commit a change to, among other things, the file
> `subversion/libsvn_fs/fs.c', so it ends up with revision 245.
>
> 4. You do many other commits. None of them touch `fs.c', but many
> of the commits do result in its parent, grandparent, or
> great-grandparent directory being "bubbled up" and receiving a
> new revision number in the repository.
>
> 5. You try to commit another change to `fs.c'. Now the head is at
> revision 558, and while `fs.c' has not changed since 245, its
> parents have changed many times...
>
> At this point the repository needs to check if your fs.c is
> up-to-date, and if it is, the fs needs to retrieve your revision of
> that file so the incoming svndiff can be applied to it.
>
> Actually, that's the problem. In order to even do the check, the fs
> *thinks* it needs revision 245 of the file, so it can compare that
> node id with the one in the head. But since the various parent
> directories have been changed a lot since then, fetching the old
> entry's node ID involves a *lot* of undeltification, which costs way
> too much right now, and frankly will never be truly cheap. Whenever
> we can avoid it, we should. At the moment, there's so much of it
> going on and it's so expensive that we actually run out of memory.
>
> There's another way to do things, fortunately. We have a magic "back
> door" to get what we need, without *ever* fetching that old directory
> listing. Every node remembers what revision it was created
> (committed) in. So we have these pieces of data:
>
> 1. The file's revision number on the client side.
>
> 2. A node revision for it in the repository head, cheaply
> obtainable because the head is fulltext, or still very near
> fulltext by the time you get a handle on it anyway.
>
> 3. The revision number in which the head revision of the file was
> created, via point (2).
>
> Thus we can use this new commit algorithm on the server side, whose
> big advantage is that it avoids undeltifying numerous parent
> directories just to discover an old node-rev-id (a rough sketch of
> the decision logic follows the outline below):
>
> 1. First of all, ra_local should base the transaction on the youngest
>    revision Y, as ra_dav does, not the revision of the working
>    copy parent. This makes for cheaper merges and some more
>    convenient code paths, although it doesn't directly solve this
>    bug or anything.
>
> 2. For each committed target TGT at revision N that we receive:
>
>    - Get the node-rev-id of TGT at revision Y. This is a
>      fulltext retrieval, therefore cheap.
>
>    - Look inside this node-rev-id; it will tell you what
>      revision it was committed in. Call that revision L.
>
>    - if (N < L)
>
>         Then TGT at revision N is obviously out-of-date, because
>         somebody changed it in revision L. Signal a conflict
>         and bail early. Note that you never looked at the node
>         ids themselves.
>
>      else if (L <= N <= Y)
>
>         Everything is fine; TGT at revision N is up-to-date,
>         because we know that nobody has changed the node-rev-id
>         between revisions L and Y. Drop Y's idea of TGT's
>         node-rev-id into the transaction, and await a
>         text-delta.
>
>      else if (N > Y)
>
>         A very rare situation, though possible if you really
>         work at it (we know a scenario, but it's one that will
>         "never" happen). Anyway, just bounce back with an
>         out-of-date error, or else re-base the txn on the new
>         youngest revision and redo the changes.
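>
> (To make the three cases above concrete, here is a minimal sketch of
> the decision logic in Python. The names -- check_target,
> get_created_rev, OutOfDate -- are made up for illustration and don't
> correspond to actual libsvn_fs entry points.)
>
>     class OutOfDate(Exception):
>         pass
>
>     def check_target(n, y, get_created_rev):
>         """Decide whether a committed target whose working-copy
>         revision is n is up-to-date against youngest revision y,
>         without undeltifying any old parent directories.
>         get_created_rev(y) stands for reading the created-rev L out
>         of y's node-rev-id for the target (cheap: y is fulltext, or
>         close to it)."""
>         l = get_created_rev(y)
>         if n < l:
>             # n < L: somebody changed the target in revision L;
>             # signal a conflict and bail early.
>             raise OutOfDate("changed in r%d, working copy has r%d" % (l, n))
>         elif n <= y:
>             # L <= n <= y: nothing touched the node between L and y,
>             # so drop y's node-rev-id into the txn and await the
>             # text-delta.
>             return "up-to-date"
>         else:
>             # n > y: the rare case; bounce with an out-of-date error
>             # (or re-base the txn on the new youngest revision).
>             raise OutOfDate("r%d is newer than youngest r%d" % (n, y))
>
> (With the fs.c example above: N = 245, Y = 558, and L = 245, so the
> middle branch fires and the commit proceeds. Added and deleted
> targets and the fancier checks are omitted, as hand-waved below.)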
>
> That's the basic idea, though I've hand-waved on a few details --
> handling for added and deleted files, for example, plus you can see
> that some of the checks can be a bit fancier, and by remembering the
> node-rev we can do a predecessor/successor check as well, blah blah
> blah. Ben and I will be sitting down tomorrow morning and figuring
> out exactly what we'll need to change.
>
> -Karl
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org