Re: [PATCH] Skip-deltas, for review

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-07-27 09:45:33 CEST

On Sat, 2002-07-27 at 00:42, William Uther wrote:
> The total length of arrow is the sum of the rows. There are log(N) rows,
> each with N/2 arrow length, giving O(N log(N)) total arrow length. That's a
> factor of log(N) more, not constant.

Oof, you're right. I'm afraid my space analysis was guided too much by
intuition (my intuitive reasoning went something like: like "half the
arrows are length two, a quarter are length four, an eighth are length
eight, and so it all averages out to two", which doesn't follow and
doesn't even stem from the right premise).

So, in the long run, the average delta crosses lg(N) node-revisions.
I'm really bothered by this result; lg(1024) is 10, after all, and I
don't want to expand people's repositories by an order of magnitude
unless they really want that. That's an absolute worst case, of course;
a delta across N revisions is generally much less than N times the delta
across one revision, because of reduced overhead and overlap between the
deltas. None of my tests, artificial or natural, saw anything
approaching a 2x increase in space, much less 10x. (The artificial test
with the single monotonically shrinking file saw a space increase of
20%; sorry that I mistakenly said 5-10% earlier.)

Checkpoint plaintexts or checkpoint-redeltification-aginst-head are
potentially much worse, space-wise; the space cost grows as O(N^2)
instead of O(N*lg(N)). Vdelta compression helps, but only by a constant
factor.

When it comes to commit time, I remain unconcerned:

> >>> Remember that the average slowdown in commit speed is O(log(log(N))
> >>> - we do log N re-deltifications every log N steps. That's REALLY
> >>> small.
> >
> > No, it's a constant factor.
>
> Not sure I believe you. But either way it is small.

The basic algorithm is: we do one redeltification all the time, another
one half the time, another one a quarter of the time, another one an
eighth of the time, and so on. That converges to 1+.5+.25+.125+...=2
redeltifications per node-revision, on average. It doesn't take
appreciably more time to redeltify across a large swath of
node-revisions than it does to redeltify across just one.

By skipping the second row, we subtract the second term of that sequence
and get 1.5 instead of 2. By not doing skip-deltas in the first 32
revisions, we make it take a while to approach that 1.5 factor. So we
don't get a lot of overhead up front, and by the time we do start seeing
overhead, we're also seeing significant benefits in our access speed.

We do walk an average of lg(N) revisions down the predecessor chain. I
don't think that's a significant factor, though, and my tests bear that
out.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Jul 27 09:46:18 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.