On Sat, 2002-07-27 at 00:42, William Uther wrote:

> The total length of arrow is the sum of the rows. There are log(N) rows,
> each with N/2 arrow length, giving O(N log(N)) total arrow length. That's a
> factor of log(N) more, not constant.

Oof, you're right. I'm afraid my space analysis was guided too much by
intuition (my intuitive reasoning went something like "half the arrows
are length two, a quarter are length four, an eighth are length eight,
and so it all averages out to two", which doesn't follow and doesn't
even stem from the right premise).

So, in the long run, the average delta crosses lg(N) node-revisions.
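
To make the corrected count concrete, here's a minimal sketch (it
assumes the usual power-of-two skip-delta rule where revision r's delta
reaches back to r with its lowest set bit cleared; the exact rule in the
proposal may differ, but any power-of-two scheme has the same shape):

    # Rough sketch of total and average delta "arrow length" under a
    # power-of-two skip-delta rule.  Assumption: revision r deltas
    # against r with its lowest set bit cleared, so the arrow length
    # is that bit.
    def arrow_length(r):
        return r & -r          # lowest set bit of r

    for n in (64, 1024, 16384):
        total = sum(arrow_length(r) for r in range(1, n + 1))
        # total grows like (N/2)*lg(N), so the average arrow crosses
        # O(lg N) node-revisions rather than a constant number.
        print(n, total, total / n)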

I'm really bothered by this result; lg(1024) is 10, after all, and I
don't want to expand people's repositories by an order of magnitude
unless they really want that. That's an absolute worst case, of course;
a delta across N revisions is generally much less than N times the
delta across one revision, because of reduced overhead and overlap
between the deltas. None of my tests, artificial or natural, saw
anything approaching a 2x increase in space, much less 10x. (The
artificial test with the single monotonically shrinking file saw a
space increase of 20%; sorry that I mistakenly said 5-10% earlier.)

Checkpoint plaintexts or checkpoint-redeltification-against-head are
potentially much worse, space-wise; the space cost grows as O(N^2)
instead of O(N*lg(N)). Vdelta compression helps, but only by a constant
factor.
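
For a rough sense of that difference, here's a crude model (it assumes
"against head" means each older revision's delta has to span all the
revisions between it and the head, and it uses delta span as a stand-in
for size, ignoring vdelta's constant-factor help):

    # Total delta span: everything deltified against head (~N^2/2)
    # vs. a power-of-two skip-delta scheme (~(N/2)*lg N).  "Span" just
    # counts how many revisions a delta crosses.
    def skip_span(r):
        return r & -r                 # assumed skip-delta rule, as above

    for n in (256, 1024, 4096):
        against_head = sum(n - r for r in range(1, n))
        skip_total = sum(skip_span(r) for r in range(1, n + 1))
        print(n, against_head, skip_total)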

When it comes to commit time, I remain unconcerned:

> >>> Remember that the average slowdown in commit speed is O(log(log(N))
> >>> - we do log N re-deltifications every log N steps. That's REALLY
> >>> small.
> >
> > No, it's a constant factor.
>
> Not sure I believe you. But either way it is small.

The basic algorithm is: we do one redeltification all the time, another
one half the time, another one a quarter of the time, another one an
eighth of the time, and so on. That converges to 1 + .5 + .25 + .125 +
... = 2 redeltifications per node-revision, on average. It doesn't take
appreciably more time to redeltify across a large swath of
node-revisions than it does to redeltify across just one.
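
Here's the counting argument as a quick sketch (the trigger rule --
"one every commit, another every 2nd commit, another every 4th, ..." --
is my reading of the scheme, not the actual FS code):

    # Redeltifications triggered by commit number "commit" under the
    # assumed rule: level k fires whenever commit is divisible by 2^k.
    def redeltifications(commit):
        count, step = 0, 1
        while commit % step == 0:
            count += 1
            step *= 2
        return count

    n = 100000
    avg = sum(redeltifications(c) for c in range(1, n + 1)) / n
    print(avg)   # approaches 1 + 1/2 + 1/4 + ... = 2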

By skipping the second row, we subtract the second term of that
sequence and get 1.5 instead of 2. By not doing skip-deltas in the
first 32 revisions, we make it take a while to approach that 1.5
factor. So we don't get a lot of overhead up front, and by the time we
do start seeing overhead, we're also seeing significant benefits in our
access speed.
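
The same sketch with both tweaks applied (again, my assumptions about
how the tweaks would be wired in):

    # Variant: skip the second row (the every-2nd-commit level) and do
    # no skip-deltas at all for the first 32 revisions -- assumptions
    # about how the tweaks apply, for illustration only.
    def redeltifications(commit):
        if commit <= 32:
            return 0               # plain deltas only early on (assumed)
        count, step = 0, 1
        while commit % step == 0:
            if step != 2:          # the skipped second row
                count += 1
            step *= 2
        return count

    for n in (100, 1000, 100000):
        avg = sum(redeltifications(c) for c in range(1, n + 1)) / n
        print(n, avg)    # creeps up toward 1 + 1/4 + 1/8 + ... = 1.5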

We do walk an average of lg(N) revisions down the predecessor chain. I
don't think that's a significant factor, though, and my tests bear that
out.
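
For completeness, the read-side sketch (same assumed rule: the
predecessor of r is r with its lowest set bit cleared, so the number of
deltas composed to reconstruct r is just the number of set bits in r):

    # Average predecessor-chain length under the assumed skip-delta
    # rule: chain length for revision r is popcount(r), which averages
    # about lg(N)/2, i.e. O(lg N).
    def chain_length(r):
        return bin(r).count("1")   # popcount

    for n in (1024, 65536):
        avg = sum(chain_length(r) for r in range(1, n + 1)) / n
        print(n, avg)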
