Re: Whither delta combiner

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-02-10 00:01:02 CET

On Fri, 2002-02-08 at 23:29, Karl Fogel wrote:
> Both proposals trade delta space for faster retrieval of old
> revisions; haven't done a formal proof, but I'm pretty sure both
> methods end up being exactly the same tradeoff, and I think it's about
> the same as the skiplist tradeoff, too. (I haven't thought as much
> about the skiplists, so this is just an off-the-cuff reaction.)

I don't think so. Let's say we have a node with N node-revisions (no
branching), whose average plaintext size is S (after compression), and
for which the average delta size between two adjacent revisions is D.
For simplicity, we'll assume that a delta between revs X and Y is
D*|X-Y|, which is true for something like a ChangeLog file which mostly
just has data added to it in each rev. For such a file, S should grow
as ND.

Let's assume we want at most lg(N) deltas to get to any revision, since
that's what the simplest case of skiplist deltas gets us. Then, the
space costs are:

  Checkpoint plaintexts: O(DN^2/lg(N))
  Re-deltifying: O(DN^2/lg(N))
  Skiplist deltas: O(DNlg(N))

To plug in some vaguely believable numbers, let's say we have a
ChangeLog file with 8192 revisions, an average delta size of 256 bytes,
and an average plaintext size of one meg (which is 8192*256/2).
Plaintext deltas or re-deltifying would cost on the order of 630MB,
while skiplist deltas would cost on the order of 26MB or so.

For those with both skepticism and patience, here is how I get those
orders of magnitude:

Checkpoint plaintexts: We need N/lg(N) plaintexts, of average size S,
plus N-N/lg(N) adjacent-revision deltas. That's SN/lg(N) +
D(N-N/lg(N)), the second term of which is insignificant. Since S grows
as ND, we get DN^2/lg(N).

Re-deltifying: Re-deltifying is fundamentally the same as plaintext
deltas for a file like this; we just diff every lg(N) revisions against
the head instead of against the empty rev 0, effectively. Same order of
growth.

Skiplist deltas: The bottom row of N adjacent-rev deltas takes D*N
space, of course. The next row up also takes D*N space; there are half
as many deltas but they are each twice as big. And so on; since there
are lg(N) rows, we need DNlg(N) space.

Some might wonder about files which have lots of revs but don't behave
like ChangeLog files, such that a delta across many revs is only about
as big as a delta across one rev. In this case, checkpoint plaintexts
will do terribly (order SN/lg(N)), redeltification against the head will
do very well (order DN), and skiplist deltas will do almost as well
(order DN with a larger constant).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:37:06 2006

This message: [ Message body ]
Next message: Sean Russell: "Re: Patch command execution"
Previous message: Daniel Berlin: "Re: Whither delta combiner"
In reply to: Karl Fogel: "Re: Whither delta combiner"
Next in thread: Daniel Berlin: "Re: Whither delta combiner"
Reply: Daniel Berlin: "Re: Whither delta combiner"
Reply: Karl Fogel: "Re: Whither delta combiner"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]