Re: Whither delta combiner

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2002-02-11 18:32:20 CET

Greg Hudson <ghudson@MIT.EDU> writes:
> Skiplist deltas: The bottom row of N adjacent-rev deltas takes D*N
> space, of course. The next row up also takes D*N space; there are half
> as many deltas but they are each twice as big. And so on; since there
> are lg(N) rows, we need DNlg(N) space.

It's the part about there being lg(N) rows that puzzles me. Sorry to
be so dense, I feel as though I must be missing something here...

Earlier, you wrote

> Skiplist deltas, you mean? I'll give a simple description. Let's say
> we have a node with eight node-revisions (and no branching), and we only
> want to have one plaintext of that file--the most recent. Right now, I
> assume the node looks something like this (where "1" should really be
> "N.1" where N is the node number, but I'll abbreviate):
>
> 1 <-- 2 <-- 3 <-- 4 <-- 5 <-- 6 <-- 7 <-- 8
>
> To get the plaintext of rev 1, we need to apply seven deltas. With
> skiplist deltas, the node would look like:
>
> 4 <-------------------- 8
> 2 <-------- 4 <-------- 6 <-------- 8
> 1 <-- 2 <-- 3 <-- 4 <-- 5 <-- 6 <-- 7 <-- 8
>
> Now to get the plaintext of rev 1, we need to apply only three deltas (8
> -> 4 -> 2 -> 1), and there shouldn't be more than five deltas between
> any two revs (no more than four, looks like).

If the number of rows were truly lg(N), then the minimum number of
delta applications required to travel back a distance X would grow
without bound... Oh, wait, I see it! You jump from row to row,
narrowing down to the granularity you need at any given point. D'oh,
I knew that. :-)

Is it that we force the number of skiplist rows per node to be lg(N),
and then within each row, space the skips evenly (or as evenly as
integrally possible), such that the skip distance in each row is half
or twice that of the rows above and below, respectively? Of course,
one needs to make sure that the "destination" nodes coincide as often
as possible from row to row.

Okay, with my head cleared, I reread your explanation below and it
makes sense, even though somehow counterintuitive to me. Thanks.

The practical question then is, how would we change the repository
format to store skiplist deltas?

-K

Greg Hudson <ghudson@MIT.EDU> writes:
> On Fri, 2002-02-08 at 23:29, Karl Fogel wrote:
> > Both proposals trade delta space for faster retrieval of old
> > revisions; haven't done a formal proof, but I'm pretty sure both
> > methods end up being exactly the same tradeoff, and I think it's about
> > the same as the skiplist tradeoff, too. (I haven't thought as much
> > about the skiplists, so this is just an off-the-cuff reaction.)
>
> I don't think so. Let's say we have a node with N node-revisions (no
> branching), whose average plaintext size is S (after compression), and
> for which the average delta size between two adjacent revisions is D.
> For simplicity, we'll assume that a delta between revs X and Y is
> D*|X-Y|, which is true for something like a ChangeLog file which mostly
> just has data added to it in each rev. For such a file, S should grow
> as ND.
>
> Let's assume we want at most lg(N) deltas to get to any revision, since
> that's what the simplest case of skiplist deltas gets us. Then, the
> space costs are:
>
> Checkpoint plaintexts: O(DN^2/lg(N))
> Re-deltifying: O(DN^2/lg(N))
> Skiplist deltas: O(DNlg(N))
>
> To plug in some vaguely believable numbers, let's say we have a
> ChangeLog file with 8192 revisions, an average delta size of 256 bytes,
> and an average plaintext size of one meg (which is 8192*256/2).
> Plaintext deltas or re-deltifying would cost on the order of 630MB,
> while skiplist deltas would cost on the order of 26MB or so.
>
>
> For those with both skepticism and patience, here is how I get those
> orders of magnitude:
>
> Checkpoint plaintexts: We need N/lg(N) plaintexts, of average size S,
> plus N-N/lg(N) adjacent-revision deltas. That's SN/lg(N) +
> D(N-N/lg(N)), the second term of which is insignificant. Since S grows
> as ND, we get DN^2/lg(N).
>
> Re-deltifying: Re-deltifying is fundamentally the same as plaintext
> deltas for a file like this; we just diff every lg(N) revisions against
> the head instead of against the empty rev 0, effectively. Same order of
> growth.
>
> Skiplist deltas: The bottom row of N adjacent-rev deltas takes D*N
> space, of course. The next row up also takes D*N space; there are half
> as many deltas but they are each twice as big. And so on; since there
> are lg(N) rows, we need DNlg(N) space.
>
> Some might wonder about files which have lots of revs but don't behave
> like ChangeLog files, such that a delta across many revs is only about
> as big as a delta across one rev. In this case, checkpoint plaintexts
> will do terribly (order SN/lg(N)), redeltification against the head will
> do very well (order DN), and skiplist deltas will do almost as well
> (order DN with a larger constant).
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:37:06 2006

This message: [ Message body ]
Next message: Karl Fogel: "Re: Trace and cancel update patches part 3"
Previous message: Brian Behlendorf: "Re: [PATCH] Fix for "svn log URL""
In reply to: Greg Hudson: "Re: Whither delta combiner"
Next in thread: Daniel Berlin: "Re: Whither delta combiner"
Reply: Daniel Berlin: "Re: Whither delta combiner"
Reply: Greg Hudson: "Re: Whither delta combiner"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]