Rep sharing and circular deltas

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: Thu, 27 Mar 2008 10:45:26 -0500

I've been poking around on the fs-rep-sharing branch lately to try and
determine how we can avoid circular deltas when reusing representations.
For those unfamiliar with the problem, here's a brief synopsis.
(Apologies if this sounds like a core dump from my brain...)

The purpose of fs-rep-sharing is to eliminate storing duplicate
representations in the fs. Instead of just writing a representation to
the database, we first check a cache of checksum->representation-key
mappings, and see if we can reuse an existing representation. This is
currently implemented for Berkeley DB.

Because the representations themselves are mutable in BDB (i.e., the
same representation can be full-text, and later rewritten as a delta
against another full text), a problem arises where we can get circular
delta references. Consider the following:

   +-----+
   | |
   | |
   | |
   +-----+
     A_1

A_1 is the first representation of the object, so we store the full text.

   +-----+ +-----+
   | | delta | |
   | | ----> | |
   | | | |
   +-----+ +-----+
     A_1 A_2

A_2 is a subsequent representation of the object, so we store its full
text, and store A_1 as a delta against it. Now, imagine we revert back
to the previous text in a later revision. Without rep-sharing we get:

   +-----+ +-----+ +-----+
   | | delta | | delta | |
   | | ----> | | ----> | |
   | | | | | |
   +-----+ +-----+ +-----+
     A_1 A_2 A_3

A_3 is now stored full text and A_2 is a delta against A_3, with A_1
being unchanged. However, with rep-sharing, we get:

      +-------------------------+
      v |
   +-----+ +-----+ |
   | | delta | | delta +-+
   | | ----> | | ----> |o|
   | | | | +-+
   +-----+ +-----+
     A_1 A_2 A_3

Where both A_1 and A_2 remain unchanged, and A_3, instead of being a new
representation, just refers to the previously written identical
representation, A_1. This turns out to be a problem, because A_3 is
really just A_1, which is a delta against A_2, which is a delta against
A_1, which is a delta against A_2, ad nauseam.

See the problem?

The representations don't even have to be this close to each other. In
other words, as long as two representations in the same "delta chain" as
the same, we've got the problem:

      +-------------------------------------------------+
      v |
   +-----+ +-----+ +-----+ |
   | | delta | | delta delta | | delta +-+
   | | ----> | | ----> ... ----> | | ----> |o|
   | | | | | | +-+
   +-----+ +-----+ +-----+
     A_1 A_2 A_n-1 A_n

The solution, would be to *not* rewrite A_n-1 in terms of A_n, but just
create the link to A_1:

      +---------------------------------------------+
      v |
   +-----+ +-----+ +-----+ |
   | | delta | | delta delta | | +-+
   | | ----> | | ----> ... ----> | | |o|
   | | | | | | +-+
   +-----+ +-----+ +-----+
     A_1 A_2 A_n-1 A_n

This breaks the "delta chain", with A_n-1 remaining a full text
representation, against which the relevant deltas could apply. An
interesting side effect is what happens on the subsequent commit:

      +---------------------------------------------+
      v |
   +-----+ +-----+ +-----+ | +-----+
   | | delta | | delta delta | | +-+ delta | |
   | | ----> | | ----> ... ----> | | |o| ----> | |
   | | | | | | +-+ | |
   +-----+ +-----+ +-----+ +-----+
     A_1 A_2 A_n-1 A_n A_n+1

We represent A_n+1 as full text, and rewrite A_n as a delta against
that. But, because A_n is the same representation as A_1, we really get:

      +---------------------------------------------+
      v |
   +-----+ +-----+ +-----+ | +-----+
   | | | | delta delta | | +-+ | |
   | |--+ | | ----> ... ----> | | |o| +-> | |
   | | | | | | | +-+ | | |
   +-----+ | +-----+ +-----+ | +-----+
     A_1 | A_2 A_n-1 A_n | A_n+1
            | delta |
            +--------------------------------------------+

Which works out okay: A_1 still represents the same content, it is just
being expressed differently. In a complex repository, where there are
multiple links to the same representation, that representation would get
rewritten several times, but the content should *never* change.

So far, the solution looks tenable. The real problem is determining
when to not rewrite the previous representation as a delta against a
linked representation. In other words, we have to answer the question:
"Is rep A_1 somehow dependent upon A_n-1?" I don't know how to do this
in constant time, and I'm leery of running forward through the delta
chain to find the full text every time we find a duplicate
representation. Are there other alternatives?

(For the curious, we don't have this problem in FSFS. Because committed
representations are immutable in FSFS, the representation we are
referring to always has an appropriate full text for the delta, and we
know it won't change. The real barrier on FSFS is creating an efficient
on-disk cache for checksum->representation mappings. Hmmmm, SQLite,
anybody?)

-Hyrum

application/pgp-signature attachment: OpenPGP digital signature

Received on 2008-03-27 16:45:42 CET

This message: [ Message body ]
Next message: Mark Phippard: "Re: RFC: svn mergeinfo improvements for 1.5"
Previous message: Paul Burba: "Re: RFC: svn mergeinfo improvements for 1.5"
Next in thread: David Glasser: "Re: Rep sharing and circular deltas"
Reply: David Glasser: "Re: Rep sharing and circular deltas"
Reply: Blair Zajac: "Re: Rep sharing and circular deltas"
Reply: Karl Fogel: "Re: Rep sharing and circular deltas"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]