[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Rep sharing and circular deltas

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: Thu, 27 Mar 2008 15:26:47 -0500

Karl Fogel wrote:
> "Hyrum K. Wright" <hyrum_wright_at_mail.utexas.edu> writes:
>> I've been poking around on the fs-rep-sharing branch lately to try and
>> determine how we can avoid circular deltas when reusing
>> representations. For those unfamiliar with the problem, here's a brief
>> synopsis. (Apologies if this sounds like a core dump from my brain...)
>>
>> The purpose of fs-rep-sharing is to eliminate storing duplicate
>> representations in the fs. Instead of just writing a representation
>> to the database, we first check a cache of
>> checksum->representation-key mappings, and see if we can reuse an
>> existing representation. This is currently implemented for Berkeley
>> DB.
>>
>> Because the representations themselves are mutable in BDB (i.e., the
>> same representation can be full-text, and later rewritten as a delta
>> against another full text), a problem arises where we can get circular
>> delta references. Consider the following:
>
> Sanity check: are you using the word "representation" for what the
> repository (at least BDB) code calls "strings"? Because there's
> something else in there called "representations", but they're not
> deltified -- they're just a level of indirection between the filesystem
> and the file contents.
>
> I'll assume you're talking about "strings" from here on.

No, I'm pretty sure I was talking about "representations". :) Strings
are just blobs of bits, which have no knowledge of their relationships
to other strings, their checksums, deltas, etc. Representations
reference strings, but include other information so that a full text of
the representation can be reconstructed. Currently, we're trying to
allow multiple node-revisions to refer to the same representation, not
multiple representations to refer to the same string.

I think the rest of your analysis holds, though.

>> +-----+
>> | |
>> | |
>> | |
>> +-----+
>> A_1
>>
>> A_1 is the first representation of the object, so we store the full text.
>>
>> +-----+ +-----+
>> | | delta | |
>> | | ----> | |
>> | | | |
>> +-----+ +-----+
>> A_1 A_2
>>
>> A_2 is a subsequent representation of the object, so we store its full
>> text, and store A_1 as a delta against it. Now, imagine we revert back
>> to the previous text in a later revision. Without rep-sharing we get:
>>
>> +-----+ +-----+ +-----+
>> | | delta | | delta | |
>> | | ----> | | ----> | |
>> | | | | | |
>> +-----+ +-----+ +-----+
>> A_1 A_2 A_3
>>
>> A_3 is now stored full text and A_2 is a delta against A_3, with A_1
>> being unchanged. However, with rep-sharing, we get:
>>
>> +-------------------------+
>> v |
>> +-----+ +-----+ |
>> | | delta | | delta +-+
>> | | ----> | | ----> |o|
>> | | | | +-+
>> +-----+ +-----+
>> A_1 A_2 A_3
>>
>> Where both A_1 and A_2 remain unchanged, and A_3, instead of being a new
>> representation, just refers to the previously written identical
>> representation, A_1. This turns out to be a problem, because A_3 is
>> really just A_1, which is a delta against A_2, which is a delta against
>> A_1, which is a delta against A_2, ad nauseam.
>>
>> See the problem?
>
> Well, no, wait. Under a sharing scheme, if when you add a string you
> discover it has the same (fulltext) checksum as some existing string,
> then no new deltification happens -- instead, the new string is just a
> reference to the existing string. No cycle danger here.
>
> In other words, both A1 and A3 would be deltas against A2.
>
> I didn't read the rest of the mail closely, because it seemed to be
> solving a problem whose existence I'm denying here :-).

That makes sense. It seems to be an issue with the current
implementation where deltification of the previous representation
happens before we know if there is an existing representation for the
content we are interested in. We'd need to we order those steps, so
that the first can be conditional on the second.

Also, now that I think about it, does this break the "most recent
node-revision is full text" characteristic of the BDB backend? Does it
matter?

-Hyrum

Received on 2008-03-27 21:27:09 CET

This is an archived mail posted to the Subversion Dev mailing list.