I've been poking around on the fs-rep-sharing branch lately to try and

determine how we can avoid circular deltas when reusing representations.

For those unfamiliar with the problem, here's a brief synopsis.

(Apologies if this sounds like a core dump from my brain...)

The purpose of fs-rep-sharing is to eliminate storing duplicate

representations in the fs. Instead of just writing a representation to

the database, we first check a cache of checksum->representation-key

mappings, and see if we can reuse an existing representation. This is

currently implemented for Berkeley DB.

Because the representations themselves are mutable in BDB (i.e., the

same representation can be full-text, and later rewritten as a delta

against another full text), a problem arises where we can get circular

delta references. Consider the following:

+-----+

| |

| |

| |

+-----+

A_1

A_1 is the first representation of the object, so we store the full text.

+-----+ +-----+

| | delta | |

| | ----> | |

| | | |

+-----+ +-----+

A_1 A_2

A_2 is a subsequent representation of the object, so we store its full

text, and store A_1 as a delta against it. Now, imagine we revert back

to the previous text in a later revision. Without rep-sharing we get:

+-----+ +-----+ +-----+

| | delta | | delta | |

| | ----> | | ----> | |

| | | | | |

+-----+ +-----+ +-----+

A_1 A_2 A_3

A_3 is now stored full text and A_2 is a delta against A_3, with A_1

being unchanged. However, with rep-sharing, we get:

+-------------------------+

v |

+-----+ +-----+ |

| | delta | | delta +-+

| | ----> | | ----> |o|

| | | | +-+

+-----+ +-----+

A_1 A_2 A_3

Where both A_1 and A_2 remain unchanged, and A_3, instead of being a new

representation, just refers to the previously written identical

representation, A_1. This turns out to be a problem, because A_3 is

really just A_1, which is a delta against A_2, which is a delta against

A_1, which is a delta against A_2, ad nauseam.

See the problem?

The representations don't even have to be this close to each other. In

other words, as long as two representations in the same "delta chain" as

the same, we've got the problem:

+-------------------------------------------------+

v |

+-----+ +-----+ +-----+ |

| | delta | | delta delta | | delta +-+

| | ----> | | ----> ... ----> | | ----> |o|

| | | | | | +-+

+-----+ +-----+ +-----+

A_1 A_2 A_n-1 A_n

The solution, would be to *not* rewrite A_n-1 in terms of A_n, but just

create the link to A_1:

+---------------------------------------------+

v |

+-----+ +-----+ +-----+ |

| | delta | | delta delta | | +-+

| | ----> | | ----> ... ----> | | |o|

| | | | | | +-+

+-----+ +-----+ +-----+

A_1 A_2 A_n-1 A_n

This breaks the "delta chain", with A_n-1 remaining a full text

representation, against which the relevant deltas could apply. An

interesting side effect is what happens on the subsequent commit:

+---------------------------------------------+

v |

+-----+ +-----+ +-----+ | +-----+

| | delta | | delta delta | | +-+ delta | |

| | ----> | | ----> ... ----> | | |o| ----> | |

| | | | | | +-+ | |

+-----+ +-----+ +-----+ +-----+

A_1 A_2 A_n-1 A_n A_n+1

We represent A_n+1 as full text, and rewrite A_n as a delta against

that. But, because A_n is the same representation as A_1, we really get:

+---------------------------------------------+

v |

+-----+ +-----+ +-----+ | +-----+

| | | | delta delta | | +-+ | |

| |--+ | | ----> ... ----> | | |o| +-> | |

| | | | | | | +-+ | | |

+-----+ | +-----+ +-----+ | +-----+

A_1 | A_2 A_n-1 A_n | A_n+1

| delta |

+--------------------------------------------+

Which works out okay: A_1 still represents the same content, it is just

being expressed differently. In a complex repository, where there are

multiple links to the same representation, that representation would get

rewritten several times, but the content should *never* change.

So far, the solution looks tenable. The real problem is determining

when to not rewrite the previous representation as a delta against a

linked representation. In other words, we have to answer the question:

"Is rep A_1 somehow dependent upon A_n-1?" I don't know how to do this

in constant time, and I'm leery of running forward through the delta

chain to find the full text every time we find a duplicate

representation. Are there other alternatives?

(For the curious, we don't have this problem in FSFS. Because committed

representations are immutable in FSFS, the representation we are

referring to always has an appropriate full text for the delta, and we

know it won't change. The real barrier on FSFS is creating an efficient

on-disk cache for checksum->representation mappings. Hmmmm, SQLite,

anybody?)

-Hyrum

Received on 2008-03-27 16:45:42 CET