[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Reverse-delta storage

From: Jim Blandy <jimb_at_zwingli.cygnus.com>
Date: 2001-04-19 20:52:40 CEST

Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu> writes:
> I'm a bit uneasy about one thing: in the "strings" table, there's no way
> to distinguist between a PROPLIST and file contents. This means we can't
> (consistently) check that a STRING-KEY refers to the right kind of data.
> Do we want to have such sanity checks? If yes, we might have two string
> tables instead, one for PROPLISTs and one for CONTENTS.

Well, keep in mind that you can't check the form of a stored proplist
until after you've done all your delta reconstruction. The stuff that
goes in the tables is usually in svndelta form, meaningless by
itself. So the arrangement we have now --- where *-table.c modules
check intra-record consistency --- is hard to apply here.

The closest we could come would be to have two separate tables,
`fulltexts' and `deltas'. We'd apply no consistency checks to
`fulltexts'. We'd verify that every value stored in `deltas' was in
valid svndiff form.

But checks for well-formed property lists and directory contents need
to be done at a higher level. It's 100% cool with me to introduce new
modules corresponding to new degrees of consistency we can ensure,
under the new structure.

One possibility would be to introduce a `string manager' module, whose
job is simply to store texts under keys --- just like a DB hash table.
Internally, it would use tricks like deltification to store them
densely. The string manager's interface would include functions to
hint about which keys probably have similar texts, which keys' texts
were likely to change soon, etc. But the data provided to those
functions would affect only the performance of the string manager, not
its correctness. This would use SHA or MD5 or whatever as a checksum
to ensure that the texts were being reconstructed correctly.

The code managing property lists, file contents, and directory
contents would simply use the string manager like an ordinary DB hash
table. Checking the format of property lists and directory contents
would be done at the higher level.

> As I mentioned before, I've been searching for articles about delta
> composition. It turns out I was a bit too optimistic about the
> results. It looks like we'll have to invent the wheel again, or at
> least spokes ... So, again, if anybody has any ideas, please yell.

I talked about this with Karl on the phone, and gave pretty bad
explanations of my thinking; I'll try to do better today. This will
also provide some background for other readers.

I'm told that RCS reconstructs older file revisions by walking the
list of diffs, ignoring the actual text, and simply recording the line
numbers and sizes of the substitutions. Then, it can run over that
data and do arithmetic to construct a single `composed' diff against
the youngest revision would reconstructs the older revision. Then,
you apply that composed diff to get the revision you wanted.

For example, if your fulltext is:

rev 3: abcdefghijklmnopqrst

and your deltas are:

rev 2: replace 3--6 with "howdy" (yielding abchowdyghijklmnopqrst)
rev 1: replace 6--8 with " are you" (yielding abchow are youghijklmnopqrst)

then the RCS algorithm would gather this info:

rev 2: replace 3--6 with 5 chars (call them X)
rev 1: replace 6--8 with 8 chars (call them Y)

Now, without looking at any of the text, it can compose the two deltas
to get the following delta, yielding rev 1:

        replace 3--6 with range 0--3 from X
        replace 6--6 with range 0--8 from Y (i.e. insert Y at 6)

If we then apply this `composed' delta to the original text, we get:

        abchow are youghijklmnopqrst

The point of all this is that you don't need to mess around with
actual text until the very end. Until that point, the amount of work
depends on the amount of change, not on the amount of text involved.
And when you do get around to actually assembling text, the amount of
work depends on the size of the output file --- because you're only
touching each piece of text once --- and only weakly on the amount of
change.

Our current svndiff format frustrates this somewhat by compressing the
new text, as well as pulling in pieces of the original. The
compression process produces a lot of delta ops that deal with small
pieces of text. (Or at least, I expect it does...) So even if the
change is something simple --- replacing a single block of text in the
middle of the file, say --- you end up with an svndiff with a lot of
ops, mostly concerned with building the replacement text from pieces
of itself. This is great, except that having lots of small ops
increases the amount of work the delta composition phase needs to do.
In fact, if the ops usually deal with really small pieces of text ---
a few dozen bytes or so --- I expect it'd be faster to just throw the
actual text around. Memcpy is pretty optimized on real systems; you
could copy a lot of bytes in the time it would take to do funky
intersections and adjustments on a list of ops talking about those
bytes, so those ops had better refer to large blocks of bytes.

I'm not sure what to do with that. It almost seems like we want the
text delta computation algorithm to optimize deltas for network
transmission and deltas for storage differently.
Received on Sat Oct 21 14:36:29 2006

This is an archived mail posted to the Subversion Dev mailing list.