Re: RFC: Delta indexing and composition

From: <kfogel_at_collab.net>
Date: 2001-07-30 18:25:09 CEST

Branko =?ISO-8859-2?Q?=C8ibej?= <brane@xbc.nu> writes:
> The scheme we have in place now works (kudos to Mike and Karl here!),
> but wastes both time and space, and is probably a mild disaster for
> non-sequential access to large files or old versions. The reasons for
> this are obvious:
>
> 1. It reads/resurrects the whole text from the beginning up to the
> end of the interesting region
> 2. It keeps applying all the deltas from the version to the fulltext
> every time an old version is accessed, throwing away the result.

Absolutely. The current under-the-hood reconstruction code sucks.

(But it's not quite as bad as described above, I believe: it only
reconstructs starting with a window known to be relevant -- windows
before and after that are ignored and never read, respectively,
right? Still, it performs like a jar of mud right now.)

> My suggested solution to these two problems is based on recognizing
> exactly what a delta window is:
>
> A delta window defines a contiguous region of text A. It may depend
> on a contiguous region of text B, but is independent of any other
> delta window in the delta representation of A.
>
> Given this definition, two things become obvious:
>
> 1. To reconstruct the region defined by window N, we only have to
> read that window.
> 2. Different windows within a delta may depend on different source texts.
>
> The first item implies that we can solve the first problem my indexing
> the windows within a delta, and accessing them directly. But we knew
> that already.
>
> The second item implies that every time we reconstruct a region of text,
> we can replace its defining delta window with a single diff from the
> fulltext, eliminating the intermediate reconstruction steps the next
> time this region is accessed -- thus solving the second problem.
>
>
> So, here's my proposal
>
> 1) Change the delta representation to index and store delta windows
> separately
>
> DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;
> WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ;
> OFFSET ::= number ;
> REP-OFFSET ::= number;
>
>
> The REP-KEY and REP-OFFSET in WINDOW are optional because, if the
> differences between two file revisions is large enough, the diff could
> in fact be larger than a compression-only vdelta of the text region. In
> that case it makes more sense to compress the window than to store a diff.
>
>
> 2) Change the undeltifier to use the new structure
>
> The undeltifier will stay essentially the same as it is now, except that
> it will use OFFSET and REP-OFFSET to access the necessary bits directly.
> The place where the delta combiner will fit stays the same, too.
>
> The major addition comes after the text is reconstructed. Using some
> suitable heuristic -- probably based on the number of jumps from the
> representation to the fulltext, the size of the diff from the fulltext,
> etc. -- we can decide to: a) replace the window with a single diff from
> the fulltext, b) replace it with a compressed version of the region, or
> c) do nothing.
>
>
>
> The disadvantage of this proposal is, of course, more space used in the
> repository. We can reduce the increase somewhat by compressing the
> window index, and possibly improving the svndiff encoding. But I think
> it's a fair price to pay, because this scheme reduces the number of disk
> accesses, total memory use and average processing time needed to
> reconstruct a region of text.
>
>
>
> That's it. Let's see you find holes in my proposal. :-)

I think it looks good. The increase in repository space used is
slight -- it's a small, constant per-window overhead, which isn't so
bad, even if strictly speaking it's proportional to the size of the
data.

I presume the OFFSET in "(OFFSET WINDOW)" is offset in to the
reconstructed fulltext, and that the series of OFFSETs seen in

DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ;

must be in increasing order, so that there is a reasonable efficient
way to track down the window(s) one needs for a given range? Might it
not be better to give the full range reconstructed by each window, as
in

DELTA ::= (("delta" FLAG ...) (OFFSET LEN WINDOW) ...) ;

or even

DELTA ::= (("delta" FLAG ...) (RANGE-START RANGE-END WINDOW) ...) ;

That way, you'd be able to find out the total size of a reconstructed
text without actually having to reconstruct it. Well, perhaps if
that's the problem, it would be simpler just to stick with your
original scheme and add one more piece of data:

DELTA ::= (("delta" FLAG ...) RECONSTRUCTED-SIZE (OFFSET WINDOW) ...) ;

Anyway:

The main concern is implementation schedule right now, of course.
Branko, it would be awesome if you could make this change, but would
you have time to do it before August 15th? It's okay if you don't;
just check in your design to the notes/ directory somewhere, and Mike
and/or I will get to it.

Semi-related issue:

The other thing I've been thinking about is planning ahead for storing
large files outside the repository db tables... or, perhaps, storing
all "real" data outside the tables? That is, perhaps we should
reserve db tables for metadata, and store file contents (and prop
lists? and dir entries lists?) out in the fs, using the usual sort of
directory/file hash scheme that one works up for such circumstances.

So instead of a `strings' table, there would be a new subdir in the
repository, "strings/" or whatever, and everything that now lives in
the `strings' table would live there instead. Or, alternatively, file
contents would live out there, but dir entries and prop lists might
still live in a `strings' table...

The motivation for this is the presumed greater efficiency of reaching
far into a file in the filesystem as opposed to a value in a Berkeley
DB table. Does someone here know these tradeoffs fairly well?

Note that this stuff doesn't necessarily need to be finalized before
M3. We can change the repository format even after we start
self-hosting, although it might be a mild pain to do so if it means
exporting and re-importing old data (or losing history) when we
upgrade. Anyway, the point is we shouldn't put off self-hosting even
if we think the storage layout might change.

Thoughts?,

-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:36:33 2006

This message: [ Message body ]
Next message: Kevin Pilch-Bisson: "Re: Java Library"
Previous message: kfogel_at_collab.net: "Re: Intro and questions"
Maybe in reply to: Branko Èibej: "RFC: Delta indexing and composition"
Next in thread: Paul D. Smith: "Re: RFC: Delta indexing and composition"
Reply: Paul D. Smith: "Re: RFC: Delta indexing and composition"
Reply: Branko Èibej: "Re: RFC: Delta indexing and composition"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]