On Tue, 2006-07-25 at 12:54 +0100, Malcolm Rowe wrote:
> It describes in brief how Mercurial handles revision storage through its
> revlog structure and delta storage. I know that dberlin has been looking
> at using something like the former to reduce the number of files we open
> during FSFS read operations; and as Mercurial uses an algorithm called
> 'bdiff'[1] (based on Python's difflib algorithm) rather than xdelta,
> it might also be interesting to explore what kind of performance/storage
> results we'd get by doing the same.
I was *almost* cited in that paper! I'm listed as "unknown," although
since the cited URL is in the Subversion project repository, it would
have only taken a small amount of work to identify me as the author.
(To be fair, much more work than is required to identify the author of
the usual scholarly paper.)
When I orginally wrote FSFS, I concentrated on deployability in as many
environments as possible. By having one new file per revision, for
instance, it is possible to make the revisions directory insert-only in
some filesystems, which makes it impossible to tamper with old data.
For best performance, we want a substantially different approach.
Certainly, the data for a given file should be located in one contiguous
area, either in revlog-style or a weave. It should also be possible to
compress directories without seriously compromising performance, using a
btree with multiple roots. (Currently, in both back ends, we store every
revision of every directory separately and in full, so retrieval is fast
but space performance for large directories is abysmal.)
After reading the paper, I'm a little confused about how revlog works.
I believe it is using storing a flat (self-compressed) revision every
time the amount of patch data exceeds a certain multiple of the
compressed size of the last checkpoint revision. That's an interesting
take on checkpoint revisions. I originally discarded checkpoint
revisions because if you're naively storing, say, 10% of your revisions
flat, then your space cost is at least 10% of <total size of all
revisions>, which isn't a good order characteristic. Skip-deltas was
supposed to achieve similar access benefits (O(lg(n)) tends to behave
like O(1) in practice) while still having better space characteristics.
Revlog's space-sensitive checkpointing may achieve similar space
efficiency without requiring as much seeking around.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Jul 25 18:17:36 2006