[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: A forward-thinking thought about deltas in FS design

From: Tom Lord <lord_at_emf.net>
Date: 2003-07-06 18:18:22 CEST

> From: Greg Hudson <ghudson@MIT.EDU>
> > Once upon a time, back in the mid-80s I worked on a commercial VC tool
> > called SVM (Seidl Version Manager). Where were one of the few VC tools
> > for the PC (other main ones being PVCS, TLIB, and a relative newcomer
> > named MKS ;-). Ours was the only one of that bunch that used
> > forward-deltas. We overcame the performance issue by keeping a cache
> > of the most recent N versions in a cache area (where N could be
> > configured per system, as could the set of files for which to keep the
> > cache - if I recall correctly anyway :)

> Caching fails to ensure guaranteed acceptable performance, which I think
> is an important property of an FS back end.

_That_ kind of caching fails to ensure the performance guarantee you
are worried about, but related techniques do not.

Have you considered changing the top row of your diagram:

  0 <- 1 2 <- 3 4 <- 5 6 <- 7 8 <- 9 10 <- 11 12 <- 13 14 <- 15
  0 <---- 2 4 <---- 6 8 <----- 10 12 <----- 14
  0 <------------ 4 8 <--------------- 12
  0 <---------------------------- 8

so that it is:

  0 <- 1 <- 2 <- 3 <- 4 <- 5 <- 6 <- 7 <- 8 <- 9 <- 10 <- 11 <- 12 ....

The space requirement for that row would be comperable to, but most
likely smaller than the sum of the sizes of the arrows in your
diagram. (Because a skip-delta will tend to be larger than the first
single-step arrow it summarizes. Indeed, "deep" skip-deltas (0<--8
being the deepest in your diagram), on busy files, will tend to be as
large as a full-text copy! A lot of "shallow" skip-deltas will tend
to be twice the size of the next-shallowest (logical) delta for the
same revision.)

So if you cached the other arrows in your diagram, and the cache is
given roughly the same amount of disk space as is needed for all the
arrows in your diagram, then the cache will function as a memo, and
you'll have the same performance guarantees that you have now.
(And admins could make a fine grain space/time trade-off from there.)

So, naively, it sounds like (worst case) O(2x) the total disk space
(compared to your approach), a best case (1x) the space, with
sufficient caching/memoizing to yield the same performance. But I
wonder if other factors and benefits wouldn't (a) reduce the space in
the direction of the "best case" scenario; (b) actually yield _better_
performance; (c) make the caching approach winning anyway.

For example, as with your approach, this could easily (dramatically)
reduce the growth rate of write-ahead logs (which also contributes to
write-txn throughput). The caching/memoizing approach would make it
easier, I think, to be computing skip-deltas concurrently and
opportunistically. Read-only mirrors and clients could be building
their own caches. Given the kinds of access needed to "row 0",
perhaps it could be stored more compactly and updated more quickly. A
cache could easily revert to full-texts when skip-deltas become
irrational. In a long-lived, busy archive, I'm not so sure the
performance guarantee you are after is really worth more than a
guarantee of _better_ performance where the access patterns are hot --
and the caching approach can make that trade-off smoothly (if it takes
me 3x as long to retrieve a 7-year old kernel that nobody else has
fetched for years, but 1/2 as long to get last months ....). The
simpler row-0 representation, being the only txnal data needed, can
help stabilize the implementation of the storage manager and simplify
its design. The cache implementation could be made independent of the
primary storage manager internals -- making it easier to write new
storage manager implementations (and new caching-engine
implementations).

In short, when you're looking at Big-O space and time complexities, it
comes out the same whether the skip-deltas are part of the primary
data store, or part of external caches and memos. It's all the
secondary effects of the external approach that can make it the more
winning technique.

-t

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Jul 6 18:16:43 2003

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.