Re: Changesets vs Wet Blankets

From: Tom Lord <lord_at_emf.net>
Date: 2003-04-15 20:12:47 CEST

> From: Greg Hudson <ghudson@MIT.EDU>
> On Mon, 2003-04-14 at 18:11, Tom Lord wrote:
> > I pointed out that it is not unreasonable to use project trees as the
> > granularity of atomic commits (e.g., for locking), and that it might
> > even be reasonable to tweak the UI semantics by asserting that
> > atomic commits are not guaranteed except _within_ individual project
> > trees (a tweak that would further simplify implementation)
>
> If commits are only atomic within a project, and each project has its
> own isolated changeset journal, and its own cache of skip-deltas, how
> does that differ from having each project in its own repository?

In some practical ways it does differ, and in some important semantic
ways it does not.

As a practical matter: it's one repo to admin instead of N, and one
repo for users to point their clients at instead of N.

As a semantic matter: Right! The repository boundaries don't matter
so much when you don't have global serialization. And that's "1/4" of
what you need, semantically, to support seemless, distributed
branching. It gets you closer to the state of being able to say: "As
far as the client is concerned, it makes no difference at all whether
it's talking to 1 repository or 10 separately administered,
non-communicating repositories." The data model is naturally
parallel and That's a Good Thing.

> > If every commit to a journal has
> > to share a single lock just to begin to sort out conflicting
> > concurrent commits, that's an added implementation complexity.

> That's not added implementation complexity, because whether or
> not we have project trees, we have to consider the correctness
> of conflicting commits within a project. It doesn't help
> implementation complexity to divide the problem into simple
> cases and hard cases; you still have to solve the hard cases.

I have a two part answer for you -- one for each of the questions: Do
we really have to solve those hard problems if there's project tree
granularity? Assuming we do have to solve them, are they easier to
solve given project tree granularity? So:

Do we really have to solve the hard problem of detecting conflicting
concurrent commits within a given project tree development line? Not
obviously. If I have a repo with D development lines and an overall
commit rate C, then presumably my commit rate to each development line
is going to give some interesting distribution around C/D --- project
trees lower the probability of conflicting concurrent commits.
Suppose, then, that we impose a rule that commits can only take place
when the project tree wc is up-to-date (rather than just the files
being committed): first, that's not such a bad policy anyway, but
second, it's a cheap policy to implement.

But let's assume that answer doesn't fly. Assuming we do have to
solve the problem of sorting out conflicting from non-conflicting
roughly concurrent commits to a project tree, will that be easier to
do given project tree granularity? Well, yes it will -- simply
because the pre-sorting of journaled changes by project tree, and the
tractable size of project trees, mean that I can more often use "brute
force" techniques and still get decent performance. In this case, the
client has prepared a changeset that touches some subset of the
project tree. Between the client's last `update' and the time the txn
is to be processed, other changesets in the same line may have
appeared. But because of our project-tree orientation, those
intervening changesets are all in a nice simple list, and I can just
scan them quickly to get the set of changed objects within the project
tree, check for an intersection with the proposed commit, and proceed
accordingly. I can even reasonably do that set computation client
side, relying on the server only to be able to tell me what files/dirs
have changed in a given project-tree changeset.

>> Beyond locking, having the journal pre-sorted by project tree
>> development line means that for any query confined to a single
>> project tree, I can find all of the relevent journal entries in an
>> O(1) operation (just "look at the right part of the journal") rather
>> than having to search the journal or rely on an external index of
>> it.

> So, right now you can "svn log" a directory to find the commits
> which affect that directory or any file within it. Yes, we had
> to implement that, but we had to do that anyway, because you
> need to be able to "svn log" a file, with or without project
> directories.

Svn log can tell you which commits effected the tree -- but to
actually gather up the changes associated with those commits you have
to do that pruned-tree-walk, issuing separate queries for each changed
node in each commit. I suppose you could add yet another table.

> > An example of such a query is a client asking: "is my wc of
> > this project tree up to date?" -- that translates into "give me the
> > list of journal entries for this project tree".

> Again, we have to implement "is this file up to date?", and answering
> that question for a directory isn't any harder.

Ok, but now contrast gathering up the actual changes making the tree
out of date.

> > The skip delta cache can be a nicely orthogonal
> > component with a procedural API and an implementation whose storage
> > management issues are separate from everything else and nicely
> > modularized.

> We've talked about moving various things out of the database, such as
> the strings table, but we've never done it because having everything
> stored the same way simplifies backups. (The skip-delta cache may be
> "just a cache" logically, but if it takes several days to rebuild it,
> you have to back it up along with the changeset log.)

I don't know why you'd want to backup or rebuild a cache -- you'd just
populate it on-demand. Acesss patterns to historic revisions is going
to vary wildly over time. There's no good reason to maintain and
back-up a skip-delta record that optimizes all access equally. Just
the opposite: space-bound the skip-delta cache and let it optimize in
response to use.

> > Bringing project trees back in: they simplify things by allowing you
> > to compute skip deltas for project trees, rather than individual
> > nodes. If you index a skip-delta cache by project tree, you'll have
> > N compartments of deltas; if you index by node or node/copy_id,
> > you'll have N * K where, typically, K is in O([100...5000]).

> You'll only have 100 to 5000 times as many compartments if each
> changeset affects 100 to 5000 nodes.

You're counting skip-delta datums, I'm counting skip-delta indexes.

> > If a query needs deltas that pertain to a particular file within a
> > project tree, for example, a trivial linear search of some
> > delta-chain for that project tree is a practical solution -- given
> > first-class project trees.

> I don't think that's really practical for a project like gcc. You'd be
> greatly expanding the amount of data you have to look at to get at a
> particular file at some revision. Having small compartments does have
> constant-factor penalties, but having large compartments is worse
> because it changes the performance curve.

GCC, a very busy project with O(100) authorized committers, gets less
than 70 commits a day during the busy periods (quite a bit less --
that's a very conservative estimate with a safety margin thrown in).
Nearly all of those commits modify just a very few files. So if each
changeset contains a list of things modified, those lists are going to
tend to be very short. So if you scan those indexes for a given day,
that's just a few hundred lines of data. Over a third of a year?
O(10,000) lines of data. So to answer a query such as we're talking
about for a 100 day period of time, it's comperable to grepping a file
or files adding up to O(10,000) lines. Brute force will do the job
quite nicely.

> Also, if you did project tree skip-deltas, then for reasonable
> efficiency, checking out a project tree would no longer be a
> file-by-file operation, which would seem to complicate restartable
> checkouts.

Not really but, more to the point, simple client-side caching
mechanisms would solve the same problem as restartable checkouts very
simply. Project tree changesets simplify the implementation of such
caches by providing a convenient and small set of keys to index the
cache by.

> And checking out a subdirectory of a project efficiently
> sounds tricky.

Perhaps. Supporting that certainly adds code to the system and I'm
not too sure about the cost/benefit trade-off there. A few times a
year I find myself wanting to check out just the doc/ directory from
somebodies CVS repo -- but in day to day development, project tree
checkouts dominate by far.

> > someone or other said (and I'm paraphrasing
> > here, but just barely) "svn is _useless_ [for revision control]
> > without that additional structure".

> But at least one other person said they don't use that structure. And
> there are people who want to use svn as a backing store for DAV
> filesystems using auto-versioning, a use case which has no need for the
> trunk/branches/tags structure.

Which is another example of why I think reconceptualizing the product
as a txnal fs db and a layer over that to do revision control makes sense.

> > Beyond that -- while you may choose not to see svn's txnal fs
> > scaling that way, I think you have to anticipate txnal fs' coming
> > along that _do_ scale that way in the not too distant future.

> Speaking as someone who has to make loads of free software work with a
> network filesystem, I can tell you authoritatively that the world does
> not give a damn about network filesystems, by and large.

While a txnal fs db may be presented as a network filesystem, that's
hardly it's most interesting property. Rather, I think it's
interesting as a data store with ACID txns and a data model that
subsumes RDBMSs and OODBs into the more flexible and easier to work
with model of a unix-ish fs.

> > In reply (not just from Brane), I was told that I was "backsliding"
> > (?!?) and accused of trying to (paraphrased) "Go over the heads of
> > developers."

> I didn't mean it as an accusation. (I've never quite understood why
> "going over someone's head" is supposed to be a bad thing.) It just
> didn't seem like it was going to work.

Ok. "Going over someone's head" is an invocation of authority, hence
force. It is an attempt to override someone's volition. It's an
I-win-you-lose scenario.

Authority exists, and there's no easy getting around that. But I like
to conceive of many forms of civil authority as the back-up system --
the fall-back device. In the ordinary course of events, authority
should be mostly ceremonial, or at the level of who gets to make the
final guess when reason fails the group -- but otherwise irrelevant.
Layered over authority, there's volition and cooperation and the
development of common interests -- and those factors should dominate.

I said "CollabNet" and you heard "The Man" -- and I think that's
unsurprising but unfortunate.

> > To be slightly more explicit: on the one hand, I have a bunch of
> > technology in arch that, as is starting to become clear in various
> > branches of this discussion, is applicable to svn.

> From my corner, that only seems to be becoming clearer to you.

Well, to site a small example: I thought you did cop to getting some
benefit from my original talking about app-level journalling instead
of BDB-logs -- and that's lifted right out of arch and passed through
some filters to express it as it applies with fewest contingencies to
the svn world.

> (Not that you have to convince me. I don't work for CollabNet
> and I'm not in a position to increase your income.)

Perhaps not, but you do seem to have a grip on the design space for
storge managers and are interesting to chat with.

-t

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Apr 15 20:02:49 2003

This message: [ Message body ]
Next message: Greg Hudson: "Re: ra_svn failure over ssh on release branch."
Previous message: Philip Martin: "Re: ra_svn failure over ssh on release branch."
In reply to: Greg Hudson: "Re: Changesets vs Wet Blankets"
Next in thread: Greg Hudson: "Re: Changesets vs Wet Blankets"
Reply: Greg Hudson: "Re: Changesets vs Wet Blankets"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]