> From: Greg Hudson <ghudson@MIT.EDU>
> On Mon, 2003-04-14 at 18:11, Tom Lord wrote:
> > I pointed out that it is not unreasonable to use project trees as the
> > granularity of atomic commits (e.g., for locking), and that it might
> > even be reasonable to tweak the UI semantics by asserting that
> > atomic commits are not guaranteed except _within_ individual project
> > trees (a tweak that would further simplify implementation)
> If commits are only atomic within a project, and each project has its
> own isolated changeset journal, and its own cache of skip-deltas, how
> does that differ from having each project in its own repository?
In some practical ways it does differ, and in some important semantic
ways it does not.
As a practical matter: it's one repo to admin instead of N, and one
repo for users to point their clients at instead of N.
As a semantic matter: Right! The repository boundaries don't matter
so much when you don't have global serialization. And that's "1/4" of
what you need, semantically, to support seamless, distributed
branching. It gets you closer to the state of being able to say: "As
far as the client is concerned, it makes no difference at all whether
it's talking to 1 repository or 10 separately administered,
non-communicating repositories." The data model is naturally
parallel and That's a Good Thing.
> > If every commit to a journal has
> > to share a single lock just to begin to sort out conflicting
> > concurrent commits, that's an added implementation complexity.
> That's not added implementation complexity, because whether or
> not we have project trees, we have to consider the correctness
> of conflicting commits within a project. It doesn't help
> implementation complexity to divide the problem into simple
> cases and hard cases; you still have to solve the hard cases.
I have a two part answer for you -- one for each of the questions: Do
we really have to solve those hard problems if there's project tree
granularity? Assuming we do have to solve them, are they easier to
solve given project tree granularity? So:
Do we really have to solve the hard problem of detecting conflicting
concurrent commits within a given project tree development line? Not
obviously. If I have a repo with D development lines and an overall
commit rate C, then presumably my commit rate to each development line
is going to give some interesting distribution around C/D --- project
trees lower the probability of conflicting concurrent commits.
Suppose, then, that we impose a rule that commits can only take place
when the project tree wc is up-to-date (rather than just the files
being committed): first, that's not such a bad policy anyway, but
second, it's a cheap policy to implement.
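The up-to-date-wc rule really is cheap: it reduces to one integer comparison against the journal head. Here is a minimal sketch, where `Journal` and `try_commit` are illustrative names I'm inventing for the purpose, not actual svn or arch APIs:

```python
# Hypothetical sketch of the "commit only from an up-to-date project
# tree" rule; not svn's real implementation.

class Journal:
    """Per-project-tree, append-only changeset journal."""
    def __init__(self):
        self.entries = []            # changesets, in commit order

    def head(self):
        return len(self.entries)     # current revision number

def try_commit(journal, wc_base_revision, changeset):
    """Accept the commit only if the whole project tree wc is current.

    Cheap to implement: one integer comparison against the journal
    head, with no per-file conflict analysis at all."""
    if wc_base_revision != journal.head():
        return False                 # stale wc: client must update first
    journal.entries.append(changeset)
    return True
```

A second committer whose wc is based on an older revision is simply refused and told to update, which is how the hard concurrent-conflict case gets avoided entirely.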
But let's assume that answer doesn't fly. Assuming we do have to
solve the problem of sorting out conflicting from non-conflicting
roughly concurrent commits to a project tree, will that be easier to
do given project tree granularity? Well, yes it will -- simply
because the pre-sorting of journaled changes by project tree, and the
tractable size of project trees, mean that I can more often use "brute
force" techniques and still get decent performance. In this case, the
client has prepared a changeset that touches some subset of the
project tree. Between the client's last `update' and the time the txn
is to be processed, other changesets in the same line may have
appeared. But because of our project-tree orientation, those
intervening changesets are all in a nice simple list, and I can just
scan them quickly to get the set of changed objects within the project
tree, check for an intersection with the proposed commit, and proceed
accordingly. I can even reasonably do that set computation client
side, relying on the server only to be able to tell me what files/dirs
have changed in a given project-tree changeset.
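The brute-force scan described above can be sketched in a few lines. The data shapes (a list of changed-path sets per intervening changeset) are an assumption for illustration, not svn's real schema:

```python
# Brute-force conflict check against the project tree's journal:
# union the paths touched since the client's last update, then
# intersect with the proposed commit.

def conflicts(journal, wc_base_revision, proposed_paths):
    """Return the paths touched both by the proposed commit and by
    changesets that landed since the client's last update."""
    touched = set()
    for entry in journal[wc_base_revision:]:   # a simple list scan
        touched |= entry
    return touched & set(proposed_paths)

# Two changesets landed after the client updated at revision 1:
journal = [{"a.c"}, {"b.c", "doc/x.txt"}, {"Makefile"}]
```

Here `conflicts(journal, 1, ["b.c", "new.c"])` yields `{"b.c"}` -- and as noted, nothing in this computation needs to live on the server.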
>> Beyond locking, having the journal pre-sorted by project tree
>> development line means that for any query confined to a single
> project tree, I can find all of the relevant journal entries in an
>> O(1) operation (just "look at the right part of the journal") rather
>> than having to search the journal or rely on an external index of
> So, right now you can "svn log" a directory to find the commits
> which affect that directory or any file within it. Yes, we had
> to implement that, but we had to do that anyway, because you
> need to be able to "svn log" a file, with or without project
Svn log can tell you which commits affected the tree -- but to
actually gather up the changes associated with those commits you have
to do that pruned-tree-walk, issuing separate queries for each changed
node in each commit. I suppose you could add yet another table.
> > An example of such a query is a client asking: "is my wc of
> > this project tree up to date?" -- that translates into "give me the
> > list of journal entries for this project tree".
> Again, we have to implement "is this file up to date?", and answering
> that question for a directory isn't any harder.
Ok, but now contrast gathering up the actual changes making the tree
out of date.
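To make the contrast concrete, here is a toy sketch (the `journals` dict is a made-up example, not svn's schema): with per-project-tree journals, both "is my wc up to date?" and "what am I missing?" are a dictionary lookup plus a list slice, with no pruned tree walk and no per-node queries.

```python
# Per-project-tree journals, pre-sorted by development line.
journals = {
    "gcc/trunk": ["cs1", "cs2", "cs3"],
    "gcc/3_3":   ["cs1b"],
}

def missing_entries(tree, wc_base_revision):
    """Everything the wc lacks, already gathered and in order."""
    return journals[tree][wc_base_revision:]

def up_to_date(tree, wc_base_revision):
    return not missing_entries(tree, wc_base_revision)
```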
> > The skip delta cache can be a nicely orthogonal
> > component with a procedural API and an implementation whose storage
> > management issues are separate from everything else and nicely
> > modularized.
> We've talked about moving various things out of the database, such as
> the strings table, but we've never done it because having everything
> stored the same way simplifies backups. (The skip-delta cache may be
> "just a cache" logically, but if it takes several days to rebuild it,
> you have to back it up along with the changeset log.)
I don't know why you'd want to backup or rebuild a cache -- you'd just
populate it on-demand. Access patterns to historic revisions are going
to vary wildly over time. There's no good reason to maintain and
back-up a skip-delta record that optimizes all access equally. Just
the opposite: space-bound the skip-delta cache and let it optimize in
response to use.
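A minimal sketch of that view: skip-deltas computed on demand and held in a space-bounded LRU structure, so the cache tunes itself to actual access patterns and never needs backup. The names here are illustrative, not a real svn interface:

```python
from collections import OrderedDict

class SkipDeltaCache:
    """Space-bounded, demand-populated cache of skip-deltas."""
    def __init__(self, max_entries, compute_delta):
        self.max_entries = max_entries
        self.compute_delta = compute_delta   # the expensive rebuild
        self.cache = OrderedDict()

    def get(self, tree, revision):
        key = (tree, revision)
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        delta = self.compute_delta(tree, revision)
        self.cache[key] = delta
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)   # evict least recently used
        return delta
```

Losing this structure costs you nothing but warm-up time: the hot revisions get recomputed on first access and everything cold simply stays cold.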
> > Bringing project trees back in: they simplify things by allowing you
> > to compute skip deltas for project trees, rather than individual
> > nodes. If you index a skip-delta cache by project tree, you'll have
> > N compartments of deltas; if you index by node or node/copy_id,
> > you'll have N * K where, typically, K is in O([100...5000]).
> You'll only have 100 to 5000 times as many compartments if each
> changeset affects 100 to 5000 nodes.
You're counting skip-delta datums, I'm counting skip-delta indexes.
> > If a query needs deltas that pertain to a particular file within a
> > project tree, for example, a trivial linear search of some
> > delta-chain for that project tree is a practical solution -- given
> > first-class project trees.
> I don't think that's really practical for a project like gcc. You'd be
> greatly expanding the amount of data you have to look at to get at a
> particular file at some revision. Having small compartments does have
> constant-factor penalties, but having large compartments is worse
> because it changes the performance curve.
GCC, a very busy project with O(100) authorized committers, gets less
than 70 commits a day during the busy periods (quite a bit less --
that's a very conservative estimate with a safety margin thrown in).
Nearly all of those commits modify just a very few files. So if each
changeset contains a list of things modified, those lists are going to
tend to be very short. So if you scan those indexes for a given day,
that's just a few hundred lines of data. Over a third of a year?
O(10,000) lines of data. So to answer a query such as we're talking
about for a 100 day period of time, it's comparable to grepping a file
or files adding up to O(10,000) lines. Brute force will do the job.
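The arithmetic checks out on the back of an envelope (these are the rough figures from the discussion above, not measurements):

```python
# Rough estimate of the index data a brute-force scan must touch.
commits_per_day  = 70     # conservative upper bound for GCC
lines_per_commit = 3      # "just a very few files" per changeset
days             = 100    # roughly a third of a year

index_lines = commits_per_day * lines_per_commit * days
print(index_lines)        # 21000: O(10,000) lines to scan
```

Tens of thousands of short lines is grep territory on any hardware, which is the whole point.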
> Also, if you did project tree skip-deltas, then for reasonable
> efficiency, checking out a project tree would no longer be a
> file-by-file operation, which would seem to complicate restartable
> checkouts.
Not really -- but, more to the point, simple client-side caching
mechanisms would solve the same problem as restartable checkouts very
simply. Project tree changesets simplify the implementation of such
caches by providing a convenient and small set of keys to index the cache.
> And checking out a subdirectory of a project efficiently
> sounds tricky.
Perhaps. Supporting that certainly adds code to the system and I'm
not too sure about the cost/benefit trade-off there. A few times a
year I find myself wanting to check out just the doc/ directory from
somebody's CVS repo -- but in day to day development, project tree
checkouts dominate by far.
> > someone or other said (and I'm paraphrasing
> > here, but just barely) "svn is _useless_ [for revision control]
> > without that additional structure".
> But at least one other person said they don't use that structure. And
> there are people who want to use svn as a backing store for DAV
> filesystems using auto-versioning, a use case which has no need for the
> trunk/branches/tags structure.
Which is another example of why I think reconceptualizing the product
as a txnal fs db and a layer over that to do revision control makes sense.
> > Beyond that -- while you may choose not to see svn's txnal fs
> > scaling that way, I think you have to anticipate txnal fs' coming
> > along that _do_ scale that way in the not too distant future.
> Speaking as someone who has to make loads of free software work with a
> network filesystem, I can tell you authoritatively that the world does
> not give a damn about network filesystems, by and large.
While a txnal fs db may be presented as a network filesystem, that's
hardly its most interesting property. Rather, I think it's
interesting as a data store with ACID txns and a data model that
subsumes RDBMSs and OODBs into the more flexible and easier to work
with model of a unix-ish fs.
> > In reply (not just from Brane), I was told that I was "backsliding"
> > (?!?) and accused of trying to (paraphrased) "Go over the heads of
> > developers."
> I didn't mean it as an accusation. (I've never quite understood why
> "going over someone's head" is supposed to be a bad thing.) It just
> didn't seem like it was going to work.
Ok. "Going over someone's head" is an invocation of authority, hence
force. It is an attempt to override someone's volition.
Authority exists, and there's no easy getting around that. But I like
to conceive of many forms of civil authority as the back-up system --
the fall-back device. In the ordinary course of events, authority
should be mostly ceremonial, or at the level of who gets to make the
final guess when reason fails the group -- but otherwise irrelevant.
Layered over authority, there's volition and cooperation and the
development of common interests -- and those factors should dominate.
I said "CollabNet" and you heard "The Man" -- and I think that's
unsurprising but unfortunate.
> > To be slightly more explicit: on the one hand, I have a bunch of
> > technology in arch that, as is starting to become clear in various
> > branches of this discussion, is applicable to svn.
> From my corner, that only seems to be becoming clearer to you.
Well, to cite a small example: I thought you did cop to getting some
benefit from my original talking about app-level journalling instead
of BDB-logs -- and that's lifted right out of arch and passed through
some filters to express it as it applies with fewest contingencies to
the svn world.
> (Not that you have to convince me. I don't work for CollabNet
> and I'm not in a position to increase your income.)
Perhaps not, but you do seem to have a grip on the design space for
storage managers and are interesting to chat with.
Received on Tue Apr 15 20:02:49 2003