Re: [PROPOSAL] Merging Improved

From: Tom Lord <lord_at_emf.net>
Date: 2003-04-14 07:36:07 CEST

Disordered replies to:

> From: Greg Hudson <ghudson@MIT.EDU>

> (Why would you need a separate list to assign the repository rev
> numbers? Presumably the list of changesets has an order, and
> that could correspond to the repository revisions.)

I'm thinking of an FS namespace containing N project trees, and then N
"lists of changesets" (aka write transaction journals) -- one per
project tree. Each list is totally ordered -- but without additional
structure, the set of changesets is not totally ordered.
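To make the ordering point concrete, here is a minimal sketch. Everything in it (the Repo class, the tree names, the global_order list) is made up for illustration and is not svn or arch code. Each project tree gets its own journal; the second list exists only to hand out repository-wide numbers.

```python
from collections import defaultdict

class Repo:
    """Hypothetical model: one changeset journal per project tree."""

    def __init__(self):
        self.journals = defaultdict(list)  # tree name -> ordered changesets
        self.global_order = []             # the extra list: (tree, patch_level)

    def commit(self, tree, changeset):
        journal = self.journals[tree]
        journal.append(changeset)
        patch_level = len(journal)         # per-tree revision, 1-based
        # Needed only to support a repository-wide revision number:
        self.global_order.append((tree, patch_level))
        repo_rev = len(self.global_order)
        return patch_level, repo_rev

repo = Repo()
a1 = repo.commit("svn/trunk", "fix typo")
b1 = repo.commit("arch/trunk", "add index")
a2 = repo.commit("svn/trunk", "new feature")
```

Without global_order, a1 and a2 are still ordered relative to each other, but nothing says whether b1 came before or after either of them.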

You need that additional structure only to support the semantic of a
repository revision number -- which is part of why I think the idea
of a repository revision number is a gaffe.

> [in several places: that helps throughput with multiple servers,
> but not otherwise. Multiple servers is a relatively
> unimportant case for us.]

A project tree partitioning provides a simple-to-implement, O(1) way
to separate concurrent commits to, for example, different branches or
different projects within the same repo. As such, it gives an easy
mechanism to achieve reasonable throughput in even a single-server
native-fs implementation. The thing you lose in a native-fs
implementation compared to BDB or an RDBMS is the deadlock detection
built into those db mechanisms -- project-tree transaction
granularity means that you don't need that mechanism so much.
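A rough sketch of that partitioning, assuming an in-memory stand-in for the native-fs store (ProjectTreeStore and its methods are hypothetical names): each commit takes exactly one per-tree lock, found by an O(1) dictionary lookup, so commits to different trees never wait on each other and no lock cycle can form.

```python
import threading

class ProjectTreeStore:
    """Hypothetical single-server store: one lock and one journal per tree."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the two dicts below
        self._locks = {}                # tree name -> lock
        self._journals = {}             # tree name -> ordered changesets

    def _slot(self, tree):
        # O(1) dictionary lookup picks the lock and journal for a tree.
        with self._guard:
            if tree not in self._locks:
                self._locks[tree] = threading.Lock()
                self._journals[tree] = []
            return self._locks[tree], self._journals[tree]

    def commit(self, tree, changeset):
        lock, journal = self._slot(tree)
        # A txn holds exactly one tree lock, so there is nothing to deadlock on.
        with lock:
            journal.append(changeset)
            return len(journal)

    def patch_level(self, tree):
        lock, journal = self._slot(tree)
        with lock:
            return len(journal)

store = ProjectTreeStore()

def worker(tree):
    for i in range(100):
        store.commit(tree, f"{tree} change {i}")

trees = ("svn/trunk", "svn/branches/1.0", "arch/trunk")
threads = [threading.Thread(target=worker, args=(t,)) for t in trees]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A commit that spanned two trees would need two locks and bring the deadlock question back, which is why the granularity of the project tree matters here.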

Now, to be sure, project trees and, gasp, planning to do away with
repository revision numbers do indeed make the model more easily
scalable. It'd be a far-sighted move to make such plans. But those
aren't the focus of what I'm talking about.

>> The second observation is that a commit consists of generating
>> a changeset client side, sending it to the server, checking for
>> up-to-dateness, and assigning a repository revision number. An
>> application-level log of such txns, suitable to ensure ACID
>> properties, is essentially just a per-project-tree list of
>> those changesets -- a data structure that's fairly easy to
>> implement on a native-fs -- plus another list to assign the
>> repository rev numbers.

> That's a useful observation which could help us implement more
> efficient journaling than BDB gives us, as you've discussed in
> the past, but there's no reason we couldn't do that without
> project directories.

Indeed. The idea of making a journal of commit records the definitive
history of the repository, and treating everything else as lazy
caching/memoizing/indexing, can make not just native-fs but other
storage managers faster and more space-efficient.
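As a toy illustration of "journal is definitive, everything else is cache": the sketch below assumes a changeset is just a dict mapping path to new text (None meaning delete), which is far simpler than a real changeset. The Tree class and its methods are hypothetical. Snapshots are computed lazily from the journal and can be thrown away at any time without losing history.

```python
class Tree:
    """Hypothetical project tree: the journal is the only source of truth."""

    def __init__(self):
        self.journal = []      # definitive: list of {path: text or None}
        self._cache = {0: {}}  # derived snapshots; rev 0 is the empty tree

    def commit(self, changeset):
        self.journal.append(dict(changeset))
        return len(self.journal)

    def snapshot(self, rev):
        if rev not in self._cache:
            # Start from the nearest earlier cached snapshot and replay.
            base = max(r for r in self._cache if r <= rev)
            files = dict(self._cache[base])
            for changeset in self.journal[base:rev]:
                for path, text in changeset.items():
                    if text is None:
                        files.pop(path, None)
                    else:
                        files[path] = text
            self._cache[rev] = files
        return dict(self._cache[rev])

    def drop_caches(self):
        # Losing the cache costs time, never data.
        self._cache = {0: {}}

t = Tree()
t.commit({"README": "v1"})
t.commit({"README": "v2", "Makefile": "all:"})
t.commit({"Makefile": None})
before = t.snapshot(2)
t.drop_caches()
after = t.snapshot(2)
```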

>> The third observation is that the various performance characteristics
>> we want can be built on-top of that basic lists-of-changesets
>> structure by caching and memoization of data about various revs.

> I don't think you can get the theoretical performance curve of
> skip-deltas simply by wrapping a cache around a changeset journal.

Um.... why not exactly? A cache of skip-deltas....
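Roughly what I mean, as a sketch only. It assumes the "clear the lowest set bit" rule for choosing a skip-delta base, and it reuses a toy changeset model (a dict of path to text) so that composing changesets is just a dict merge. The names skip_base, skip_delta and reconstruct are made up. The skip-deltas are computed from the journal on demand and memoized, so they are a cache, not part of the definitive history.

```python
from functools import lru_cache

# Toy journal: changeset for rev r is journal[r - 1].
journal = [{"f": f"text{r}"} for r in range(1, 1001)]

def skip_base(rev):
    """Base rev for rev's skip-delta: clear the lowest set bit."""
    return rev & (rev - 1)

def delta_chain(rev):
    """(base, target) pairs leading from rev 0 up to rev."""
    chain = []
    while rev > 0:
        chain.append((skip_base(rev), rev))
        rev = skip_base(rev)
    return chain[::-1]

@lru_cache(maxsize=None)
def skip_delta(rev):
    """Compose the journal entries in (skip_base(rev), rev] into one delta."""
    composed = {}
    for changeset in journal[skip_base(rev):rev]:
        composed.update(changeset)
    return composed

def reconstruct(rev):
    files = {}
    for _base, target in delta_chain(rev):
        files.update(skip_delta(target))
    return files

chain = delta_chain(1000)
tree_1000 = reconstruct(1000)
```

The chain length is the number of 1 bits in the rev number, so at most about log2(rev) deltas are combined once the cache is warm.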

>> But on what should we key those caches, indexes, and memos? The
>> project-tree boundaries, because of the tractable size of the trees
>> they contain and their relationship to the atomicity of commits, are
>> ideal.

> Why?

Tractable data set sizes; isolation of (nearly?) all txns.

>> Is that too brief?

> If you're suggesting keeping a fulltext cache of every N revisions of
> the repository, with hard-links between identical revs of files,

Not specifically, no. That's one technique to throw into
consideration.

For svn-like performance, I was thinking more of roughly a single
full-text, but with indexed changesets and cached skip-delta
changesets. (Arch revision libraries contain not just full-text, but
also the changesets and some rudimentary indexes. The plan here, for
arch, is to beef up that indexing a bit and rely less on full-texts:
to treat the evolved rev library as more like a cache and less like a
memo.)

> then
> you're probably not aiming for the same performance characteristics as I
> am. With the repository structure we have now, you can have millions of
> revs of a file and can get to any of them by combining a double-digit
> number of deltas and applying the result to the one plaintext stored in
> the repository.

No -- that's very much what I'm aiming for.

> I also disagree that units of atomicity are always of tractable size.
> gcc and Linux and Mozilla all require units of atomicity which are
> pretty damn big, assuming you can split them up at all. And if you do
> split them up, you'll probably want atomic commits across the units with
> some frequency.

My opinion strongly differs here -- and this is an area where having
someone spend 2-6 weeks studying the issue objectively is the right
thing.

>> It would have been much wiser, a few years back, to
>> implement commits in terms of tree-copies, not fs revision numbers.

> I guess the basic idea here is that the repository would only
> serve the head revision,

The head _repository revision_, yes -- the head _project tree
revision_, no. Of course you'd have access to all project tree
revisions, since they'd all be present in the head repository revision
(though they wouldn't all have equal access performance, of course).

> you'd commit by copying the trunk (or
> project) directory and modifying it, and an update is like a
> switch.

Pretty much, yeah.

> But that's not a complete vision: how does "svn update"
> know what to switch to? What URL would correspond to "the head
> of the trunk of the Subversion project" when the path of the
> head changes with each commit?

A path that didn't include a patch-level (aka project revision number)
would refer to the same path plus the highest numbered patch-level for
that path at the time of the start of the txn.

> What restrictions does the
> repository enforce to prevent history from disappearing from the
> space of what clients can access?

First class project trees.

Maybe this will help: the portions of the fs namespace that include
project tree revision numbers would have write-once semantics -- the
portions that don't would be sort of like (changing) symbolic links to
that write-once portion.
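A small sketch of that namespace, with hypothetical names throughout (Namespace, begin, resolve): paths that carry a patch-level are write-once, and a path without one resolves, like a changing symbolic link, to the highest patch-level that existed when the txn started.

```python
class Namespace:
    """Hypothetical fs namespace with first-class project trees."""

    def __init__(self):
        self._revs = {}  # (path, patch_level) -> contents; write-once
        self._head = {}  # path -> highest patch_level; the "symlink"

    def write(self, path, patch_level, contents):
        key = (path, patch_level)
        if key in self._revs:
            raise PermissionError(f"{path}@{patch_level} is write-once")
        self._revs[key] = contents
        self._head[path] = max(self._head.get(path, 0), patch_level)

    def commit(self, path, contents):
        patch_level = self._head.get(path, 0) + 1
        self.write(path, patch_level, contents)
        return patch_level

    def begin(self):
        # Head pointers as of the start of the txn.
        return dict(self._head)

    def resolve(self, txn, path, patch_level=None):
        if patch_level is None:
            patch_level = txn[path]
        return patch_level, self._revs[(path, patch_level)]

ns = Namespace()
ns.commit("svn/trunk", "tree at patch-1")
txn = ns.begin()
ns.commit("svn/trunk", "tree at patch-2")  # lands after txn started
seen = ns.resolve(txn, "svn/trunk")
pinned = ns.resolve(ns.begin(), "svn/trunk", patch_level=1)
```

History cannot disappear from what clients can reach because nothing ever overwrites a (path, patch-level) entry; only the head pointer moves.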

> Eliminating the revision number by making it effectively part of
> the path gains elegance in some areas, but loses it in others.

I don't see any losses in what you've described so far.

> And the only objective gain I've seen you describe has to do
> with the theoretical maximum commit throughput of a repository
> distributed across many servers with different servers taking
> synchronization responsibility for different parts of the
> namespace. That's just not a compelling argument.

This distribution thing you keep mentioning:

1) Gee, you know, actively planning _against_ that seems
short-sighted.

2) It's a red herring. The same properties that help multi-server
   implementations help get good performance out of single server
   implementations that are simpler and have fewer dependencies on
   3rd party packages.

-t

Received on Mon Apr 14 07:26:08 2003