This is a coalesced reply and, as such, ridiculously long. To help
skip to the parts of interest to particular readers, here is a table
of contents, suitable for use with search commands to find
subsections.
* Sander Looking for Big Holes in His Proposal
* On Karl's Priorities
* On Merge Histories and Project Management
* On Tree Deltas and File Identity
* On Native-FS Storage Mgt.
* On Ethics, Business, and "CollabNet's Corporate Interests"
* Separating V.A.P. from Merge History
--------------------------------
* Sander Looking for Big Holes in His Proposal
Karl Fogel ("KF"):
Sander ("S"):
KF> I'm not mad at anyone here -- I was seduced too.
S> *grin* It's a wonderful problem to solve ;).
I think we are strange birds of related species :-) Elsewhere you
comment on not seeing any "big holes" shot in your proposal. Well,
I think your proposal is both logically coherent and useful -- so I
don't expect to shoot any big holes in it in that sense. On the
other hand, I think it doesn't really address the priority rev ctl
needs for implementing business rules and project mgt policies, and
I think its notion of logical file identity (see below) is highly
problematic both for usability and for exporting changesets to users
not connected to your repository. Those are big holes of a
different sort.
* On Karl's Priorities
Karl says:
Karl Fogel ("KF"):
KF> Right now, I see all these exciting posts about adding
KF> merge support that goes far beyond what CVS does -- and
KF> then I go look at the bug database and reality hits me
KF> hard. [....]
[....]
KF> I'm not mad at anyone here -- I was seduced too. But
KF> truly, it's unrealistic to expect to ever release 1.0 if
KF> we don't stop with the new features and start dealing with
KF> the issues we already have. [....]
But he also says:
KF> [....] Sure, the changeset discussion has been productive,
KF> in the sense that many design issues are being worked out
KF> (without even any major flamewars, which is nice!). [....]
And I'd like to honor that perspective.
First, I agree that "the changeset discussion", which has expanded in
scope to include storage management and UI-post-1.0 and tricky
questions about the relationship between businesses and public free
software projects, has been productive (I'd say _at least_) in the
sense that interesting and important areas of the design spaces are
being productively explored. But, yes, I note that the issue count
isn't down to 60 something, yet -- I think it went up 1 or 2 since the
last time I saw the count message. Additionally, it's concurrent with
trying to put out fires like the recent svn.collabnet.net snafus
which, while they're illustrative of the desirability of reconsidering
certain storage mgt decisions, are also -- well, the fires that need
putting out.
So let me try to do my part not by dropping the discussion entirely
_but_ by trying to turn it down to a gentle simmer. Thus, I bring
you a compacted and coalesced reply to ghudson, dberlin, gstein,
brane, sander, and kfogel, including my attempts to sum up the
discussion.
* On Merge Histories and Project Management
I asserted that business rules and project management are typically
concerned with the histories of project trees and project-tree
development lines rather than the histories of individual files or
directories. Furthermore, I added that the merge history mechanism
in Sander's proposal does not seem to provide a convenient
project-tree or project-line merge history -- but rather that such
histories would be expensive and tricky to compute from the merge
records kept in Sander's proposal. This is one of several reasons
why I advocate dedicating resources to work on supporting first-class
project trees in svn, merging with project-tree granularity, and
history records kept per-project-tree.
Brane asserts that in Sander's proposal, a directory (presumably the
top level directory of a project tree) will have a merge history
that is representative of the project tree as a whole:
Brane ("B"):
B> Any directory can serve as such a node. Even when they're
B> not explicitly modified by a commit, every commit creates a
B> new version of the directories that contain the changed
B> files, all the way to the root, because of the "bubble-up"
B> rule. These directories will be included in every merge.
Of course, Sander hasn't spelled out his tree-delta rules, so we
have to reason about the design space for them.
You, Brane, seem to be saying that if any of the nested contents of
a directory are modified by a merge originating in a different
directory, even if the node_id/copy_id contents of the directory are
not themselves altered, the directory will acquire the
necessary change records. There are two significant problems with
that supposition.
First, a merge to some nested part of a directory may be only a
partial merge of the changes associated with a particular range of
repository revisions on the merged-from branch. It is an expensive
and tricky computation to figure out whether a given merge is
complete or partial. Thus, when the merge history "bubbles up",
what should the containing directory record? There is no mechanism
in Sander's proposal for recording "part of the nested contents of
this directory have been merged; other parts haven't".
Additionally, the question of how to decide whether a given merge
was partial or complete hasn't been addressed at all.
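To make the cost concrete, here is a minimal sketch of why the
partial-vs-complete question is expensive. It assumes a hypothetical
per-path record of which merged-from revisions each path has absorbed
(the record shape and names are mine, not Sander's proposal):

```python
# Hypothetical shape for per-path merge records: for each versioned
# path, the set of revisions already merged in from a source branch.
# These names are illustrative, not part of any actual svn design.
def merge_is_complete(nested_paths, merged_revs, wanted_revs):
    """A merge of `wanted_revs` into a directory counts as complete
    only if *every* nested path has absorbed every wanted revision --
    an O(paths x revisions) walk, with no single record to consult."""
    return all(wanted_revs <= merged_revs.get(p, set())
               for p in nested_paths)

paths = ["proj/a.c", "proj/b.c"]
records = {"proj/a.c": {10, 11}, "proj/b.c": {10}}
# Revision 10 reached everything; revision 11 only reached a.c,
# so a merge of 10..11 is partial.
```

The point of the sketch: no bubble-up summary at the directory level
can answer this question by itself, because completeness is a property
of the whole nested-path set.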
Second, since repositories are free-form, and the top-level
directory layout "just a recommendation", let's suppose that instead
of making "/branches", I make "/branches/division1" and
"/branches/division2" where division1 and division2 are two parts of
my development organization. Furthermore, let's suppose that I'm
storing 15 or 20 projects in this repository rather than just one.
If records "bubble up", as you propose, then the merge histories for
"/branches/division[12]" are going to be both huge and largely
useless.
So I think that, in the design space for how to handle tree-delta
merge histories, the only practical option is to not bubble up --
but to modify directory node histories only when the
name/node_id/copy_id contents of a directory change. And in that
case, as I said, there is no merge history that reliably records the
history of a project tree; to compute the merge history of a
project tree will be both expensive and tricky.
Greg offers this twist:
Greg Hudson ("GH"):
GH> Merge records don't bubble up, but changes do. So any
GH> merge done from the top level of a project will involve
GH> changing the root directory of the project, and therefore
GH> adding a merge record to that directory. Or so I would
GH> expect, anyway. (Just like the "file that changes with
GH> every merge" which you suggested as a workaround
GH> previously.)
I think your expectation is either incorrect or involves UI
incoherence.
The "change to the project root" is not necessarily anything more
than a change to the revision numbers of its name:node_id.copy_id
contents. Similar changes are made to each parent directory of the
root. So if such changes trigger merge-history updates, then all
parent directories up to the root are similarly affected, and the
problems I described above still apply. But that would be "bubbling
up" and you said "merge records don't bubble up" so:
I assume that you mean, then, that the directory named in the
arguments to `merge' gets an updated merge-history if any nested
contents of that directory are modified by the merge. Yet this
approach is ultimately UI-incoherent for at least two reasons.
First, it invites the user error of performing a complete merge in
the affected subtrees, yet failing to have that noted for the
project tree. Second, it invites the user error of committing
affected subtrees but never committing the project tree directory.
(So, on those two points, there is yet again a dissonance between
the "policy free" nature of the storage manager and the "project
tree" orientation of sane practices -- so, as elsewhere, I say it
makes sense to treat storage mgt and rev ctl as layers, project
trees as first class objects in the rev ctl layer, and to move to a
UI that reflects the rev ctl abstractions rather than the storage
manager abstractions).
* On Tree Deltas and File Identity
I asserted that node_id will not work as a mechanism for tree deltas
because a single project tree can reasonably contain multiple files
which have the same node_id.
To expand on my assertion, let me anticipate Sander's reply: that
node_id PLUS the history of copy/rename operations (as would be
considered by variance adjusted patching) together add up to a
robust notion of file identity. To respond to that anticipated
reply: yes, the history PLUS the node_id add up to a logically
_coherent_ notion of file identity, but a very poor user interface
to file identity.
Consider for example how I would edit a source tree absent version
control. I'd use `cp' and `mv' and `rm' and `open(2)' and just work
"free form" until the tree was in a state that just worked.
Although it isn't explicit without something like an inventory tag
mechanism, the files in my tree have logical identities. But in the
general case, the sequence of operations I'd perform to set up those
files won't work in svn. For example, if I copy a file, then rename
the original: which of the two copies has the same logical identity
as the original? In svn, using node_id plus history as logical file
identity, the original but renamed file inherits the logical
identity -- if I happen to want it the other way around, I'm
screwed. And if I happen to introduce a new node, but want it to
have the same logical identity as an old node, I'm screwed again --
I have to remember to check out the old node, delete its contents
and replace them.
In short, I think the easiest to understand and use user interface
comes from adding a first-class notion of logical file identity
which is unrelated to node_id and unrelated to the history of tree
rearrangements. I think it makes the most sense to separate out the
physical history of versioned objects from their logical identities
for the purpose of whole-tree patching -- to make logical identities
something that users can attach to files/directories freely and
rearrange arbitrarily at will, with ease.
Another way to look at this issue is to consider generating
whole-tree changesets which people can apply to (possibly modified)
source trees that don't have any direct relationship to a
repository. To apply such a changeset, I need to be able to compute
the logical identities of files in the source tree without
consulting the repository -- the history simply isn't available. So
representing logical identity within the svn metadata of a wc plus the
metadata in a repository doesn't fly -- not if I want to be able to
distribute changesets to users who have copies of my distributions.
So: inventory tags, represented in the source tree.
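A minimal sketch of what "inventory tags, represented in the source
tree" could look like. The sidecar-file convention here (a `.id` file
next to each source file, holding its tag) is entirely made up for
illustration -- the point is only that logical identity becomes
computable by scanning the tree itself, with no repository access:

```python
import os

def read_inventory(tree_root):
    """Map logical-identity tags to paths by scanning the tree.
    Assumed (hypothetical) convention: a file `foo.id` sitting next
    to `foo` contains foo's tag on its first line.  No repository,
    no history -- just the tree."""
    tags = {}
    for dirpath, _dirs, files in os.walk(tree_root):
        for name in files:
            if name.endswith(".id"):
                with open(os.path.join(dirpath, name)) as f:
                    tag = f.read().strip()
                # Strip the ".id" suffix to recover the tagged path.
                tags[tag] = os.path.join(dirpath, name[:-3])
    return tags
```

With something like this, a whole-tree changeset can match files by
tag in any copy of the distribution, however the recipient has
renamed or rearranged them.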
* On Native-FS Storage Mgt.
Roughly speaking, we're talking about using a changeset journal as
the definitive record of the state of the repository, and building
various kinds of supplementary data structures on top of that to
achieve the access performance characteristics implicitly assumed in
the svn design. Such a journal is fairly simple to implement on a
native FS, as are the supplementary data structures.
I remarked that introducing first-class project trees makes it more
practical to build a native-fs storage mgr (meaning things like
easier to implement, easier to understand, easier for users to tune,
easier to process repositories with fewer code dependencies, etc.). I
pointed out that it is not unreasonable to use project trees as the
granularity of atomic commits (e.g., for locking), and that it might
even be reasonable to tweak the UI semantics by asserting that
atomic commits are not guaranteed except _within_ individual project
trees (a tweak that would further simplify implementation).
I also pointed out that repository revision numbers are a redundant
mechanism given cheap tree copying, and that they complicate storage
management implementation by adding the requirement that
non-interfering commits be, nevertheless, serialized and numbered.
Greg Hudson ("GH"):
GH> You seem to have conceded a different point than I stated.
GH> The app-level changeset journal idea isn't just orthogonal
GH> to the choice of storage manager; it's orthogonal to the
GH> idea of project subdirectories.
I don't think I "conceded" that: I think I said so from the start.
When I first proposed journaling, it was as a cure for the big BDB
log file problem, and in that context, project trees weren't
mentioned.
As a practical matter, journaling and first-class project trees
interact nicely: two separate logs for two distinct project-tree
development lines have separate locks. Concurrent commits to the two
projects simply don't interact. If every commit to a journal has
to share a single lock just to begin to sort out conflicting
concurrent commits, that's an added implementation complexity.
Beyond locking, having the journal pre-sorted by project tree
development line means that for any query confined to a single
project tree, I can find all of the relevant journal entries in an
O(1) operation (just "look at the right part of the journal") rather
than having to search the journal or rely on an external index of
it. An example of such a query is a client asking: "is my wc of
this project tree up to date?" -- that translates into "give me the
list of journal entries for this project tree".
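As a toy model of the pre-sorted journal (shapes and names are mine,
purely illustrative): because entries are already segregated by
project-tree development line, the up-to-date query is a direct lookup
into one compartment, never a scan of a repository-wide log.

```python
# A journal pre-partitioned by project-tree development line.
# Each entry carries a per-line sequence number and the paths touched.
journal = {
    "proj-a/mainline": [{"seq": 1, "paths": ["a.c"]},
                        {"seq": 2, "paths": ["b.c"]}],
    "proj-b/mainline": [{"seq": 1, "paths": ["x.c"]}],
}

def entries_since(line, wc_seq):
    """Everything committed to one project line after the wc's
    sequence number.  Locating the right compartment is O(1);
    no other project's traffic is ever examined."""
    return [e for e in journal[line] if e["seq"] > wc_seq]
```

"Is my wc up to date?" then reduces to asking whether
`entries_since(line, my_seq)` is empty.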
me:
>> For svn-like performance, I was thinking more of roughly a
>> single full-text, but with indexed changesets and cached
>> skip-delta changesets.
GH> It seems like by the time you're done with this cache, it
GH> would look very much like our current filesystem structure.
GH> Certainly, I don't think it would be any simpler for the
GH> introduction of project directories.
It would have some gross structural resemblance to the current
structure, sure. For example, both would have skip-deltas. Both
need tools that compute skip deltas.
Leaving aside the benefits of first class project trees for a
second: a native-FS cache-based implementation of things like
skip deltas is simpler because you don't have all the hair (code
dependencies, performance impacts, etc.) of building the skip
deltas in the same transactional database that's processing
commits. The skip delta cache can be a nicely orthogonal
component with a procedural API and an implementation whose storage
management issues are separate from everything else and nicely
modularized.
Bringing project trees back in: they simplify things by allowing you
to compute skip deltas for project trees, rather than individual
nodes. If you index a skip-delta cache by project tree, you'll have
N compartments of deltas; if you index by node or node/copy_id,
you'll have N * K where, typically, K is in O([100...5000]). If a
query needs deltas that pertain to a particular file within a
project tree, for example, a trivial linear search of some
delta-chain for that project tree is a practical solution -- given
first-class project trees.
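The indexing arithmetic above can be sketched directly (the figures
and function names are mine, just restating the claim in code):

```python
# With per-node indexing, N project trees averaging K versioned nodes
# each need N * K delta compartments; indexing by project tree needs
# only N, traded against a linear scan inside one compartment.
def compartments(n_projects, nodes_per_project, by_project_tree):
    return n_projects if by_project_tree else n_projects * nodes_per_project

def deltas_for_file(compartment, path):
    """The 'trivial linear search' of one project tree's delta chain
    for the entries touching a particular file."""
    return [d for d in compartment if d["path"] == path]
```

For the repository sizes discussed (K on the order of hundreds to
thousands), that is the difference between, say, 15 compartments and
tens of thousands.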
GH> Right now the filesystem namespace is uniform and owned
GH> completely by the user. (The trunk/branches/tags
GH> convention is just a suggestion to the user; there are no
GH> plans to make any Subversion software assume or enforce
GH> the convention.) I consider that elegant. What you're
GH> suggesting makes the top one or two levels of the
GH> namespace owned by the implementation, and introduces the
GH> notion of an implicit symlink for the head revision of a
GH> project directory.
Well, it's an "elegant _something_". I agree with that. By
startling coincidence, slightly before svn started, I wrote a very
similar little free-form transactional file system, layered on a
library for "[functional, transactional] persistent hash tries"
rather than BDB. It was a neat approach for many reasons, one of
which, amusingly enough, is that it didn't need a write-ahead log to
achieve ACID properties. Considered _just_as_ a transactional file
system, I think the elegance you refer to is _profound_ -- it's a
hop, skip, and a bounce away from being an RDBMS-killer, for
example. The big challenge, there, is to get to an implementation
that satisfies a very wide range of access-pattern performance
constraints -- and the first step in that challenge is figuring out
a semantics that doesn't preclude (nay, even prepares) achieving
such performance.
But is the model an "elegant revision control system" in and of
itself? I don't think so -- at least not for projects "at scale".
Rather, I think that, from a birds-eye perspective, the txnal file
system is an elegant storage manager for revision control -- but the
additional structure is critical to supporting the roles of revision
control in project mgt., auditing, scalable development and so
forth. On the topic of "trunk/branches/tags" -- another thread has
talked about the mechanism to initialize that recommended structure
and in that thread, someone or other said (and I'm paraphrasing
here, but just barely) "svn is _useless_ [for revision control]
without that additional structure". You may or may not recall my
advocating, many months ago, the idea of reconceptualizing the
project into two layers: a txnal fs storage manager, and a revision
control system layered on top of that. I still advocate that.
So as for those top few directory levels being "owned by the
implementation" -- I'd say, they are generic as far as the txnal fs
is concerned, but special as far as the revision control system is
concerned. The documentation already recommends usage patterns that
make that so -- it pays off nicely to leverage that in the
implementation.
GH> On consideration, I don't think you gain any elegance in
GH> the UI by eliminating revision numbers, because you still
GH> need to be able to, say, check out by date, or diff
GH> against the previous revision of a file. So it wouldn't
GH> be a simple matter of eliminating the -r option to every
GH> command which currently takes it.
At least two ways you gain elegance by assigning revision numbers
within separate lines of development rather than globally to a
repository are: (a) the numbers become more meaningful to users, and
(b) the numbers don't exceed the range implied by the lifetime of a
development line. An example of (a), the previous change to
development line at $project-line/<N> is reliably found in
$project-line/<N-1>, rather than the previous change to
$project-line@<N> being found in $project-line/<N-$random()>. As an
example of (b), in my few-years-old repository which contains 15
busy projects, revision numbers for project lines stay in the 3-4
digit range rather than the 6-7 digit range.
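Point (a) can be restated as a one-liner versus a search (a sketch,
with an invented map from global revnums to the lines they touched):

```python
# Per-line numbering: the previous change to a project line is
# trivially N-1.  Repository-global numbering: you must search
# backwards for the last revision that actually touched the line.
def prev_rev_per_line(n):
    return n - 1

def prev_rev_global(global_history, line, n):
    """global_history: map of revnum -> set of project lines touched
    by that revision (illustrative shape)."""
    return max(r for r, lines in global_history.items()
               if r < n and line in lines)
```

The second function is the `$random()` gap from the example above:
how far back it reaches depends entirely on the other projects'
commit traffic.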
You also gain implementation-degrees-of-freedom elegance because
there are fewer txns that need to be definitively serialized.
[On scaling to high throughput and multi-server
implementations]
me:
>> 1) Gee, you know, actively planning _against_ that seems
>> short-sighted.
GH> Not really. I don't see Subversion evolving in that
GH> direction. I see better cross-repository support as a
GH> much better direction to look towards than introducing the
GH> concept of partitioning within repositories. And even if
GH> I'm wrong, a central point which does nothing but assign
GH> revision numbers can scale to a really really high level
GH> of throughput; you'd have to have millions of servers
GH> before that central point would become a bottleneck.
That repo-revnum server becomes an administrative and robustness
bottleneck and an extra burden on implementations even if, in
theory, a perfectly functioning instance of such a server is not a
serious performance bottleneck.
Beyond that -- while you may choose not to see svn's txnal fs
scaling that way, I think you have to anticipate txnal fs' coming
along that _do_ scale that way in the not too distant future. So
layering a rev ctl system on those is the not too distant future.
If svn semantics don't fit such a scalable txnal fs, then they
aren't the not too distant future.
* On Ethics, Business, and "CollabNet's Corporate Interests"
I remarked, off-handedly, to Brane that as a practical matter,
much of what I was talking about was really of most direct interest
to CollabNet, even though it came up while talking to him.
In reply (not just from Brane), I was told that I was "backsliding"
(?!?) and accused of trying to (paraphrased) "Go over the heads of
developers." In general, I got the sense that the topic of
business interests in public free software projects, where the
businesses in question are participating in such projects, is
supposed to be somehow "taboo" -- that it is assumed to be impolite
or crazy or underhanded to even mention them.
I pointed out that those business interests cannot and should not be
ignored, that there are naturally conflicting agendas in the project
and that some of those agendas come from the corporate perspective.
I described this as a "conflict of interest" and suggested that it
need not be a harmful conflict -- but rather is an interesting and
useful opportunity for requirements analysis and resource
provisioning.
Greg Stein had an interesting and helpfully diplomatic take:
me (">>"):
>> Alas, in saying that, I suppose that I'm in some sense
>> speaking more through you to collabnet than to you
>> directly.
brane@xbc.nu ("B"):
B> Tom, you're backsliding again. :-) Let's leave CollabNet's
B> commercial interests out of this.
>> Excuse me?
>> First, I don't find that funny and so I don't understand
>> your ":-)".
Greg Stein ("GS"):
GS> I think Brane means to work through the design issues and
GS> leave out the commercial interests. There isn't any reason
GS> to bring commercial issues to the table when you're doing
GS> design work.
Ah. I think that is completely wrong (not your interpretation of
Brane -- the attitude conveyed in that interpretation). "Commercial
interests", ideally, are a cost/benefit summary of the interests of
potential users of the system. They are, therefore, valuable input
to the design process. In the other direction, the commercial
interests benefit from "what is the nature of the design space
reality" input from the design process. So on the contrary --
"leaving out" commercial interests seems to me to be precisely the
wrong thing to do -- bidirectional feedback, on the other hand,
seems to me to be the right thing to do.
>> Second, are you saying that "CollabNet's commercial
>> interests" do not have a significant impact on those core
>> developers who are employed by svn or that they do not, in
>> turn, have significant impact on the plans for and design
>> of svn?
GS> Yes and yes. But the impact on the plans/design is granted
GS> by the community rather than enforced/required by
GS> CollabNet. There is a large difference there, and one that
GS> I'm happy about.
In private communication, I have heard that CollabNet's commercial
interests most definitely do have a significant impact on core
developers employed by CollabNet, and on the list, I observe that
those developers have a significant impact on the project's plans,
progress, and design efforts.
It's a sad, sad commentary on the state of _class_ in the world that
my speaking of the interests of CollabNet evokes in reply (a) a
defense that they aren't "enforcing", and (b) an accusation that I am
"going over developers' heads". Implicit in both replies is an
association of corporate interests with AUTHORITY -- in the case of
(a), wielded benevolently; in the case of (b), presumed to be a
potentially hostile force which I'm trying to invoke with a magic
spell.
Healthy business activities are based on the win-win scenario. That
applies to CollabNet's relationship with its employees, its
customers, and its relationship with public projects. I tend to
believe that the best win-win scenario here is not to deny the
influence of, or to make a taboo topic of, CollabNet's business
interests but rather to permit those interests to be a first-class
topic of discussion with the aim of optimizing their relationship to
everything else.
To be slightly more explicit: on the one hand, I have a bunch of
technology in arch that, as is starting to become clear in various
branches of this discussion, is applicable to svn. I've tried to
explain why some of that technology is particularly important to
commercial users of revision control (e.g., the relationship between
business rules and merge history). I tend to believe that those are
issues that are important to consider from the CollabNet
perspective. At the same time, my financial FUBAR is fairly widely
known, and that situation has immediate, significant impact on the
likely future evolution of the technology I'm referring to.
And just to reiterate:
>> Consequently, ruling CollabNet's business interests to be a
>> taboo topic is extraordinarily inappropriate.]
GS> It isn't taboo. I think he's just trying to say that it
GS> doesn't have a place in technical design discussions.
"backsliding?", "going over developers' heads?" -- those are
taboo-making replies.
GS> Talk about users' needs, sure. That is great, and is
GS> essential. But there isn't much need to worry about
GS> CollabNet's users specifically. (although I do appreciate
GS> the consideration :-)
To state the obvious, `design' or `engineering' doesn't happen in a
non-economic context. Just the opposite. A hefty fraction of the
economic resources dedicated to subversion come, precisely, from
estimates of its commercial value. To say that "design decisions"
should not consider that fact is naive and seems to me to be rooted
in a very sad perception of the relationship between corporations
and the free software world.
Alas, my replies also just inflamed the argumentation:
me:
>> In such circumstances, in all walks of life, civil society
>> makes the judgement that "There is a conflict of interest
>> there." We _never_ burden people embraced in such a
>> conflict with the responsibility to separate out concerns
>> (a) and (b). We _always_ assume that even the most
>> reasonable, well intentioned persons can not separate out
>> such interests when they collide in a single mind.
Dan Berlin ("DB"):
DB> Not quite.
DB> In law, at least, we allow it in quite a few cases without
DB> anything further required, and in almost all cases if the
DB> clients consent after consultation.
Yes, but "consent" does not always imply that the conflict ceases to
exist or is presumed to have no effect. It may imply that the
conflict is presumed to be manageable or even beneficial.
I've suggested that we presume, in this case, that it is both
manageable and beneficial. I'll reiterate a few points:
CollabNet ideally (and, I think, in practice to some extent)
represents the interests of an interesting group of potential
revision control consumers. Their influence should be regarded in
that light as "valuable suggestions for design and project
mgt. constraints". For example, when Karl says (paraphrasing)
"Uh.. this is all interesting and such ... but I think shrinking the
issue list is priority #1" -- I think it's reasonable to interpret
that, partly, as "CollabNet's contributions to the project can be
maximized by minimizing the time to 1.0".
And I think it's a bi-directional relationship. For example, I
don't think I'm exactly a slouch in the areas of professional source
mgt. and revision control. As I said, I've had some interesting
mentorship in this area, and some interesting communications with
people dealing with really substantial commercial source
mgt. headaches --- and of course the experience with arch. So while
CollabNet has some market insight to bring to the svn project, I
think I have some market and technology insight to offer to
CollabNet.
DB> I'm not sure whether this puts lawyers in the category of
DB> unreasonable, non-well-intentioned, or not having a mind,
DB> or some combination of the three.
DB> Or of course, you could just be plain wrong.
Or, perhaps, you are just looking for an excuse to flame and defame
me without looking very carefully at what I'm actually saying.
DB> Have you actually looked at the codes of ethical conduct
DB> adhered to by other licensed practioner fields (IE
DB> psychiatry, etc)?
At least as much as the next guy... probably a bit more. I've also
looked at these issues from some other valuable perspectives such as
academic philosophy and the philosophy of ethics.
* Separating V.A.P. from Merge History
Sander's proposal talks about v.a.p. (variance-adjusted patching)
being the merge mechanism, but the proposal is really about merge
history.
These are separate concerns. There are many merge algorithms,
v.a.p. being just one. Merge history is relevant to many of those
algorithms.
It's a useful touchstone to make sure that the proposed history
mechanism is sufficient for v.a.p. -- but a mistake to think that
v.a.p. is the only algorithm it should support.
Rather than v.a.p. alone -- it would be useful to have a list of
touchstones by which to judge history mechanisms. These should
include alternative merge algorithms, sure -- but also uses of merge
history in situations where repository access is not assured, and
generation and application of changesets in situations where
repository access is not assured.
-t
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Apr 15 00:01:49 2003