
Re: Changesets vs Wet Blankets

From: Tom Lord <lord_at_emf.net>
Date: 2003-04-15 00:11:52 CEST

This is a coalesced reply and, as such, ridiculously long. To help
skip to the parts of interest to particular readers, here is a table
of contents, suitable for use with search commands to find

  * Sander Looking for Big Holes in His Proposal
  * On Karl's Priorities
  * On Merge Histories and Project Management
  * On Tree Deltas and File Identity
  * On Native-FS Storage Mgt.
  * On Ethics, Business, and "CollabNet's Corporate Interests"
  * Separating V.A.P. from Merge History


* Sander Looking for Big Holes in His Proposal

        Karl Fogel ("KF"):
        Sander ("S"):

        KF> I'm not mad at anyone here -- I was seduced too.

        S> *grin* It's a wonderful problem to solve ;).

  I think we are strange birds of related species :-) Elsewhere you
  comment on not seeing any "big holes" shot in your proposal. Well,
  I think your proposal is both logically coherent and useful -- so I
  don't expect to shoot any big holes in it in that sense. On the
  other hand, I think it doesn't really address the priority rev ctl
  needs for implementing business rules and project mgt policies, and
  I think its notion of logical file identity (see below) is highly
  problematic both for usability and for exporting changesets to users
  not connected to your repository. Those are big holes of a
  different sort.

* On Karl's Priorities

  Karl says:

        Karl Fogel ("KF"):

        KF> Right now, I see all these exciting posts about adding
        KF> merge support that goes far beyond what CVS does -- and
        KF> then I go look at the bug database and reality hits me
        KF> hard. [....]


        KF> I'm not mad at anyone here -- I was seduced too. But
        KF> truly, it's unrealistic to expect to ever release 1.0 if
        KF> we don't stop with the new features and start dealing with
        KF> the issues we already have. [....]

  But he also says:

        KF> [....] Sure, the changeset discussion has been productive,
        KF> in the sense that many design issues are being worked out
        KF> (without even any major flamewars, which is nice!). [....]

  And I'd like to honor that perspective.

  First, I agree that "the changeset discussion", which has expanded in
  scope to include storage management and UI-post-1.0 and tricky
  questions about the relationship between businesses and public free
  software projects, has been productive, (I'd say _at least_) in the
  sense that interesting and important areas of the design spaces are
  being productively explored. But, yes, I note that the issue count
  isn't down to 60 something, yet -- I think it went up 1 or 2 since the
  last time I saw the count message. Additionally, it's concurrent with
  trying to put out fires like the recent svn.collabnet.net snafus
  which, while they illustrate the desirability of reconsidering
  certain storage mgt decisions, are also -- well, the fires that need
  putting out.

  So let me try to do my part not by dropping the discussion entirely
  _but_ by trying to turn it down to a gentle simmer. Thus, I bring
  you a compacted and coalesced reply to ghudson, dberlin, gstein,
  brane, sander, and kfogel, including my attempts to sum up the

* On Merge Histories and Project Management

  I asserted that business rules and project management are typically
  concerned with the histories of project trees and project-tree
  development lines rather than the histories of individual files or
  directories. Furthermore, I added that the merge history mechanism
  in Sander's proposal does not seem to provide a convenient
  project-tree or project-line merge history -- but rather that such
  histories would be expensive and tricky to compute from the merge
  records kept in Sander's proposal. This is one of several reasons
  why I advocate dedicating resources to work on supporting first-class
  project trees in svn, merging with project-tree granularity, and
  history records kept per-project-tree.

  Brane asserts that in Sander's proposal, a directory (presumably the
  top level directory of a project tree) will have a merge history
  that is representative of the project tree as a whole:

        Brane ("B"):

        B> Any directory can serve as such a node. Even when they're
        B> not explicitly modified by a commit, every commit creates a
        B> new version of the directories that contain the changed
        B> files, all the way to the root, because of the "bubble-up"
        B> rule. These directories will be included in every merge.

  Of course, Sander hasn't spelled out his tree-delta rules, so we
  have to reason about the design space for them.

  You, Brane, seem to be saying that if any of the nested contents of
  a directory are modified by a merge originating in a different
  directory, even if the node_id/copy_id contents of the directory are
  not themselves altered, the directory will acquire the
  necessary change records. There are two significant problems with
  that supposition.

  First, a merge to some nested part of a directory may be only a
  partial merge of the changes associated with a particular range of
  repository revisions on the merged-from branch. It is an expensive
  and tricky computation to figure out whether a given merge is
  complete or partial. Thus, when the merge history "bubbles up",
  what should the containing directory record? There is no mechanism
  in Sander's proposal for recording "part of the nested contents of
  this directory have been merged; other parts haven't".
  Additionally, just deciding whether it was a partial or complete
  merge hasn't been addressed.
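
  To make the partial-versus-complete question concrete, here is a
  naive sketch in Python. It is not part of Sander's proposal; the
  function names and the flat lists of paths are assumptions made
  purely for illustration. Note that it needs the full set of paths
  changed on the merged-from branch, which is the expensive part, and
  that one and the same merge can be complete for a subdirectory yet
  partial for its parent.

```python
def under(directory, path):
    """True if path is the directory itself or lies somewhere beneath it."""
    prefix = directory.rstrip("/") + "/"
    return path == directory or path.startswith(prefix)


def classify_merge(source_changes, merged_paths, directory):
    """Classify a merge as seen from `directory` (hypothetical helper).

    source_changes: every path changed on the merged-from branch in the
                    revision range being merged.
    merged_paths:   the paths the merge actually applied.
    Returns ("complete" | "partial", sorted list of unmerged paths).
    """
    relevant = {p for p in source_changes if under(directory, p)}
    missing = sorted(relevant - set(merged_paths))
    return ("complete" if not missing else "partial"), missing


# Illustrative data: three paths changed on the branch, one was merged.
source_changes = ["src/a.c", "src/lib/b.c", "doc/README"]
merged_paths = ["src/lib/b.c"]

seen_from_lib = classify_merge(source_changes, merged_paths, "src/lib")
seen_from_src = classify_merge(source_changes, merged_paths, "src")
```

  Seen from "src/lib" the merge is complete; seen from "src" it is
  partial, and a bubbled-up record has no way to say so.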

  Second, since repositories are free-form, and the top-level
  directory layout "just a recommendation", let's suppose that instead
  of making "/branches", I make "/branches/division1" and
  "/branches/division2" where division1 and division2 are two parts of
  my development organization. Furthermore, let's suppose that I'm
  storing 15 or 20 projects in this repository rather than just one.
  If records "bubble up", as you propose, then the merge histories for
  "/branches/division[12]" are going to be both huge and largely

  So I think that, in the design space for how to handle tree-delta
  merge histories, the only practical option is to not bubble up --
  but to modify directory node histories only when the
  name/node_id/copy_id contents of a directory change. And in that
  case, as I said, there is no merge history that reliably records the
  history of a project tree; to compute the merge history of a
  project tree will be both expensive and tricky.
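
  A small Python simulation of the two options, as a sketch only. The
  layout (15 projects split across two divisions, 40 merges each) and
  the function names are assumptions chosen to match the example
  above, not anything svn implements.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def ancestors(path):
    """Return every proper ancestor directory of path, up to "/"."""
    return [str(parent) for parent in PurePosixPath(path).parents]


def record_merges(merge_targets, bubble_up):
    """Return {directory: number of merge records} after the merges."""
    history = defaultdict(int)
    for target in merge_targets:
        history[target] += 1
        if bubble_up:
            # Every containing directory also acquires a record.
            for parent in ancestors(target):
                history[parent] += 1
    return dict(history)


# Hypothetical repository: 15 projects, 40 merges into each of them.
targets = []
for i in range(15):
    division = "division1" if i % 2 == 0 else "division2"
    targets += [f"/branches/{division}/project{i}"] * 40

with_bubble = record_merges(targets, bubble_up=True)
without_bubble = record_merges(targets, bubble_up=False)
```

  With bubbling, "/branches/division1" accumulates a record for every
  merge into any of its projects; without it, the division directories
  carry no merge history at all, and nothing records the history of a
  project tree as a whole unless its root happens to be the merge
  target.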

  Greg offers this twist:

        Greg Hudson ("GH"):

        GH> Merge records don't bubble up, but changes do. So any
        GH> merge done from the top level of a project will involve
        GH> changing the root directory of the project, and therefore
        GH> adding a merge record to that directory. Or so I would
        GH> expect, anyway. (Just like the "file that changes with
        GH> every merge" which you suggested as a workaround
        GH> previously.)

  I think your expectation is either incorrect or involves UI

  The "change to the project root" is not necessarily anything more
  than a change to the revision numbers of its name:node_id.copy_id
  contents. Similar changes are made to each parent directory of the
  root. So if such changes trigger merge-history updates, then all
  parent directories up to the root are similarly affected, and the
  problems I described above still apply. But that would be "bubbling
  up" and you said "merge records don't bubble up" so:

  I assume that you mean, then, that the directory named in the
  arguments to `merge' gets an updated merge-history if any nested
  contents of that directory are modified by the merge. Yet this
  approach is ultimately UI-incoherent for at least two reasons.
  First, it invites the user error of performing a complete merge in
  the affected subtrees, yet failing to have that noted for the
  project tree. Second, it invites the user error of committing
  affected subtrees but never committing the project tree directory.
  (So, on those two points, there is yet again a dissonance between
  the "policy free" nature of the storage manager and the "project
  tree" orientation of sane practices -- so, as elsewhere, I say it
  makes sense to treat storage mgt and rev ctl as layers, project
  trees as first class objects in the rev ctl layer, and to move to a
  UI that reflects the rev ctl abstractions rather than the storage
  manager abstractions).

* On Tree Deltas and File Identity

  I asserted that node_id will not work as a mechanism for tree deltas
  because a single project tree can reasonably contain multiple files
  which have the same node_id.

  To expand on my assertion, let me anticipate Sander's reply: that
  node_id PLUS the history of copy/rename operations (as would be
  considered by variance adjusted patching) together add up to a
  robust notion of file identity. To respond to that anticipated
  reply: yes, the history PLUS the node_id add up to a logically
  _coherent_ notion of file identity, but a very poor user interface
  to file identity.

  Consider for example how I would edit a source tree absent version
  control. I'd use `cp' and `mv' and `rm' and `open(2)' and just work
  "free form" until the tree was in a state that just worked.
  Although it isn't explicit without something like an inventory tag
  mechanism, the files in my tree have logical identities. But in the
  general case, the sequence of operations I'd perform to set up those
  files won't work in svn. For example, if I copy a file, then rename
  the original: which of the two copies has the same logical identity
  as the original? In svn, using node_id plus history as logical file
  identity, the original but renamed file inherits the logical
  identity -- if I happen to want it the other way around, I'm
  screwed. And if I happen to introduce a new node, but want it to
  have the same logical identity as an old node, I'm screwed again --
  I have to remember to check out the old node, delete its contents
  and replace them.

  In short, I think the easiest to understand and use user interface
  comes from adding a first-class notion of logical file identity
  which is unrelated to node_id and unrelated to the history of tree
  rearrangements. I think it makes the most sense to separate out the
  physical history of versioned objects from their logical identities
  for the purpose of whole-tree patching -- to make logical identities
  something that users can attach to files/directories freely and
  rearrange arbitrarily at will, with ease.

  Another way to look at this issue is to consider generating
  whole-tree changesets which people can apply to (possibly modified)
  source trees that don't have any direct relationship to a
  repository. To apply such a changeset, I need to be able to compute
  the logical identities of files in the source tree without
  consulting the repository -- the history simply isn't available. So
  representing logical identity within the svn metadata of wc plus the
  metadata in a repository doesn't fly -- not if I want to be able to
  distribute changesets to users who have copies of my distributions.

  So: inventory tags, represented in the source tree.
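
  As a rough sketch of what that buys, the Python below applies a
  changeset to a tree using only tags found in the tree itself. The
  data layout (a dict of path to {"tag", "text"}) and the changeset
  fields ("tag", "rename_to", "text") are hypothetical, invented for
  this illustration; they are not arch's or svn's actual formats.

```python
def index_by_tag(tree):
    """Map each inventory tag to the path that currently carries it."""
    index = {}
    for path, entry in tree.items():
        tag = entry["tag"]
        if tag in index:
            raise ValueError(f"tag {tag!r} on both {index[tag]} and {path}")
        index[tag] = path
    return index


def apply_changeset(tree, changeset):
    """Apply renames and new contents addressed by tag, not by path.

    No repository or history is consulted: logical identity is read
    from the tags stored in the tree. Returns a new tree.
    """
    result = {path: dict(entry) for path, entry in tree.items()}
    by_tag = index_by_tag(result)
    for change in changeset:
        old_path = by_tag[change["tag"]]
        entry = result.pop(old_path)
        if "text" in change:
            entry["text"] = change["text"]
        new_path = change.get("rename_to", old_path)
        if new_path in result:
            raise ValueError(f"{new_path} is already occupied")
        result[new_path] = entry
        by_tag[change["tag"]] = new_path
    return result


# The recipient has already moved main.c somewhere the author never saw.
recipient_tree = {
    "source/main.c": {"tag": "t-main", "text": "old"},
    "README": {"tag": "t-readme", "text": "hello"},
}
changeset = [{"tag": "t-main", "rename_to": "lib/main.c", "text": "new"}]
patched = apply_changeset(recipient_tree, changeset)
```

  The change lands on the right file even though the recipient's path
  differs from the author's. In the copy-then-rename case, the user
  decides which copy keeps the identity simply by choosing which file
  carries the tag.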

* On Native-FS Storage Mgt.

  Roughly speaking, we're talking about using a changeset journal as
  the definitive record of the state of the repository, and building
  various kinds of supplementary data structures on top of that to
  achieve the access performance characteristics implicitly assumed in
  the svn design. Such a journal is fairly simple to implement on a
  native FS, as are the supplementary data structures.

  I remarked that introducing first-class project trees makes it more
  practical to build a native-fs storage mgr (meaning things like
  easier to implement, easier to understand, easier for users to tune,
  easier to process repositories with fewer code dependencies, etc.). I
  pointed out that it is not unreasonable to use project trees as the
  granularity of atomic commits (e.g., for locking), and that it might
  even be reasonable to tweak the UI semantics by asserting that
  atomic commits are not guaranteed except _within_ individual project
  trees (a tweak that would further simplify implementation).

  I also pointed out that repository revision numbers are a redundant
  mechanism given cheap tree copying, and that they complicate storage
  management implementation by adding the requirement that
  non-interfering commits be, nevertheless, serialized and numbered.

        Greg Hudson ("GH"):

        GH> You seem to have conceded a different point than I stated.
        GH> The app-level changeset journal idea isn't just orthogonal
        GH> to the choice of storage manager; it's orthogonal to the
        GH> idea of project subdirectories.

  I don't think I "conceded" that: I think I said so from the start.
  When I first proposed journaling, it was as a cure for the big BDB
  log file problem, and in that context, project trees weren't

  As a practical matter, journaling and first-class project trees
  interact nicely: two separate logs for two distinct project tree
  development lines have separate locks. Concurrent commits to the two
  projects simply don't interact. If every commit to a journal has
  to share a single lock just to begin to sort out conflicting
  concurrent commits, that's an added implementation complexity.

  Beyond locking, having the journal pre-sorted by project tree
  development line means that for any query confined to a single
  project tree, I can find all of the relevant journal entries in an
  O(1) operation (just "look at the right part of the journal") rather
  than having to search the journal or rely on an external index of
  it. An example of such a query is a client asking: "is my wc of
  this project tree up to date?" -- that translates into "give me the
  list of journal entries for this project tree".
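
  An in-memory Python sketch of that arrangement follows. It is only
  an illustration under stated assumptions: the class and method names
  are made up, and a real native-FS journal would presumably use one
  directory and one file lock per development line rather than
  threading locks and lists.

```python
import threading


class ProjectJournal:
    """Changeset journal pre-sorted by project-tree development line."""

    def __init__(self):
        # Held only briefly, to find or create a line's compartment.
        self._registry_lock = threading.Lock()
        self._lines = {}  # line name -> (lock, list of changesets)

    def _line(self, name):
        with self._registry_lock:
            if name not in self._lines:
                self._lines[name] = (threading.Lock(), [])
            return self._lines[name]

    def commit(self, name, changeset):
        """Append to one line's journal; other lines are not locked."""
        lock, entries = self._line(name)
        with lock:
            entries.append(changeset)
            return len(entries)  # revision number within this line

    def entries_since(self, name, revision):
        """Entries a working copy at `revision` has not yet seen."""
        lock, entries = self._line(name)
        with lock:
            return list(entries[revision:])


journal = ProjectJournal()
first_a = journal.commit("projA/trunk", "a1")
first_b = journal.commit("projB/trunk", "b1")
second_a = journal.commit("projA/trunk", "a2")

# "Is my wc of projA/trunk at revision 1 up to date?"
pending = journal.entries_since("projA/trunk", 1)
```

  The up-to-date query is a single dictionary lookup followed by a
  slice of that line's entries; commits to projB never contend for
  projA's lock.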

>> For svn-like performance, I was thinking more of roughly a
>> single full-text, but with indexed changesets and cached
>> skip-delta changesets.

        GH> It seems like by the time you're done with this cache, it
        GH> would look very much like our current filesystem structure.
        GH> Certainly, I don't think it would be any simpler for the
        GH> introduction of project directories.

  It would have some gross structural resemblance to the current
  structure, sure. For example, both would have skip-deltas. Both
  need tools that compute skip deltas.

  Leaving aside the benefits of first class project trees for a
  second: a native-FS cache-based implementation of things like
  skip deltas is simpler because you don't have all the hair (code
  dependencies, performance impacts, etc) of building the skip
  deltas in the same transactional database that's processing
  commits. The skip delta cache can be a nicely orthogonal
  component with a procedural API and an implementation whose storage
  management issues are separate from everything else and nicely

  Bringing project trees back in: they simplify things by allowing you
  to compute skip deltas for project trees, rather than individual
  nodes. If you index a skip-delta cache by project tree, you'll have
  N compartments of deltas; if you index by node or node/copy_id,
  you'll have N * K where, typically, K is in O([100...5000]). If a
  query needs deltas that pertain to a particular file within a
  project tree, for example, a trivial linear search of some
  delta-chain for that project tree is a practical solution -- given
  first-class project trees.
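
  A toy Python version of such a cache, to show the shape of the idea.
  The structure (project name mapped to a chain of (revision, delta)
  pairs, each delta a dict of path to diff text) is an assumption for
  illustration only; it does not model skip-delta selection itself.

```python
def deltas_touching(cache, project, path):
    """Linear scan of one project's delta chain for a given file."""
    return [revision for revision, delta in cache[project] if path in delta]


def compartments_by_project(cache):
    """Number of compartments when the cache is indexed by project tree."""
    return len(cache)


def compartments_by_node(cache):
    """Number of compartments if the same deltas were indexed per node."""
    return len({
        (project, path)
        for project, chain in cache.items()
        for _, delta in chain
        for path in delta
    })


# Hypothetical cache holding two project trees.
cache = {
    "projA": [
        (1, {"src/a.c": "diff-1", "src/b.c": "diff-2"}),
        (2, {"src/b.c": "diff-3"}),
        (3, {"doc/README": "diff-4", "src/a.c": "diff-5"}),
    ],
    "projB": [
        (1, {"main.py": "diff-6"}),
    ],
}

a_revisions = deltas_touching(cache, "projA", "src/a.c")
```

  Here the per-project index has 2 compartments and the per-node index
  4; with real trees of hundreds or thousands of files the gap is the
  N versus N * K described above.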

        GH> Right now the filesystem namespace is uniform and owned
        GH> completely by the user. (The trunk/branches/tags
        GH> convention is just a suggestion to the user; there are no
        GH> plans to make any Subversion software assume or enforce
        GH> the convention.) I consider that elegant. What you're
        GH> suggesting makes the top one or two levels of the
        GH> namespace owned by the implementation, and introduces the
        GH> notion of an implicit symlink for the head revision of a
        GH> project directory.

  Well, it's an "elegant _something_". I agree with that. By
  startling coincidence, slightly before svn started, I wrote a very
  similar little free-form transactional file system, layered on a
  library for "[functional, transactional] persistent hash tries"
  rather than BDB. It was a neat approach for many reasons, one of
  which, amusingly enough, is that it didn't need a write-ahead log to
  achieve ACID properties. Considered _just_as_ a transactional file
  system, I think the elegance you refer to is _profound_ -- it's a
  hop, skip, and a bounce away from being an RDBMS-killer, for
  example. The big challenge, there, is to get to an implementation
  that satisfies a very wide range of access-pattern performance
  constraints -- and the first step in that challenge is figuring out
  a semantics that doesn't preclude (nay, even prepares) achieving
  such performance.

  But is the model an "elegant revision control system" in and of
  itself? I don't think so -- at least not for projects "at scale".
  Rather, I think that, from a bird's-eye perspective, the txnal file
  system is an elegant storage manager for revision control -- but the
  additional structure is critical to supporting the roles of revision
  control in project mgt., auditing, scalable development and so
  forth. On the topic of "trunk/branches/tags" -- another thread has
  talked about the mechanism to initialize that recommended structure
  and in that thread, someone or other said (and I'm paraphrasing
  here, but just barely) "svn is _useless_ [for revision control]
  without that additional structure". You may or may not recall my
  advocating, many months ago, the idea of reconceptualizing the
  project into two layers: a txnal fs storage manager, and a revision
  control system layered on top of that. I still advocate that.

  So as for those top few directory levels being "owned by the
  implementation" -- I'd say, they are generic as far as the txnal fs
  is concerned, but special as far as the revision control system is
  concerned. The documentation already recommends usage patterns that
  make that so -- it pays off nicely to leverage that in the

        GH> On consideration, I don't think you gain any elegance in
        GH> the UI by eliminating revision numbers, because you still
        GH> need to be able to, say, check out by date, or diff
        GH> against the previous revision of a file. So it wouldn't
        GH> be a simple matter of eliminating the -r option to every
        GH> command which currently takes it.

  At least two ways you gain elegance by assigning revision numbers
  within separate lines of development rather than globally to a
  repository are: (a) the numbers become more meaningful to users, and
  (b) the numbers don't exceed the range implied by the lifetime of a
  development line. An example of (a), the previous change to
  development line at $project-line/<N> is reliably found in
  $project-line/<N-1>, rather than the previous change to
  $project-line@<N> being found in $project-line/<N-$random()>. As an
  example of (b), in my few-years-old repository which contains 15
  busy projects, revision numbers for project lines stay in the 3-4
  digit range rather than the 6-7 digit range.
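
  Point (a) can be shown with a few lines of Python. This is a sketch
  under assumptions: the commit log is modelled as a plain list of
  development-line names in commit order, and the function names are
  invented for the example.

```python
def previous_change_global(log, line, revision):
    """With repository-wide numbers, search backwards for the line.

    log[r - 1] names the development line changed by global revision r.
    Returns the earlier global revision that touched `line`, or None.
    """
    for r in range(revision - 1, 0, -1):
        if log[r - 1] == line:
            return r
    return None


def per_line_numbers(log):
    """Number each commit within its own development line instead."""
    counters = {}
    numbered = []
    for line in log:
        counters[line] = counters.get(line, 0) + 1
        numbered.append((line, counters[line]))
    return numbered


# Eight commits interleaved across three hypothetical project lines.
log = ["A", "B", "C", "B", "A", "C", "C", "A"]

# Global numbering: the change to A before revision 8 has to be found.
before_8 = previous_change_global(log, "A", 8)

# Per-line numbering: the same commit is A/3, so its predecessor is A/2.
latest_a = per_line_numbers(log)[-1]
```

  Under global numbering the predecessor of revision 8 on line A turns
  out to be revision 5, which nobody could guess from the number; under
  per-line numbering it is simply N-1.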

  You also gain implementation-degrees-of-freedom elegance because
  there are fewer txns that need to be definitively serialized.

        [On scaling to high throughput and multi-server

>> 1) Gee, you know, actively planning _against_ that seems
>> short-sighted.

        GH> Not really. I don't see Subversion evolving in that
        GH> direction. I see better cross-repository support as a
        GH> much better direction to look towards than introducing the
        GH> concept of partitioning within repositories. And even if
        GH> I'm wrong, a central point which does nothing but assign
        GH> revision numbers can scale to a really really high level
        GH> of throughput; you'd have to have millions of servers
        GH> before that central point would become a bottleneck.

  That repo-revnum server becomes an administrative and robustness
  bottleneck and an extra burden on implementations even if, in
  theory, a perfectly functioning instance of such a server is not a
  serious performance bottleneck.

  Beyond that -- while you may choose not to see svn's txnal fs
  scaling that way, I think you have to anticipate txnal fs' coming
  along that _do_ scale that way in the not too distant future. So
  layering a rev ctl system on those is the not too distant future.
  If svn semantics don't fit such a scalable txnal fs, then they
  aren't the not too distant future.

* On Ethics, Business, and "CollabNet's Corporate Interests"

  I remarked, off-handedly, to Brane that as a practical matter,
  much of what I was talking about was really of most direct interest
  to CollabNet, even though it came up while talking to him.

  In reply (not just from Brane), I was told that I was "backsliding"
  (?!?) and accused of trying to (paraphrased) "Go over the heads of
  developers." In general, I got the sense that the topic of
  business interests in public free software projects, where the
  businesses in question are participating in such projects, is
  supposed to be somehow "taboo" -- that it is assumed to be impolite
  or crazy or underhanded to even mention them.

  I pointed out that those business interests can not and should not be
  ignored, that there are naturally conflicting agendas in the project
  and that some of those agendas come from the corporate perspective.
  I described this as a "conflict of interest" and suggested that it
  need not be a harmful conflict -- but rather is an interesting and
  useful opportunity for requirements analysis and resource

  Greg Stein had an interesting and helpfully diplomatic take:

        me (">>"):

>> Alas, in saying that, I suppose that I'm in some sense
>> speaking more through you to collabnet than to you
>> directly.

        brane@xbc.nu ("B")

        B> Tom, you're backsliding again. :-) Let's leave CollabNet's
        B> commercial interests out of this.

>> Excuse me?

>> First, I don't find that funny and so I don't understand
>> your ":-)".

        Greg Stein ("GS")

        GS> I think Brane means to work through the design issues and
        GS> leave out the commercial interests. There isn't any reason
        GS> to bring commercial issues to the table when you're doing
        GS> design work.

  Ah. I think that is completely wrong (not your interpretation of
  Brane -- the attitude conveyed in that interpretation). "Commercial
  interests", ideally, are a cost/benefit summary of the interests of
  potential users of the system. They are, therefore, valuable input
  to the design process. In the other direction, the commercial
  interests benefit from "what is the nature of the design space
  reality" input from the design process. So on the contrary --
  "leaving out" commercial interests seems to me to be precisely the
  wrong thing to do -- bidirectional feedback, on the other hand,
  seems to me to be the right thing to do.

>> Second, are you saying that "CollabNet's commercial
>> interests" do not have a significant impact on those core
>> developers who are employed by svn or that they do not, in
>> turn, have significant impact on the plans for and design
>> of svn?

        GS> Yes and yes. But the impact on the plans/design is granted
        GS> by the community rather than enforced/required by
        GS> CollabNet. There is a large difference there, and one that
        GS> I'm happy about.

  In private communication, I have heard that collabnet's commercial
  interests most definitely do have a significant impact on core
  developers employed by CollabNet, and on the list, I observe that
  those developers have a significant impact on the project's plans,
  progress, and design efforts.

  It's a sad, sad commentary on the state of _class_ in the world that
  my speaking of the interests of CollabNet evokes in reply (a) a
  defense that they aren't "enforcing", and (b) an accusation that I am
  "going over developers' heads". Implicit in both replies is an
  association of corporate interests with AUTHORITY -- in the case of
  (a), wielded benevolently, in the case of (b) presumed to be a
  potentially hostile force which I'm trying to invoke with a magic

  Healthy business activities are based on the win-win scenario. That
  applies to CollabNet's relationship with its employees, its
  customers, and its relationship with public projects. I tend to
  believe that the best win-win scenario here is not to deny the
  influence of, or to make a taboo topic of, CollabNet's business
  interests but rather to permit those interests to be a first-class
  topic of discussion with the aim of optimizing their relationship to
  everything else.

  To be slightly more explicit: on the one hand, I have a bunch of
  technology in arch that, as is starting to become clear in various
  branches of this discussion, is applicable to svn. I've tried to
  explain why some of that technology is particularly important to
  commercial users of revision control (e.g., the relationship between
  business rules and merge history). I tend to believe that those are
  issues that are important to consider from the CollabNet
  perspective. At the same time, my financial FUBAR is fairly widely
  known, and that situation has immediate, significant impact on the
  likely future evolution of the technology I'm referring to.

  And just to reiterate:

>> Consequently, ruling CollabNet's business interests to be a
>> taboo topic is extraordinarily inappropriate.]

        GS> It isn't taboo. I think he's just trying to say that it
        GS> doesn't have a place in technical design discussions.

  "backsliding?", "going over developer's heads?" -- those are
  taboo-making replies.

        GS> Talk about users' needs, sure. That is great, and is
        GS> essential. But there isn't much need to worry about
        GS> CollabNet's users specifically. (although I do appreciate
        GS> the consideration :-)

  To state the obvious, `design' or `engineering' doesn't happen in a
  non-economic context. Just the opposite. A hefty fraction of the
  economic resources dedicated to subversion come, precisely, from
  estimates of its commercial value. To say that "design decisions"
  should not consider that fact is naive and seems to me to be rooted
  in a very sad perception of the relationship between corporations
  and the free software world.

  Alas, my replies also just inflamed the argumentation:


>> In such circumstances, in all walks of life, civil society
>> makes the judgement that "There is a conflict of interset
>> there." We _never_ burden people embraced in such a
>> conflict with the responsibility to separate out concerns
>> (a) and (b). We _always_ assume that even the most
>> reasonable, well intentioned persons can not separate out
>> such interests when they collide in a single mind.

        Dan Berlin ("DB"):

        DB> Not quite.

        DB> In law, at least, we allow it in quite a few cases without
        DB> anything further required, and in almost all cases if the
        DB> clients consent after consultation.

  Yes, but "consent" does not always imply that the conflict ceases to
  exist or is presumed to have no effect. It may imply that the
  conflict is presumed to be manageable or even beneficial.

  I've suggested that we presume, in this case, that it is both manageable
  and beneficial. I'll reiterate a few points:

  CollabNet ideally (and, I think, in practice to some extent)
  represents the interests of an interesting group of potential
  revision control consumers. Their influence should be regarded in
  that light as "valuable suggestions for design and project
  mgt. constraints". For example, when Karl says (paraphrasing)
  "Uh.. this is all interesting and such ... but I think shrinking the
  issue list is priority #1" -- I think it's reasonable to interpret
  that, partly, as "CollabNet's contributions to the project can be
  maximized by minimizing the time to 1.0".

  And I think it's a bi-directional relationship. For example, I
  don't think I'm exactly a slouch in the areas of professional source
  mgt. and revision control. As I said, I've had some interesting
  mentorship in this area, and some interesting communications with
  people dealing with really substantial commercial source
  mgt. headaches --- and of course the experience with arch. So while
  CollabNet has some market insight to bring to the svn project, I
  think I have some market and technology insight to offer to

        DB> I'm not sure whether this puts lawyers in the category of
        DB> unreasonable, non-well-intentioned, or not having a mind,
        DB> or some combination of the three.

        DB> Or of course, you could just be plain wrong.

  Or, perhaps, you are just looking for an excuse to flame and defame
  me without looking very carefully at what I'm actually saying.

        DB> Have you actually looked at the codes of ethical conduct
        DB> adhered to by other licensed practioner fields (IE
        DB> psychiatry, etc)?

  At least as much as the next guy... probably a bit more. I've also
  looked at these issues from some other valuable perspectives such as
  academic philosophy and the philosophy of ethics.

* Separating V.A.P. from Merge History

  Sander's proposal talks about v.a.p. being the merge mechanism,
  but the proposal is really about merge history.

  These are separate concerns. There are many merge algorithms,
  v.a.p. being just one. Merge history is relevant to many of those

  It's a useful touchstone to make sure that the proposed history
  mechanism is sufficient for v.a.p. -- but a mistake to think that
  v.a.p. is the only algorithm it should support.

  Rather than v.a.p. alone -- it would be useful to have a list of
  touchstones by which to judge history mechanisms. These should
  include alternative merge algorithms, sure -- but also uses of merge
  history in situations where repository access is not assured, and
  generation and application of changesets in situations where
  repository access is not assured.


Received on Tue Apr 15 00:01:49 2003

This is an archived mail posted to the Subversion Dev mailing list.