Re: [subversion-dev] Subversion design document up

From: Zack Weinberg <zack_at_wolery.cumb.org>
Date: 2000-06-03 23:03:11 CEST

On Sat, Jun 03, 2000 at 12:28:14AM -0500, Karl Fogel wrote:
> Okay, answers to a lot of questions can now be found at:
>
> http://subversion.tigris.org/project_docs.html
>
> Programmers looking for design and API descriptions, this is the place.
>
> We're going to start coding now. It would of course be silly to ask
> for volunteers before we have running code or even header files :-),
> but if anyone has comments/thoughts on the design so far, please post.

I have a number of suggestions. Some of them may already be covered
but it's not immediately obvious.

There are two separate merge schemes described. One handles merges
between branches (clones), another is for when you have edited your
working copy and someone else commits changes under your feet. The
latter scheme is substantially less powerful, but handles the common
case; this is bad.

A better arrangement would be to cause an implicit branch to occur
when the conflict is detected. If B is the base, Z is my changes and
Y is yours, create this subtree:

B --- Z
\--- Y

Then merge Y into Z using the branch-merge algorithm. There are
multiple advantages to this approach. First, the branch-merge
algorithm has more information available to it and is therefore less
likely to mangle the merge. Second, if you go on working on your
implicit branch and don't pick up my changes:-

   B --- Z -- M
    \ /
     --- Y ------ Y2

then I can pick up Y2 using the ancestor-set algorithm to avoid
merging the changes from Y again. Third, it's possible to recover
the original Z, which did not include the changes in Y. If I discover
that Y is unusable, I can go back to Z and continue working on my own
code until the Y branch becomes functional again.

           ------ Z2 .... Zn
          / \
   B --- Z -- M M2
    \ / /
     --- Y ------ Y2 .... Yn

Obviously a scheme like this is going to use lots of implicit
branches. I *think* the clone mechanism can handle them just fine,
and if it can't it should be made to do so - you really don't want
multiple branch abstractions.

You may or may not have noticed that this requires "working
repositories". That's listed under blue sky items at the moment,
probably because you don't see why it's a critical feature. The
psychological comfort is _not_ the important reason for personal
repositories. The primary reason is that it makes the above merge
technique *possible*. You cannot implement it unless Z is a
full-fledged version instead of a set of modified working files.

Another reason to do working repositories immediately is that it gets
you mirror repositories for free. A mirror is just a working
repository with an active server to export it to the world. And that,
in turn, means you can distribute your own personal hacks on top of
some tree. Sounds like a marginal feature? Not so. Consider this
hierarchy:

             linus
         / | \
      alan ingo davem
      / \
     rik ingo

... All the lower nodes being people who have maintained diffs
relative to the 'official' kernel tree - or relative to someone else's
modified tree! - for extended periods of time. Or, try this one on
for size:

           4.4
           | \
           F N
           | X | \
           F N O
           | X | X |
           F N O
           : : :

which doesn't convey either the exact history or the extent of the
patches being lobbed back and forth, but what do you expect from ASCII
art...

And another reason why you want working repositories is it cuts the
number of primitive operations over the network down to one. Yes,
one. It's called "resync", and it works rather like rsync (hence the
name). Each root node must have a unique identifier for this - in the
presence of distributed repositories, the serial number does not
suffice. Anyway, a resync has a source and a destination. The
destination sends to the source a list of all its node IDs. The
source calculates the set of all nodes present in its repository which
the destination does not have, and sends the deltas for those nodes
back. Destination applies those nodes, possibly resolving conflicts.
The conflict-resolution algorithm is just the normal branch-merge
algorithm.

Optionally, source can send back a list of nodes that destination
has but it doesn't, and destination can send those to source. Also
optionally, destination can remember all the nodes known to be present
in both trees, so it doesn't have to send a complete list to source.
But the lists are going to be small, anyway.

Creating a mirror or personal repository is just the special case
where the destination's list is empty. You probably want a special
transfer protocol for this case, which doesn't break everything down
into deltas and build it up again at the end.

---
There isn't a lot of discussion of how data and metadata are going to
be stored on disk.  (One offhand mention of berkeley DB and/or a full
blown SQL database - either one seems like overkill to me, frankly.)
I would suggest that each node in the tree be two files.  One contains
metadata and is only ever appended to.  Another contains the file data
(if any - directory nodes don't need this) and is updated by writing a
new copy and renaming it into place.  You may or may not need a
per-file locking protocol.
Everything should be checksummed.  Exercise: Pick any large CVS
repository.  Scan through all of the files.  I guarantee you will find
at least one file which has lost at least one disk block: 2^n bytes
will have been overwritten with nulls.  Odds are that happened so long
ago that no backup records the original contents of that block.  No
one has noticed because no one's wanted a revision back that far -
yet.
For storing file data, I'd recommend looking at the "weave" algorithm
used by SCCS.  It stores the file as an array of lines with
annotations which say whether or not a given line belongs in some
revision.  This has two major wins over the current+reverse diffs
format used by RCS.  First, all revisions take the same time to
extract, and you do not have to edit the working file in memory.  A
single linear scan over the archive file, selecting lines and writing
them out, suffices.  This is particularly desirable if you have many
branches.  Second, it provides the concept of included and excluded
deltas, which let you do many merge operations just by adding an entry
to the file index - no modification to the archive is needed.
The trouble with the weave algorithm is it's line oriented and
therefore won't play nice with vcdiff.  SCCS itself handles binary
files horribly (it uuencodes them and then tries to diff the results,
yuck).  But SCCS' real problem with binaries is it uses \n\001 as an
escape sequence.  You could trivially avoid that, and with a bit more
effort I bet weave could be made to work on arbitrary byte runs
instead of lines.
zw

Received on Sat Oct 21 14:36:05 2006

This message: [ Message body ]
Next message: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Previous message: Karl Fogel: "Subversion design document up"
In reply to: Karl Fogel: "Subversion design document up"
Next in thread: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Reply: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Maybe reply: Karl Fogel: "Re: [subversion-dev] Subversion design document up"
Reply: Jonathan S. Shapiro: "Re: [subversion-dev] Subversion design document up"
Maybe reply: Karl Fogel: "Re: [subversion-dev] Subversion design document up"
Maybe reply: Karl Fogel: "Re: [subversion-dev] Subversion design document up"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]