Re: [subversion-dev] Subversion design document up

From: Karl Fogel <kfogel_at_galois.collab.net>
Date: 2000-06-05 20:07:46 CEST

Regarding the local-repository-branching issue you mentioned:

I think there's general agreement that it's an important thing to
support, but everything that comes before it is necessary
understructure anyway. In other words, saying "We need to have this
feature in the 1.0 release" would be more a matter of shifting the
"1.0" label moment than about changing when implementation of the
feature actually happens.

Since our stated goal for 1.0 has simply been to support basically
what CVS supports, this feature would go in the "immediate post-1.0"
category. Along with "cvs annotate" functionality, local-repos branch
support is probably our highest priority following the initial
release.

Regarding storage: it has to work, and not corrupt data, yes. :-)

-Karl

Zack Weinberg <zack@wolery.cumb.org> writes:
> I have a number of suggestions. Some of them may already be covered
> but it's not immediately obvious.
>
> There are two separate merge schemes described. One handles merges
> between branches (clones), another is for when you have edited your
> working copy and someone else commits changes under your feet. The
> latter scheme is substantially less powerful, but handles the common
> case; this is bad.
>
> A better arrangement would be to cause an implicit branch to occur
> when the conflict is detected. If B is the base, Z is my changes and
> Y is yours, create this subtree:
>
> B --- Z
> \--- Y
>
> Then merge Y into Z using the branch-merge algorithm. There are
> multiple advantages to this approach. First, the branch-merge
> algorithm has more information available to it and is therefore less
> likely to mangle the merge. Second, if you go on working on your
> implicit branch and don't pick up my changes:-
>
> B --- Z -- M
> \ /
> --- Y ------ Y2
>
> then I can pick up Y2 using the ancestor-set algorithm to avoid
> merging the changes from Y again. Third, it's possible to recover
> the original Z, which did not include the changes in Y. If I discover
> that Y is unusable, I can go back to Z and continue working on my own
> code until the Y branch becomes functional again.
>
> ------ Z2 .... Zn
> / \
> B --- Z -- M M2
> \ / /
> --- Y ------ Y2 .... Yn
>
> Obviously a scheme like this is going to use lots of implicit
> branches. I *think* the clone mechanism can handle them just fine,
> and if it can't it should be made to do so - you really don't want
> multiple branch abstractions.
>
> You may or may not have noticed that this requires "working
> repositories". That's listed under blue sky items at the moment,
> probably because you don't see why it's a critical feature. The
> psychological comfort is _not_ the important reason for personal
> repositories. The primary reason is that it makes the above merge
> technique *possible*. You cannot implement it unless Z is a
> full-fledged version instead of a set of modified working files.
>
> Another reason to do working repositories immediately is that it gets
> you mirror repositories for free. A mirror is just a working
> repository with an active server to export it to the world. And that,
> in turn, means you can distribute your own personal hacks on top of
> some tree. Sounds like a marginal feature? Not so. Consider this
> hierarchy:
>
> linus
> / | \
> alan ingo davem
> / \
> rik ingo
>
> ... All the lower nodes being people who have maintained diffs
> relative to the 'official' kernel tree - or relative to someone else's
> modified tree! - for extended periods of time. Or, try this one on
> for size:
>
> 4.4
> | \
> F N
> | X | \
> F N O
> | X | X |
> F N O
> : : :
>
> which doesn't convey either the exact history or the extent of the
> patches being lobbed back and forth, but what do you expect from ASCII
> art...
>
> And another reason why you want working repositories is it cuts the
> number of primitive operations over the network down to one. Yes,
> one. It's called "resync", and it works rather like rsync (hence the
> name). Each root node must have a unique identifier for this - in the
> presence of distributed repositories, the serial number does not
> suffice. Anyway, a resync has a source and a destination. The
> destination sends to the source a list of all its node IDs. The
> source calculates the set of all nodes present in its repository which
> the destination does not have, and sends the deltas for those nodes
> back. Destination applies those nodes, possibly resolving conflicts.
> The conflict-resolution algorithm is just the normal branch-merge
> algorithm.
>
> Optionally, source can send back a list of nodes that destination
> has but it doesn't, and destination can send those to source. Also
> optionally, destination can remember all the nodes known to be present
> in both trees, so it doesn't have to send a complete list to source.
> But the lists are going to be small, anyway.
>
> Creating a mirror or personal repository is just the special case
> where the destination's list is empty. You probably want a special
> transfer protocol for this case, which doesn't break everything down
> into deltas and build it up again at the end.
>
> ---
>
> There isn't a lot of discussion of how data and metadata are going to
> be stored on disk. (One offhand mention of berkeley DB and/or a full
> blown SQL database - either one seems like overkill to me, frankly.)
> I would suggest that each node in the tree be two files. One contains
> metadata and is only ever appended to. Another contains the file data
> (if any - directory nodes don't need this) and is updated by writing a
> new copy and renaming it into place. You may or may not need a
> per-file locking protocol.
>
> Everything should be checksummed. Exercise: Pick any large CVS
> repository. Scan through all of the files. I guarantee you will find
> at least one file which has lost at least one disk block: 2^n bytes
> will have been overwritten with nulls. Odds are that happened so long
> ago that no backup records the original contents of that block. No
> one has noticed because no one's wanted a revision back that far -
> yet.
>
> For storing file data, I'd recommend looking at the "weave" algorithm
> used by SCCS. It stores the file as an array of lines with
> annotations which say whether or not a given line belongs in some
> revision. This has two major wins over the current+reverse diffs
> format used by RCS. First, all revisions take the same time to
> extract, and you do not have to edit the working file in memory. A
> single linear scan over the archive file, selecting lines and writing
> them out, suffices. This is particularly desirable if you have many
> branches. Second, it provides the concept of included and excluded
> deltas, which let you do many merge operations just by adding an entry
> to the file index - no modification to the archive is needed.
>
> The trouble with the weave algorithm is it's line oriented and
> therefore won't play nice with vcdiff. SCCS itself handles binary
> files horribly (it uuencodes them and then tries to diff the results,
> yuck). But SCCS' real problem with binaries is it uses \n\001 as an
> escape sequence. You could trivially avoid that, and with a bit more
> effort I bet weave could be made to work on arbitrary byte runs
> instead of lines.
>
> zw
Received on Sat Oct 21 14:36:05 2006

This message: [ Message body ]
Next message: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Previous message: Daniel L. Rall: "Re: [subversion-dev] Subversion design document up"
Maybe in reply to: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Next in thread: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"
Reply: Zack Weinberg: "Re: [subversion-dev] Subversion design document up"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]