cvs2svn and branches (long)

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2003-03-07 22:34:41 CET

For a while now, I've been pondering the problem of how cvs2svn should
map CVS tags/branches to Subversion tags/branches. It's definitely
time to come out of the closet and discuss this on the list :-).

Marko, I know you've been thinking about this problem too, and that
much of this mail is old news to you. You probably even have some
solutions already. But let's nail down the desired behavior in prose,
before making any further code changes. This problem is complex
enough that we should make sure we all agree on the goals, before
writing code to meet those goals.

All examples below use the standard Greek tree, as described in
subversion/tests/greek-tree.txt. Also, the text below just talks
about branches, for simplicity, but I think all the same arguments
apply to tags.

First let's take the simple case of branching a single file:

Single File:
============

Suppose the user does this:

$ cvs tag -b SMALL_BRANCH A/D/G/pi

In Subversion, it's reasonable for that to appear in the repository
like this:

/branches/SMALL_BRANCH/A/D/G/pi

This will require the intermediate directories to be made in the txn
first. In the txn, the intermediate directories should be done with
mkdir, not copy. This is because the CVS user did not do anything to
indicate directoryish intentions, so to speak. Only pi itself would
be the result of a copy.

Note also that pi would be the only leaf node under that branch. For
example, these would *not* exist:

   /branches/SMALL_BRANCH/iota
   /branches/SMALL_BRANCH/A/D/H
   /branches/SMALL_BRANCH/A/D/G/rho
   /branches/SMALL_BRANCH/A/D/G/tau
   ... etc ...

(Unless, of course, someone had branched another file using the same
branch name, coincidentally or otherwise.)

Anyway, I think the above is the most reasonable result for the single
file case, since CVS doesn't version directories, and Subversion can
only represent a branch by means of a corresponding path under the
'/branches' directory.

Multiple Files:
===============

The logic for branching a set of files is similar to the single-file
logic. Of course, in CVS, you can't distinguish between a directory
being branched and the files in it *individually* being branched, or
some subset of them being branched... which leads us to some
interesting choices. Plus there's the issue of mixed-revision
branches (possible when the user did 'cvs tag' instead of 'cvs rtag').

Ideally, when we can detect that a majority of files in a given
directory have all been branched to the same branch, and that it
happened on revisions which were all in HEAD together at some point in
time, then we would create the parent directory using copy instead of
mkdir. For example

$ cvs tag -b MEDIUM_BRANCH A/D/G

should cause G and all its children to exist in the repository:

   /branches/SMALL_BRANCH/A/D/G/
   /branches/SMALL_BRANCH/A/D/G/pi
   /branches/SMALL_BRANCH/A/D/G/rho
   /branches/SMALL_BRANCH/A/D/G/tau

In the Subversion txn, this would look something like

   $ svn mkdir /trunk /branches/MEDIUM_BRANCH
   $ svn mkdir /trunk/A /branches/MEDIUM_BRANCH/A
   $ svn mkdir /trunk/A/D /branches/MEDIUM_BRANCH/A/D
   $ svn cp /trunk/A/D/G /branches/MEDIUM_BRANCH/A/D/G

It's possible that /branches/MEDIUM_BRANCH might already exist, say,
because someone did 'cvs tag -b MEDIUM_BRANCH A/D/H'. That's okay; we
only create the directories we need to create. The deeper copies will
override the higher ones in 'svn log' and such.

The real trickiness starts when we try to deduce the revision/date
that will catch the largest possible set of files that belong in the
initial branch creation. Files in that set can be copied implicitly,
by being under the highest relevant directory (from that rev/date),
which gets copied to the appropriate place in /branches/PATH.../. All
files not in that set must be manually copied into the new branch, of
course.

It's also possible to have situations where all but a few files are
branched (in CVS, maybe the list of files was given explicitly, or
maybe someone deleted the branch tag from certain files later). In
such cases, we probably want to copy the highest possible parent
directory, as usual, then *delete* the unbranched files from the new
copy, before committing the txn.

All of this implies that before committing anything to Subversion,
we'll need to gather the full set of CVS commits. (Is that where
you're heading in your patches, re a "cvs2svn-data.commits" file?).

And we'll have to consider the *creation* of branches and tags to be
like CVS commits, in the sense that they map to Subversion commits,
even though in CVS these actions do not involve running 'cvs ci'.

In other words, there will be more commits in the Subversion
repository than there were in CVS. Thus, the total number of
Subversion commits will be roughly the number times someone typed 'cvs
ci' *plus* the number of unique symbolic names in the CVS repository.

But for the latter kind of commit, we can't use our usual method of
deducing the atomic commit group, because the relevant file revisions
can be arbitrarily far apart and may not share author and log message.

So when in the revision history does a particular branch's creation
get inserted? To answer this question, we have to find all the
files/revs on which the branch was created. Each file/rev pair maps
to a range of Subversion revisions... "But wait! I thought we don't
know the Subversion revisions yet, because we haven't finished
deducing the branch creations?"

Okay, so let me roughly describe an algorithm here :-). I'm doing
this more as an indirect way of describing the problem itself, though
our solution may look something like this. But please don't take it
as anything but a tentative proposal -- there's probably a better way
to do it.

In the first pass, cvs2svn.py doesn't try to deduce any
branch-creation commits. It just groups 'cvs ci' commits, the same
way it already does. We already remember symbolic names as we parse
each RCS file -- that is, we remember what revision each branch is
rooted in. This information is recorded in cvs2svn-data.*revs.

Now suppose that branch B was created on revision X.Y for /A/D/G/pi.
What we need is a table indicating that for file /A/D/G/pi, revision
X.Y was committed in Subversion revision N, and that later on, X.(Y+1)
was committed in revision M. This range, N->M, is the range of
possible Subversion revisions from which branch B can be copied, for
this file.

Obviously, we can't have this table by the time we're done parsing
/A/D/G/pi, because we won't have any Subversion commits grouped until
we've parsed *all* the RCS files. We also can't accumulate the
information in memory by making a pass over 'cvs2svn-data.s-revs',
because there might be an arbitrary amount of time and data separating
X.Y from X.(Y+1), for a given file.

My solution to this is to use Python's `anydbm' interface as a backing
store for what would otherwise be an in-memory operation. Pass 4
would now become

Run over cvs2svn-data.s-revs. For each line that claims a branch
creation, make an entry in a new working file

cvs2svn-branches/BRANCH_NAME.dbm

The entry maps file paths to revision ranges

filepath --> (N M)

where N is the Subversion revision corresponding to the CVS revision
X.Y in which the branch was created for that filepath, and M is the
Subversion revision corresponding to X.(Y+1).

As we're running over cvs2svn-data.s-revs, we also remember every
unique branch name. (It's okay to hold all branch names in memory;
even the nastiest CVS repository will only have a few thousand
branches. We just can't hold every file path or every revision in
memory.)

At the end of this pass, we have every branch name in memory, and we
have a dbm file for each branch, indicating, for each filepath in the
repository, the range of Subversion revisions from which that branch
*could* be copied.

Now we just have to loop over the branches, finding the "ideal"
Subversion revision, that is, the revision which if used to create the
branch, will necessitate the smallest number of manual secondary
copies, perhaps even zero.

My off-the-top-of-my-head suggestion for that algorithm would be:

Map each branch name to a list of pairs, each pair mapping a
Subversion revision range to the number of files available to the
branch from those revisions:

BRANCH_NAME ==> [ ((N, M), COUNT), ...]

Where (N, M) is the revision range, and COUNT is the number of files
that could be gotten from that range.

The list of pairs starts out empty. To fill it, iterate over all the
entries in cvs2svn-copies/BRANCH_NAME.dbm. For the first file, add

((N, M), 1)

For the next file, increment the count if N and M are the same as the
previous entry, or else splice it in in the obvious way, keeping the
list of pairs ordered by N and M, such that a lower N comes first, and
within a given N, a higher M comes first (i.e., ranges are ordered,
with wider ranges earlier).

Notice that a given file may be counted in more than one pair; the
total of all COUNTs may be much higher than the total number of files
recorded in the .dbm file.

When all files have been processed, run over the list of pairs,
choosing one with the highest available count. We'll use a revision
from that range to make the branch copy, and then deal with missing
items and wrong-revision items by manually deleting and/or copying.

(If these lists get too big to store in memory, we can write them out
to cvs2svn-copies/BRANCH_NAME.ranges or something.)

Of course, once we start inserting these new branch-creation revisions
into the revision history, the revision ranges recorded in the .dbm
files will be wrong :-). So before we can commit anything, we have to
run over cvs2svn-data.s-revs again, now that we have the branch
creation information, and write out 'cvs2svn-data.commits or whatever,
which includes lines describing the branch commits. This means that
each time we insert a branch commit, we increment an accumulator, and
as we're writing out *next* branch commit, we add the current
accumulator value to the "source revision" for the copy. Otherwise,
we'll get further and further "off" as we go.

When it's all done, we have a .commits file. The final pass runs over
that file, making commits to the Subversion repository.

I've hand-waved on some details, of course, such as exact information
recorded in the .commits file. Or: sometimes the same name is tag in
one place but a branch in another, and Subversion has to correctly
split such trees between /tags and /branches. Etc. But I hope the
general idea is clear here.

Greg Stein also described a two-pass algorithm for discovering the
best-copy-revision during a phone call; I'm embarrassed to say I don't
remember it well enough to describe it here :-), but it wasn't
entirely dissimilar to the above.

So Marko: am I on the right page with all this? And is this stuff
you've already started, or were you going about things differently?

Thanks,
-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri Mar 7 23:12:11 2003

This message: [ Message body ]
Next message: kfogel_at_collab.net: "release-0.19 branch passes svncheck"
Previous message: Michael Price: "apache2 HEAD heads up"
Next in thread: Greg Stein: "Re: cvs2svn and branches (long)"
Reply: Greg Stein: "Re: cvs2svn and branches (long)"
Reply: Marko Macek: "Re: cvs2svn and branches (long)"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]