
Re: cvs2svn and branches (long)

From: Greg Stein <gstein_at_lyra.org>
Date: 2003-03-08 00:52:59 CET

On Fri, Mar 07, 2003 at 03:34:41PM -0600, Karl Fogel wrote:
>...
> The logic for branching a set of files is similar to the single-file
> logic. Of course, in CVS, you can't distinguish between a directory
> being branched and the files in it *individually* being branched, or
> some subset of them being branched... which leads us to some
> interesting choices. Plus there's the issue of mixed-revision
> branches (possible when the user did 'cvs tag' instead of 'cvs rtag').

(or the user moved a tag manually)

>...
> And we'll have to consider the *creation* of branches and tags to be
> like CVS commits, in the sense that they map to Subversion commits,
> even though in CVS these actions do not involve running 'cvs ci'.

Right.

I just thought of something, too. You'll want to watch out for the case
where a branch appears to be created *after* a commit on the branch occurs.
There are a couple of ways that this can happen:

1) They are logically separate branches. The second branch was constructed
   after the first was set up and received some commits.

2) Somebody manually moved the branch base for some item to a revision of
   that item which occurs after a commit to some of the other branch items.

I'm not sure what the solution would be (beyond flagging an error and
ignoring the branch). You could probably just create two branches,
partitioned between the "original" branch and the items that created the
second branch. For case (2), this implies that a given commit (after the
second branch was created) might actually occur on both branches.

>...
> In the first pass, cvs2svn.py doesn't try to deduce any
> branch-creation commits. It just groups 'cvs ci' commits, the same
> way it already does. We already remember symbolic names as we parse
> each RCS file -- that is, we remember what revision each branch is
> rooted in. This information is recorded in cvs2svn-data.*revs.

Don't get boxed in by that. Storing the information in a different form, a
different file, with different info, or in a different sort order is all open
to change.

> Now suppose that branch B was created on revision X.Y for /A/D/G/pi.
> What we need is a table indicating that for file /A/D/G/pi, revision

I'm not sure this assumption is valid. While trying to figure out when a
branch is created, it doesn't feel like you actually need to know
information about the files -- only what source revisions you'll need. It
may be that when you go to construct the branch, the processing will be some
streamy thing that scans over the inputs and figures out "for file
/A/D/G/pi, I'll get it from revision R."

> X.Y was committed in Subversion revision N, and that later on, X.(Y+1)
> was committed in revision M. This range, N->M, is the range of
> possible Subversion revisions from which branch B can be copied, for
> this file.

Right.

> Obviously, we can't have this table by the time we're done parsing
> /A/D/G/pi, because we won't have any Subversion commits grouped until
> we've parsed *all* the RCS files. We also can't accumulate the
> information in memory by making a pass over 'cvs2svn-data.s-revs',
> because there might be an arbitrary amount of time and data separating
> X.Y from X.(Y+1), for a given file.

If you lose the file associations, and only keep the range sets for each
branch, then you might be able to keep this in memory.

(but the question about restartable passes will arise; if you compute info
in pass 3, to be used in pass 4, then you break the ability to rerun cvs2svn
starting at pass 4)
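
A minimal sketch of the "range sets only" idea, assuming Python. The names
range_sets and note_range are hypothetical and are not in cvs2svn.py, and
treating the bounds as inclusive is also an assumption. Each branch maps to
a count of files per revision range, so memory grows with the number of
distinct ranges rather than with the number of files:

```python
from collections import Counter, defaultdict

# Hypothetical structure, not part of cvs2svn.py:
# branch name -> Counter mapping (first_rev, last_rev) -> number of files
range_sets = defaultdict(Counter)

def note_range(branch, first_rev, last_rev):
    """Record that one more file could be copied onto `branch` from any
    Subversion revision in first_rev..last_rev (inclusive, by assumption).
    The file's path is deliberately not stored."""
    range_sets[branch][(first_rev, last_rev)] += 1

# Three files on branch "B": two share a range, one has a later range.
note_range("B", 103, 106)
note_range("B", 103, 106)
note_range("B", 107, 110)
```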

> My solution to this is to use Python's `anydbm' interface as a backing
> store for what would otherwise be an in-memory operation. Pass 4
> would now become

I do not see a need for anydbm, for a couple reasons:

1) I'm not sure you need the file associations
2) I don't see the need for data lookup based on the filename

e.g. a flat, streamy file could probably work here, too
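
Roughly what a flat, streamy file might look like, as a sketch only. The
record layout ("branch first_rev last_rev", one per line), the file name
and the helper names are all assumptions made for illustration. It relies
on branch names containing no whitespace:

```python
import os
import tempfile

def append_range(fp, branch, first_rev, last_rev):
    """Append one record; no lookup by filename is ever needed."""
    fp.write("%s %d %d\n" % (branch, first_rev, last_rev))

def read_ranges(path):
    """Stream the records back one at a time, in the order written."""
    with open(path) as fp:
        for line in fp:
            branch, first_rev, last_rev = line.split()
            yield branch, int(first_rev), int(last_rev)

# Hypothetical file name, written to a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "branch-ranges.txt")
with open(path, "w") as fp:
    append_range(fp, "B", 103, 106)
    append_range(fp, "RELEASE_1", 107, 110)

records = list(read_ranges(path))
```

A later pass can sort or scan this file without holding it all in memory.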

>...
> As we're running over cvs2svn-data.s-revs, we also remember every
> unique branch name. (It's okay to hold all branch names in memory;

Except for the restartable thing. It may be that at the end of a pass, you
dump the memory to a disk file. If you start at a later pass, then you can
simply restore from that disk file.
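
One way the dump-and-restore step could look, sketched with Python's
pickle module. The function names, the state layout and the file name are
hypothetical; the only point is that a pass writes its in-memory state on
exit and a later pass reads it back when started directly:

```python
import os
import pickle
import tempfile

def save_pass_state(path, state):
    """Write the in-memory state of a finished pass to disk."""
    with open(path, "wb") as fp:
        pickle.dump(state, fp)

def load_pass_state(path):
    """Restore the state when rerunning from a later pass."""
    with open(path, "rb") as fp:
        return pickle.load(fp)

# Hypothetical file name and state layout.
state_path = os.path.join(tempfile.mkdtemp(), "pass3-state.pickle")
save_pass_state(state_path, {"branches": ["B", "RELEASE_1"]})
restored = load_pass_state(state_path)
```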

> even the nastiest CVS repository will only have a few thousand
> branches. We just can't hold every file path or every revision in
> memory.)

Right about every file path. I'm not sure what you have in mind with "every
revision", however.

> At the end of this pass, we have every branch name in memory, and we
> have a dbm file for each branch, indicating, for each filepath in the
> repository, the range of Subversion revisions from which that branch
> *could* be copied.
>
> Now we just have to loop over the branches, finding the "ideal"
> Subversion revision, that is, the revision which if used to create the
> branch, will necessitate the smallest number of manual secondary
> copies, perhaps even zero.

Yup, and this algorithm is independent of the file names, which is why I
suggest losing that part of the data structure and the need for the dbm.
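
To illustrate why file names are not needed, here is a sketch of picking
the revision covered by the most files, given only (first_rev, last_rev,
count) triples for one branch. The function name and the inclusive bounds
are assumptions; this is one possible sweep over the range endpoints, not
the code in cvs2svn.py:

```python
def best_copy_revision(ranges):
    """Return (revision, files_covered) for the revision that falls inside
    the most ranges, weighted by count. Ties go to the earliest revision.
    Returns (None, 0) for empty input."""
    deltas = {}
    for first_rev, last_rev, count in ranges:
        # Coverage rises at first_rev and drops just after last_rev.
        deltas[first_rev] = deltas.get(first_rev, 0) + count
        deltas[last_rev + 1] = deltas.get(last_rev + 1, 0) - count

    best_rev, best_cover, cover = None, 0, 0
    for rev in sorted(deltas):
        cover += deltas[rev]
        if cover > best_cover:
            best_rev, best_cover = rev, cover
    return best_rev, best_cover

# Made-up numbers: 1000 files copyable from 100..105, 10 from 103..110,
# and 10 more from 107..110.
result = best_copy_revision([(100, 105, 1000), (103, 110, 10), (107, 110, 10)])
```

Every file whose range does not include the chosen revision would then
need its own secondary copy.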

>...
> I've hand-waved on some details, of course, such as exact information
> recorded in the .commits file. Or: sometimes the same name is a tag in
> one place but a branch in another, and Subversion has to correctly
> split such trees between /tags and /branches. Etc. But I hope the
> general idea is clear here.

Yup.

> Greg Stein also described a two-pass algorithm for discovering the
> best-copy-revision during a phone call; I'm embarrassed to say I don't
> remember it well enough to describe it here :-), but it wasn't
> entirely dissimilar to the above.

Your algorithm increases the count in all appropriate ranges for a given
file. In mine, a file would only go into one range. This implies that the
algorithm might select the "wrong" one (it went into a range referring to 20
items rather than the big 1000 item range). The second pass would shift
items into the larger range. Your algorithm does it in one pass.

It should also be said that the ideal branching would account for directory
copies. In the following scenario, I'll show how the proposed algorithm
could perform the wrong branch assembly operations. I think we should stick
with the algorithm, though, as it will *generally* work well, but I believe
we should also consider how the algorithm could be replaced by somebody
with way too much time on their hands to correct it :-)

Let's say that you isolated the branch creation down to two source
revisions:

  (A) rev 103, picking up 1000 items
  (B) rev 107, picking up 10 items

Let's also say that /trunk/some/dir is a directory with 11 items in it --
one item from bucket (A) (named "fname0") and the other ten ("fname1"
through "fname10") are all (B). The algorithm would perform the following
operations:

  svn cp /trunk@103 /branches/BRANCH
  svn cp /trunk/some/dir/fname1@107 /branches/BRANCH/some/dir
  ...
  svn cp /trunk/some/dir/fname10@107 /branches/BRANCH/some/dir

But the ideal behavior is:

  svn cp /trunk@103 /branches/BRANCH
  svn cp /trunk/some/dir@107 /branches/BRANCH/some/dir
  svn cp /trunk/some/dir/fname0@103 /branches/BRANCH/some/dir

Beats the crap out of me how to do *that* algorithm :-), but I wanted to
describe the scenario so that we can document the potential occurrence.
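
This is not the general algorithm, but the cost comparison in the scenario
can at least be sketched for a single flat directory (no subdirectories).
The function name and return format are hypothetical. It counts the copies
needed to fix up items one by one and compares that against one directory
copy plus fix-ups for whatever the directory copy gets wrong:

```python
from collections import Counter

def plan_directory(base_rev, item_revs):
    """Return (strategy, copy_count) for one flat directory.

    base_rev:  revision that the enclosing copy already brought in
    item_revs: dict of item name -> source revision that item needs
    """
    if not item_revs:
        return ("copy-items", 0)
    # Option 1: copy every item that the enclosing copy got wrong.
    per_item = sum(1 for rev in item_revs.values() if rev != base_rev)
    # Option 2: copy the whole directory at its most common revision,
    # then copy the items that still differ.
    dir_rev, hits = Counter(item_revs.values()).most_common(1)[0]
    whole_dir = 1 + (len(item_revs) - hits)
    if dir_rev != base_rev and whole_dir < per_item:
        return ("copy-dir@%d" % dir_rev, whole_dir)
    return ("copy-items", per_item)

# The scenario above: fname0 from rev 103, fname1..fname10 from rev 107.
items = {"fname0": 103}
items.update(("fname%d" % i, 107) for i in range(1, 11))
plan = plan_directory(103, items)
```

A real solution would have to apply this kind of comparison across nested
directories, which is the hard part.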

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
Received on Sat Mar 8 00:47:51 2003

This is an archived mail posted to the Subversion Dev mailing list.
