[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: cvs2svn and branches (long)

From: Marko Macek <Marko.Macek_at_gmx.net>
Date: 2003-03-09 12:05:57 CET

Karl Fogel wrote:
> For a while now, I've been pondering the problem of how cvs2svn should
> map CVS tags/branches to Subversion tags/branches. It's definitely
> time to come out of the closet and discuss this on the list :-).
[...]
> Single File:
> ============
> $ cvs tag -b SMALL_BRANCH A/D/G/pi
> /branches/SMALL_BRANCH/A/D/G/pi
[...]

Agreed. And non-branch should go to /tags/.

> Multiple Files:
> ===============
>
[...]
> $ cvs tag -b MEDIUM_BRANCH A/D/G
[...]
> $ svn mkdir /trunk /branches/MEDIUM_BRANCH
> $ svn mkdir /trunk/A /branches/MEDIUM_BRANCH/A
> $ svn mkdir /trunk/A/D /branches/MEDIUM_BRANCH/A/D
> $ svn cp /trunk/A/D/G /branches/MEDIUM_BRANCH/A/D/G
[...]
Agreed.

> It's also possible to have situations where all but a few files are
> branched (in CVS, maybe the list of files was given explicitly, or
> maybe someone deleted the branch tag from certain files later). In
> such cases, we probably want to copy the highest possible parent
> directory, as usual, then *delete* the unbranched files from the new
> copy, before committing the txn.

In my current code, I decided not to do any deletes,
because it doesn't seem very clean. But it could work.

> All of this implies that before committing anything to Subversion,
> we'll need to gather the full set of CVS commits. (Is that where
> you're heading in your patches, re a "cvs2svn-data.commits" file?).

I was trying to separate the actual commit step that creates the
repository from the heuristics that generate the commit sets.
It is also easier to test/develop things if one can diff the
.commits files.

[...]
> So when in the revision history does a particular branch's creation
> get inserted? To answer this question, we have to find all the
> files/revs on which the branch was created. Each file/rev pair maps
> to a range of Subversion revisions... "But wait! I thought we don't
> know the Subversion revisions yet, because we haven't finished
> deducing the branch creations?"

Renumbering is easy, once you know where to insert the
copy operations. ;)

[...]
> In the first pass, cvs2svn.py doesn't try to deduce any
> branch-creation commits. It just groups 'cvs ci' commits, the same
> way it already does. We already remember symbolic names as we parse
> each RCS file -- that is, we remember what revision each branch is
> rooted in. This information is recorded in cvs2svn-data.*revs.

Yes

> The entry maps file paths to revision ranges
>
> filepath --> (N M)
>
> where N is the Subversion revision corresponding to the CVS revision
> X.Y in which the branch was created for that filepath, and M is the
> Subversion revision corresponding to X.(Y+1).

I suggest we ignore the memory usage problem for now because practice
it's not a problem except for extremely large repositories. We should
get the algorithm right first.

> At the end of this pass, we have every branch name in memory, and we
> have a dbm file for each branch, indicating, for each filepath in the
> repository, the range of Subversion revisions from which that branch
> *could* be copied.
>
> Now we just have to loop over the branches, finding the "ideal"
> Subversion revision, that is, the revision which if used to create the
> branch, will necessitate the smallest number of manual secondary
> copies, perhaps even zero.
>
> My off-the-top-of-my-head suggestion for that algorithm would be:
>
> Map each branch name to a list of pairs, each pair mapping a
> Subversion revision range to the number of files available to the
> branch from those revisions:
>
> BRANCH_NAME ==> [ ((N, M), COUNT), ...]
>
> Where (N, M) is the revision range, and COUNT is the number of files
> that could be gotten from that range.

> The list of pairs starts out empty. To fill it, iterate over all the
> entries in cvs2svn-copies/BRANCH_NAME.dbm. For the first file, add
>
> ((N, M), 1)
>
> For the next file, increment the count if N and M are the same as the
> previous entry, or else splice it in in the obvious way, keeping the
> list of pairs ordered by N and M, such that a lower N comes first, and
> within a given N, a higher M comes first (i.e., ranges are ordered,
> with wider ranges earlier).

One thing to be aware of: for nested branches or tags on branches, the
decision of when to copy the branch will affect the revisions above.
It may be necessary to run the algorithm several times, for each level
of tags. what complicates things is that tags (and therefore branches)
can tag files of different branches.

> Notice that a given file may be counted in more than one pair; the
> total of all COUNTs may be much higher than the total number of files
> recorded in the .dbm file.
>
> When all files have been processed, run over the list of pairs,
> choosing one with the highest available count. We'll use a revision
> from that range to make the branch copy, and then deal with missing
> items and wrong-revision items by manually deleting and/or copying.

You should also track the number of files which are in the revision but
must not be copied to the branch.

> (If these lists get too big to store in memory, we can write them out
> to cvs2svn-copies/BRANCH_NAME.ranges or something.)
>
> Of course, once we start inserting these new branch-creation revisions
> into the revision history, the revision ranges recorded in the .dbm
> files will be wrong :-). So before we can commit anything, we have to
> run over cvs2svn-data.s-revs again, now that we have the branch
> creation information, and write out 'cvs2svn-data.commits or whatever,
> which includes lines describing the branch commits. This means that
> each time we insert a branch commit, we increment an accumulator, and
> as we're writing out *next* branch commit, we add the current
> accumulator value to the "source revision" for the copy. Otherwise,
> we'll get further and further "off" as we go.
>
> When it's all done, we have a .commits file. The final pass runs over
> that file, making commits to the Subversion repository.
>
> I've hand-waved on some details, of course, such as exact information
> recorded in the .commits file. Or: sometimes the same name is tag in
> one place but a branch in another, and Subversion has to correctly
> split such trees between /tags and /branches. Etc. But I hope the
> general idea is clear here.

> Greg Stein also described a two-pass algorithm for discovering the
> best-copy-revision during a phone call; I'm embarrassed to say I don't
> remember it well enough to describe it here :-), but it wasn't
> entirely dissimilar to the above.
>
> So Marko: am I on the right page with all this? And is this stuff
> you've already started, or were you going about things differently?

I actually had something like the above, I'll see if I can find a
version that is in reasonably workable state. It does it all in memory
though.

But my current algorithm works differently, mainly because I wanted to
avoid delete operations when doing a tag copy and also reduce memory
usage. And because it is simpler (but in current implementation, pretty
slow).

It goes like this:

pass4: go over the .s-revs file and make a set of CVS (f,r) for each
tag/branch. Also separates tags from branches and determines the branch
/tag dependencies (for now it incorrectly assumes it is a tree). This is
kept in memory for now.

pass5: Generates a commits file like exactly like current cvs2svn. This
should probably be careful not to combine files/revs that are only
partially tagged, but it isn't yet.

pass6: Read the commits file and keep track of the current repository
state. After each commit, compare the state to all tags. We then have
several cases:

a) perfect match. Mark this tag as having a possible perfect copy.
This handles things like files being and then removed.

b) partial match. We have a match which has some files, but no extra
files. Remember the match if it's better than previous partial match.

c) no match. If there is no partial or perfect match, the tag will be
copied file by file.

pass7: Read the commits file again, now doing actual SVN commits.

a) Once a branch has been perfectly/partially copied, any extra files
are copied file by file.

b) Branch must be copied before there is any commit to it.

Right now it's pretty slow, because the matching is all brute force.
Caching some matching state would make it much faster.

Memory usage could be optimized by splitting pass6 and doing pass5 and
6.1 for each tag separately and store the results to a file.

Also, it should be resonably easy (but slower) to organize the files
into a directory tree instead of a list). This would make it possible
to handle subtree copies instead failing back to file-by-file copies.

I haven't actually decided which algorithm I prefer.
Initially, the second one would be made of simpler building
blocks and there easier to debug and more bulletproof (like the
current file-by-file is), but it didn't exactly turn out to be.
I'd still prefer not to do any deletes when copying branches/tags.
I'm also not exactly a fan of using dbm (text files would be more
debuggable).

Now on to some more technical CVS issues:

1. If one adds a file on the trunk and then adds the file with the same
name on the branch, the CVS revisions will be 1.1 for trunk, and 1.1.1.1
for branch. This looks like the file was branched, but actually it
wasn't. (This needs to be added in the test suite).

2. cvs import-ed repositories. When a cvs repository has been created
with CVS import, all the tags on unmodified files will be created for
revision 1.1.1.1, not revision 1.1. This is a major complication. My
code currently has a hack to remap version 1.1.1.1 to 1.1 and skip
vendor branch conversion, but this could be wrong if vendor branch was
imported later. (I have a test repository where someone imported his own
working copy... :)

Regards,
MArk

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Mar 9 11:58:06 2003

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.