I got pissed off enough at CVS today that I figured I'd stop lurking and 
do something useful towards getting rid of it.  (I work on GCC constantly, 
and the GCC development policy is now to create branches rather than keep 
large experimental things on the mainline but disabled.  So I have about 
eight or nine trees checked out, and of course updating/working with each 
one and managing merges/etc. is just not fun.)
To that end, and knowing python pretty well, I took the absolute latest 
cvs2svn, and made it actually able to convert the mainline branch of gcc's 
cvs repository (doing all the branches is actually only a few more lines 
of code, to walk the descendant branches, do the right copies, etc).
Most of it was fighting SWIG and various bugs in the version needed (I was 
using 1.3.10) that appear to cause it to skip typemaps with the ignore 
keyword (which is really supposed to ignore the argument passed in) and 
not generate the code for them.  Which, of course, causes you to end up 
with uninitialized variables being passed to critical functions (say, 
svn_fs_apply_textdelta).
Using a kludgy workaround, i was able to call the functions i needed to 
work with the repository.
I didn't see a point in making cvs2svn create an xml file in this case, 
since the XML file for GCC's repository would be huge, and i would 
just be taking it, turning around, and using it to build the repository.
Which of course, because of Berkeley DB's log file fun, fills the disk at 
over 10 meg a second (no joke here, I have a script running that just does 
"db_archive | xargs rm -f" in a loop and sleeps for 2 seconds).
Easier to build it directly, in this case.  Nothing I've done, however, 
stops you from quite easily producing an XML file instead (i.e. it only 
commits transactions/works with the database in the Commit class).
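
For anyone who wants the log-pruning loop without the shell one-liner, here 
is a rough Python sketch of the same idea.  The function names and the 
db_home argument are made up for illustration; the only real thing it 
relies on is that "db_archive -h <home>" prints the log files that are no 
longer in use, one per line, relative to the environment directory.

```python
import os
import subprocess
import time


def list_removable_logs(db_home):
    """Ask db_archive which log files are no longer needed (names relative to db_home)."""
    out = subprocess.run(
        ["db_archive", "-h", db_home],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def prune_once(db_home, lister=list_removable_logs):
    """Delete every log file the lister reports; return the names actually removed."""
    removed = []
    for name in lister(db_home):
        try:
            os.remove(os.path.join(db_home, name))
            removed.append(name)
        except FileNotFoundError:
            # Already gone (e.g. another pruner got there first); not an error.
            pass
    return removed


def prune_forever(db_home, interval=2.0):
    """Same shape as the shell loop: prune, sleep, repeat until killed."""
    while True:
        prune_once(db_home)
        time.sleep(interval)
```

You would start prune_forever() on the repository's db directory in a 
separate process before kicking off the conversion.  The lister argument is 
only there so the delete logic can be exercised without a real Berkeley DB 
environment.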
I ripped out the CVSParser from viewcvs to get at the branch info/revision 
log messages/etc, without having to complete the BuildRevision class, 
which seemed to be modeled after the extract_revision function in there.
However, extracting revisions in python, in any case, is way too slow.
It's just not bearable if you do them in order, since every extraction 
replays the diff commands from scratch and the whole thing is O(n^2).
The time is completely dominated by simply processing the diff 
commands to get to a given revision.  Particularly when you have 
changelogs with 13000 revisions in them, as we do (and since the logs get 
rotated every so often, it's worse).
I currently have it extract a given revision by piping to "co 
-q -p<revision> <fname>", which is much faster.
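
A minimal sketch of that pipe, in case it helps anyone reproduce it.  This 
is not the code from my tree; the function names are made up, and it 
assumes the RCS "co" binary is on the PATH and that you pass it the path to 
the ,v file.

```python
import subprocess


def co_command(revision, rcs_path):
    """Build the argv for printing one revision quietly to stdout."""
    # -q suppresses diagnostics, -p<rev> prints that revision instead of
    # writing a working file.
    return ["co", "-q", "-p" + revision, rcs_path]


def checkout_revision(revision, rcs_path):
    """Return the fulltext of the given revision as bytes."""
    result = subprocess.run(
        co_command(revision, rcs_path),
        check=True, capture_output=True,
    )
    return result.stdout
```

Returning bytes rather than text is deliberate here, since the repository 
contents are not guaranteed to be in any one encoding.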
The only way it's going to run at a reasonable speed in python is to hack 
up the revision extractor to do them in order (since this is how we need 
to commit them), by reverse applying the diff commands and keeping the 
fulltext of the previous revision we processed.  (Keep a 20-30 item cache 
if you think it's necessary to save memory, reparsing/reextracting a new 
starting point fulltext if it's not in the fulltext cache.)
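
To make the "keep the previous fulltext" idea concrete, here is a small 
sketch of the building block: applying one RCS-style delta (the "aN M" / 
"dN M" commands, with line numbers referring to the text being edited) to 
a fulltext you already have in memory.  Note that this is the plain forward 
application, walking from the head fulltext down the trunk's stored deltas, 
not the reverse application described above, and it does no caching.  Both 
function names are hypothetical.

```python
def apply_rcs_delta(text_lines, delta_lines):
    """Apply one RCS delta to a list of lines and return the new list.

    delta_lines holds the commands and their payload lines:
      dN M  -> delete M lines starting at line N (1-based)
      aN M  -> add the next M delta lines after line N
    """
    result = []
    src = 0  # index of the next source line not yet copied or skipped
    i = 0
    while i < len(delta_lines):
        command = delta_lines[i]
        i += 1
        op = command[0]
        start, count = (int(x) for x in command[1:].split())
        if op == "d":
            # Copy everything before the deleted range, then skip the range.
            result.extend(text_lines[src:start - 1])
            src = max(src, start - 1 + count)
        elif op == "a":
            # Copy through line `start`, then emit the added lines.
            result.extend(text_lines[src:start])
            src = max(src, start)
            result.extend(delta_lines[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % command)
    # Whatever the delta never touched is carried over unchanged.
    result.extend(text_lines[src:])
    return result


def walk_from_head(head_lines, deltas):
    """Yield each fulltext in turn, applying every delta exactly once."""
    text = list(head_lines)
    yield text
    for delta in deltas:
        text = apply_rcs_delta(text, delta)
        yield text
```

The point is the cost.  With 13000 revisions, extracting each one 
independently from the head means 13000 * 12999 / 2 = 84,493,500 delta 
applications, versus 12,999 when each fulltext is derived from its 
neighbour.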
So other than a speed issue, it works.  It includes the author/log/date 
attributes for both deletes and commits, creating the directories and 
files as we see them, etc.
The only thing it's not doing is actually walking the branches descending 
from a given revision (though the CVSParser gives us the info), executing 
the right copy commands, voila.
If anyone wants them, I can post the diffs against what's there now.
However, in converting even a single directory of the gcc repository, it 
appears subversion takes up a much larger amount of disk space 
(not counting log files).  Even more than I had figured it would.
Processing only the mainline revisions of the cp language 
subdirectory of gcc, gives a subversion database of >200 meg.
This is starting from a freshly created repo.
By comparison, all the revisions of all the branches of the cp language 
subdir take up only 40 meg in CVS.
I would estimate that the entire conversion of that subdir, branches 
included, would take about 200-205 meg (based on the number of branches 
there are/how many changes there are on the branches).
Before someone asks, i'm not doing anything different than what putfile.py 
did, and there isn't anything odd about how the repository looks (IE i'm 
absolutely positive i could get the same thing using the command line 
client), and it's definitely storing deltas, rather than fulltext.
The cp subdir doesn't have many branches with significant changes.  In 
fact, it only really has one (a C++ parser rewrite that's ongoing).
Most of the other language subdirs/the main gcc dir/testsuite 
dirs/etc. have many more active branches, with significant amounts of 
changes.
gcc's cvs repository is 626 meg, all told.
Best case, that would put the converted subversion repository at over 3 gig.
Worst case, probably 3.1 or 3.2 gig.
I expected a factor of 2.
A factor of 5 just seems a bit high.
I'm not concerned that much personally, but I figured I'd better mention 
it since it's sure to annoy some portion of subversion's possible user 
base.
As I said, i'm happy to post the cvs2svn.py diffs/binding diffs so people 
can test this stuff out.
GCC's cvs repository (i.e. not a checked out copy, the actual RCS files 
and whatnot) is available by rsync from gcc.gnu.org::gcc-cvs if anyone 
has a hankering to convert a large cvs repository.
Though, seriously, if you have a fast disk and just start it running, 
you'll quickly be out of disk space from the log files, so make sure to 
run db_archive in a script in the background or something.
--Dan
Received on Sat Oct 21 14:37:03 2006