I got pissed off enough at CVS today that I figured I'd stop lurking and
do something useful towards getting rid of it. (I work on GCC
constantly, and the gcc development policy is now to create branches
rather than keep large experimental things on the mainline but disabled.
So I have about eight or nine trees checked out, and of course
updating/working with each one and managing merges/etc. is just not fun.)
To that end, and knowing python pretty well, I took the absolute latest
cvs2svn, and made it actually able to convert the mainline branch of gcc's
cvs repository (doing all the branches is actually only a few more lines
of code, to walk the descendant branches, do the right copies, etc).
Most of it was fighting swig and various bugs in the version needed (I was
using 1.3.10) that appear to cause it to skip typemaps with the ignore
keyword (which is really supposed to ignore the argument passed in) and
not generate the code for them. Which, of course, causes you to end up
with uninitialized variables being passed to critical functions (say,
svn_fs_apply_textdelta).
Using a kludgy workaround, i was able to call the functions i needed to
work with the repository.
I didn't see a point in making cvs2svn create an xml file in this case,
since the XML file for GCC's repository would be huge, and i would
just be taking it, turning around, and using it to build the repository.
Which of course, because of Berkeley DB's log file fun, fills the disk at
over 10 meg a second (no joke here, I have a script running that just does
db_archive | xargs rm -f in a loop and sleeps for 2 seconds).
Easier to build it directly, in this case. Nothing I've done, however,
stops you from quite easily producing an xml file instead (i.e. it only
commits transactions/works with the database in the Commit class).
I ripped out the CVSParser from viewcvs to get at the branch info/revision
log messages/etc, without having to complete the BuildRevision class,
which seemed to be modeled after the extract_revision function in there.
However, extracting revisions in python, in any case, is way too slow.
It's just not bearable if you do them in order (since it's O(n^2)).
The time is completely dominated by simply processing the diff
commands to get to a given revision. Particularly when you have
changelogs with 13000 revisions in them, as we do (and since the logs get
rotated every so often, it's worse)
I currently have it extract a given revision by piping to "co
-q -p<revision> <fname>", which is much faster.
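A minimal sketch of that pipe, for anyone who wants to try it. The
function names here are hypothetical (not taken from cvs2svn), and it
assumes the RCS "co" binary is on your PATH:

```python
import subprocess


def co_command(revision, rcs_file):
    """Build the argv for checking out one revision to stdout.

    -q keeps co quiet, -p<rev> prints that revision instead of
    writing a working file.
    """
    return ["co", "-q", "-p" + revision, rcs_file]


def extract_revision(revision, rcs_file):
    """Return the fulltext of `revision` of `rcs_file` as bytes."""
    done = subprocess.run(
        co_command(revision, rcs_file),
        check=True,              # raise if co exits non-zero
        stdout=subprocess.PIPE,  # capture the fulltext
    )
    return done.stdout
```

Something like extract_revision("1.5", "ChangeLog,v") would then hand
back the contents of that revision, one co process per call.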
The only way it's going to run at a reasonable speed in python is to hack
up the revision extractor to do them in order (since this is how we need
to commit them), by reverse applying the diff commands, and keeping the
fulltext of the previous revision we processed (keep a 20-30 item cache if
you think it's necessary to save memory, reparsing/reextracting a new
starting point fulltext if it's not in the fulltext cache)
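To make the "keep the previous fulltext" idea concrete, here is a rough
sketch. It is not the cvs2svn or viewcvs code, and the names are made
up. It assumes the deltas have already been parsed out of the ,v file
as lists of lines in RCS "diff -n" form (aN M = add M lines after line
N, dN M = delete M lines starting at line N), ordered from the head
revision backwards along the trunk. Note it yields revisions newest
first, so for committing you would still have to reverse the order or
invert the deltas. With 13000 revisions that is 12999 delta
applications in one pass, versus roughly 84 million if every revision
is rebuilt from the head.

```python
def apply_rcs_delta(text_lines, delta_lines):
    """Apply one RCS delta (a/d commands) to a list of lines."""
    result = []
    src = 0  # index of the next not-yet-copied line of text_lines
    i = 0
    while i < len(delta_lines):
        command = delta_lines[i]
        i += 1
        op = command[0]
        start, count = (int(field) for field in command[1:].split())
        if op == "d":
            # Copy everything before line `start`, then skip `count` lines.
            result.extend(text_lines[src:start - 1])
            src = start - 1 + count
        elif op == "a":
            # Copy everything through line `start`, then add the new lines.
            result.extend(text_lines[src:start])
            src = max(src, start)
            result.extend(delta_lines[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % command)
    result.extend(text_lines[src:])  # whatever is left is unchanged
    return result


def walk_trunk(head_lines, deltas):
    """Yield each trunk revision's fulltext, newest first.

    Each step applies exactly one delta to the previous fulltext, so
    the whole walk is linear in the number of revisions.
    """
    current = list(head_lines)
    yield current
    for delta in deltas:
        current = apply_rcs_delta(current, delta)
        yield current
```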
So other than a speed issue, it works. It includes the author/log/date
attributes, handles both deletes and commits, and creates the
directories/files as we see them, etc.
The only thing it's not doing is actually walking the branches descending
from a given revision (though the CVSParser gives us the info), executing
the right copy commands, voila.
If anyone wants them, I can post the diffs against what's there.
However, in converting even a single directory of the gcc repository, it
appears subversion takes up a much larger amount of disk space
(not counting log files). Even more than I had figured it would.
Processing only the mainline revisions of the cp language
subdirectory of gcc, gives a subversion database of >200 meg.
This is starting from a freshly created repo.
By comparison, all the revisions of all the branches of the cp language
subdir, take up only 40 meg in CVS.
I would estimate that the entire conversion of the cp subdir, branches
included, would take about 200-205 meg (based on the number of branches
there are/how many changes there are on the branches).
Before someone asks, i'm not doing anything different than what putfile.py
did, and there isn't anything odd about how the repository looks (IE i'm
absolutely positive i could get the same thing using the command line
client), and it's definitely storing deltas, rather than fulltext.
The cp subdir doesn't have many branches with significant changes. In
fact, it only really has one (a C++ parser rewrite that's ongoing).
Most of the other language subdirs/the main gcc dir/testsuite
dirs/etc have many more active branches, with significant amounts of
changes.
gcc's cvs repository is 626 meg, all told.
Best case that would put the repository size at over 3 gig.
Worst case, probably 3.1 or 3.2 gig.
I expected a factor of 2.
A factor of 5 just seems a bit high.
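The arithmetic behind those numbers, as a quick sketch. It simply
assumes the whole repository grows by the same factor as the cp subdir
did (40 meg in CVS, 200-205 meg in subversion), which is optimistic
given how many more branches the other directories have:

```python
cvs_cp_meg = 40       # cp subdir in CVS, all branches
cvs_total_meg = 626   # whole gcc CVS repository


def projected_meg(svn_cp_meg):
    """Scale the whole repository by the growth seen on the cp subdir."""
    return cvs_total_meg * svn_cp_meg / cvs_cp_meg


low = projected_meg(200)   # cp subdir converts to 200 meg
high = projected_meg(205)  # cp subdir converts to 205 meg
print("factor %.3f to %.3f" % (200 / cvs_cp_meg, 205 / cvs_cp_meg))
print("projected size: %.0f to %.0f meg" % (low, high))
```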
I'm not concerned that much personally, but I figured I'd better mention
it since it's sure to annoy some portion of subversion's possible user
base.
As I said, i'm happy to post the cvs2svn.py diffs/binding diffs so people
can test this stuff out.
GCC's cvs repository (IE not a checked out copy, the actual RCS files
and whatnot) is available by rsync from gcc.gnu.org::gcc-cvs if anyone
had a hankering to convert a large cvs repository.
Though, seriously, if you have a fast disk, and just start it running,
you'll quickly be out of disk space from the log files, so make sure to
db_archive in a script in the background or something.
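A sketch of such a background script, roughly equivalent to looping on
"db_archive | xargs rm -f" with a 2 second sleep. The function names
are made up, and it assumes Berkeley DB's db_archive utility is on your
PATH and that db_home is the repository's Berkeley DB directory:

```python
import os
import subprocess
import time


def unused_logs(db_home):
    """Ask db_archive which log files are no longer in use.

    -h names the database home, -a asks for absolute pathnames.
    """
    done = subprocess.run(
        ["db_archive", "-a", "-h", db_home],
        check=True, stdout=subprocess.PIPE, text=True,
    )
    return done.stdout.splitlines()


def prune_once(db_home, lister=unused_logs):
    """Remove the log files `lister` reports; return those removed."""
    removed = []
    for path in lister(db_home):
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass  # already gone, nothing to do
    return removed


def prune_forever(db_home, interval=2.0):
    """Keep pruning every `interval` seconds until interrupted."""
    while True:
        prune_once(db_home)
        time.sleep(interval)
```

Run prune_forever() against the db directory from a second shell
while the conversion is going, and kill it when the conversion is done.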
--Dan
Received on Sat Oct 21 14:37:03 2006