
Re: Things I learned making cvs2svn work today

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-02-03 03:03:46 CET

On Sat, Feb 02, 2002 at 03:18:44AM -0500, Daniel Berlin wrote:
> I got pissed off enough at CVS today (I work on GCC constantly, and the
> GCC development policy is now to create branches rather than keep large
> experimental things on the mainline, but disabled. So I have about eight or
> nine trees checked out, and of course, updating/working with each one and
> managing merges/etc. is just not fun)

One word: BLECK!

> that I figured I'd stop lurking and
> do something useful towards getting rid of it.
> To that end, and knowing Python pretty well, I took the absolute latest
> cvs2svn, and made it actually able to convert the mainline branch of GCC's
> CVS repository (doing all the branches is actually only a few more lines
> of code, to walk the descendant branches, do the right copies, etc.).

Woo! I look forward to this.

>...
> Easier to build it directly, in this case. Nothing I've done, however,
> changes it so you couldn't quite easily produce an XML file (i.e. it only
> commits transactions/works with the database in the Commit class)

The intent is to feed it straight into the repository, using the FS/repos
bindings (which sounds like what you did).
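
Roughly, the shape of that is the sketch below. (Illustrative only: the
module/function names assume the svn.fs/svn.repos SWIG Python bindings, and
the exact spellings here are my guess, not the actual cvs2svn.py code.)

    # Rough sketch only -- names assume the svn.fs/svn.repos Python bindings.
    from svn import repos, fs, core

    def commit_one_file(repo_path, path_in_repos, contents):
        repos_ptr = repos.open(repo_path)       # open an existing repository
        fs_ptr = repos.fs(repos_ptr)            # get its filesystem object
        txn = fs.begin_txn(fs_ptr, fs.youngest_rev(fs_ptr))
        root = fs.txn_root(txn)
        fs.make_file(root, path_in_repos)       # create the new file node
        stream = fs.apply_text(root, path_in_repos, None)
        core.svn_stream_write(stream, contents) # write the fulltext
        core.svn_stream_close(stream)
        conflict, new_rev = fs.commit_txn(txn)  # commit -> new revision
        return new_rev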

> I ripped out the CVSParser from viewcvs to get at the branch info/revision
> log messages/etc, without having to complete the BuildRevision class,
> which seemed to be modeled after the extract_revision function in there.

Yup. The "rcsparse" module refers to the one in ViewCVS.

> However, extracting revisions in Python, in any case, is way too slow.
> It's just not bearable if you do them in order (since it's O(n^2)).
> The time is completely dominated by simply processing the diff
> commands to get to a given revision. Particularly when you have
> ChangeLogs with 13,000 revisions in them, as we do (and since the logs get
> rotated every so often, it's worse)

Ouch!

There is a new C++-based ,v parser in the ViewCVS repository, but that will
only give you the delta commands faster. Processing the delta commands won't
be affected :-(

> I currently have it extract a given revision by piping to "co
> -q -p<revision> <fname>", which is much faster.

Yup. I can see that. Good call.
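
(For reference, that fork amounts to something like the sketch below -- a
minimal illustration using Python's subprocess module; the actual script may
well shell out differently.)

    # Minimal sketch: pull one revision's fulltext out of an RCS ,v file by
    # shelling out to "co".  Assumes RCS's co(1) is on the PATH.
    import subprocess

    def checkout_revision(rcs_file, revision):
        # "co -q -p<rev>" prints the requested revision on stdout
        result = subprocess.run(["co", "-q", "-p" + revision, rcs_file],
                                capture_output=True, check=True)
        return result.stdout

    # e.g. text = checkout_revision("gcc/ChangeLog,v", "1.42")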

> The only way it's going to run at a reasonable speed in Python is to hack
> up the revision extractor to do them in order (since this is how we need
> to commit them), by reverse-applying the diff commands and keeping the
> fulltext of the previous revision we processed (keep a 20-30-item cache if
> you think it's necessary to save memory, reparsing/re-extracting a new
> starting-point fulltext if it's not in the fulltext cache)

Forking a call to "co" isn't a big deal. The cvs2svn.py design was intended
to avoid forks if at all possible, on the assumption that manipulating the
,v and the SVN repos directly would be faster. But hey... if it isn't, then
do the darned fork :-)

That said, I would like to work on improving the extraction time of the
Python code.
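
For background, the per-delta work is roughly the sketch below: each RCS
delta is an ed-style script of "aN M" (append) and "dN M" (delete) commands,
and getting revision k from the head means applying k of them in sequence,
which is where the O(n^2) comes from when every revision is extracted from
scratch. (Illustrative only; this is not the ViewCVS or cvs2svn code.)

    def apply_rcs_delta(old_lines, delta_lines):
        """Apply one RCS delta (an ed-style script of 'a'/'d' commands) to a
        list of lines and return the resulting list of lines."""
        out = []
        cursor = 0                    # next line of old_lines not yet copied
        i = 0
        while i < len(delta_lines):
            op = delta_lines[i][0]
            line_no, count = map(int, delta_lines[i][1:].split())
            i += 1
            if op == 'd':
                # copy everything up to the deleted range, then skip it
                out.extend(old_lines[cursor:line_no - 1])
                cursor = line_no - 1 + count
            elif op == 'a':
                # copy through old line 'line_no', then splice in the new lines
                out.extend(old_lines[cursor:line_no])
                cursor = line_no
                out.extend(delta_lines[i:i + count])
                i += count
            else:
                raise ValueError('bad delta command: %r' % delta_lines[i - 1])
        out.extend(old_lines[cursor:])
        return out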

> So other than a speed issue, it works.

But this is the key part :-)

> It includes the author/log/date
> attributes on both deletes and commits, creating the directories and files
> as we see them, etc.

Sweet...

> The only thing it's not doing is actually walking the branches descending
> from a given revision (though the CVSParser gives us the info) and executing
> the right copy commands; voila.

This is going to be a good chunk of work. I've got a design percolating in
my head on how we'll extract the tags and build those in the SVN repository.
But that is round two, after the basic "history import" is functioning.

> If anyone wants them, I can post the diffs to what's there.

Please!

>...
> Processing only the mainline revisions of the cp language
> subdirectory of GCC gives a Subversion database of >200 meg.
>...
> I would estimate that the entire conversion would take about 200-205 meg
> (based on the number of branches there are/how many changes there are on
> the branches)

Euh... is that second number off?

> Before someone asks, I'm not doing anything different from what putfile.py
> did,

putfile.py was put there specifically to show how to insert data into a
repository :-) Glad it was helpful...

>...
> GCC's CVS repository is 626 meg, all told.
> Best case, that would put the repository size at over 3 gig.
> Worst case, probably 3.1 or 3.2 gig.
>
> I expected a factor of 2.
> A factor of 5 just seems a bit high.

There might be some DB tuning possible, e.g. padding sizes, index
creation, etc.

> I'm not concerned that much personally, but I figured I'd better mention it

Me neither (disk is cheap), but the metrics are excellent to know. Thanks!

>...
> As I said, I'm happy to post the cvs2svn.py diffs/binding diffs so people
> can test this stuff out.

Please. I'd like to take a look and start folding the changes back into the
actual repository!

> GCC's CVS repository (i.e. not a checked-out copy, the actual RCS files
> and whatnot) is available by rsync from gcc.gnu.org::gcc-cvs if anyone
> has a hankering to convert a large CVS repository.

ooh!

/me goes to jot down a note...

> Though, seriously, if you do have a fast disk and just start it running,
> you'll quickly run out of disk space from the log files, so make sure to
> run db_archive in a script in the background or something.

Roger that :-)
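
(For anyone else trying this, a background script along these lines ought to
do it -- just a guess at one wiring, assuming a Berkeley DB-backed repository
and the stock db_archive utility on the PATH; "db_archive -d" removes log
files that are no longer needed.)

    # Hypothetical helper: periodically ask Berkeley DB to delete log files
    # that the repository's db environment no longer needs.
    import subprocess
    import time

    def prune_db_logs(db_dir, interval=300):
        while True:
            # -d: remove log files that are no longer needed
            # -h: the database environment (the repository's db/ directory)
            subprocess.call(["db_archive", "-d", "-h", db_dir])
            time.sleep(interval)

    # e.g. prune_db_logs("/path/to/repos/db")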

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org