
Re: cvs2svn

From: Greg Stein <gstein_at_lyra.org>
Date: 2001-04-16 20:40:37 CEST

On Mon, Apr 16, 2001 at 08:43:36AM -0700, Bob Miller wrote:
> Greg Stein wrote:
>...
> Another thing to think about: How will cvs2svn tell the repository,
> "This commit isn't happening now, it happened two years ago in
> March"? And what are the security implications of that?

At the moment, I believe the timestamp is merely a property on the revision
(there isn't an official timestamp now, and when asked a while back, jimb
said "it'll be a property"). With that in mind, we ought to be able to set
it arbitrarily.
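
Something along these lines is what I have in mind (purely a sketch, not
cvs2svn code: it assumes the date ends up as an svn:date revision property,
that the client grows a command along the lines of "svn propset --revprop",
and that the repository's pre-revprop-change hook permits the change):

    # Sketch only: backdate a revision by rewriting its svn:date
    # revision property after the commit has been made.
    import subprocess

    def backdate_revision(repos_url, rev, iso_date):
        # iso_date like "1999-03-01T12:00:00.000000Z"
        subprocess.check_call([
            "svn", "propset", "--revprop", "-r", str(rev),
            "svn:date", iso_date, repos_url,
        ])

    backdate_revision("file:///tmp/converted-repos", 42,
                      "1999-03-01T12:00:00.000000Z")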

> So what's an acceptable amount of time to convert a 9 Gb repository?
> One week? One day?

*blink* A week? Oh geez, no. I've been hoping to measure the thing in hours
for large repositories. I'd like the process to be I/O bound, and we're
looking at a couple of scans over the CVS repository, some log writing, and
writing the new repository.

No idea what the time on that might be, but I'm hoping for "hours".
Small and medium repositories ought to be just a few minutes.

I believe the sorting of individual revisions into groups of commits will be
the slowest part. I'm sure they've optimized GNU sort quite a bit, but I've
got to believe it will shudder when fed a file hundreds of megabytes in
length. However, the primary sort key is a (hash, userid, time) tuple. We
can do a preliminary bin-sort on the hash, using an arbitrary number of its
digits. For a large repository, binning on three hex digits of the hash gives
4096 bins, so your 400meg log file is now just a bunch of 100k files. Pump
each through
sort(1). The log scan process can then, effectively, do an insertion sort as
it reads the N log files for processing.

[ the bin corresponding to the hash of "no message" will be large :-) ]
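
In rough Python, that bin-then-sort step might look like the sketch below
(the record layout, paths, and field order are made up for the example; it
leans on sort(1) being on the PATH):

    # Sketch: split the one big log on the first three hex digits of the
    # message hash, then let sort(1) chew on each small bin. Hypothetical
    # record layout: "<hash> <userid> <epoch-time> <rest>\n".
    import os
    import subprocess

    def bin_and_sort(big_log, bin_dir):
        os.makedirs(bin_dir, exist_ok=True)
        with open(big_log) as src:
            for line in src:
                prefix = line[:3]   # first three hex digits of the hash
                # naive: reopens a file per record; a real pass would keep
                # a bounded cache of open bin files instead
                with open(os.path.join(bin_dir, prefix + ".log"), "a") as out:
                    out.write(line)
        # sort each ~100k bin on the (hash, userid, time) key
        for name in os.listdir(bin_dir):
            path = os.path.join(bin_dir, name)
            subprocess.check_call(
                ["sort", "-k1,1", "-k2,2", "-k3,3n", "-o", path, path])

Since the bins partition on the leading digits of the primary key and each
bin comes back fully sorted, reading the bins in prefix order hands the next
step one globally ordered stream.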

*shrug* ... We'll see. The process will be decomposable into a number of
steps (using disk-based logs as the saved state), so we can work on them
individually until each step "feels right" in terms of algorithmic
complexity (read: maintainability for the conversion script) vs. acceptable
run time.
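
Concretely, I'm picturing a shape something like this (pass names and
intermediate file names below are placeholders, not a design):

    # Illustrative skeleton only: each pass reads the previous pass's
    # on-disk output, so a single step can be rerun and tuned on its own
    # without redoing the whole conversion.
    def scan_rcs_files(cvs_root, out="revs.log"):
        ...  # walk the ,v files and emit one record per revision

    def sort_revisions(inp="revs.log", out="revs.sorted"):
        ...  # the bin-sort described above

    def group_commits(inp="revs.sorted", out="commits.log"):
        ...  # fold records with matching (hash, userid, ~time) into commits

    def build_repository(inp="commits.log", repos="svn-repos"):
        ...  # replay the grouped commits into the new repository

    def convert(cvs_root):
        scan_rcs_files(cvs_root)
        sort_revisions()
        group_commits()
        build_repository()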

> > Oh, and knowing about RCS parsing is a "kind of"
> > requirement (there are both Perl and Python modules to do this, so the bar
> > is lowered for the parsing part).
>
> Don't use the Perl parser. Do read rcsfile(5).

I was referring to the cvsblame.pl script from the Bonsai package. It parses
the ,v file directly (per rcsfile(5)). Similarly, there is blame.py in the
Python world (although I plan to refactor the latter).

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
