
Re: cvs2svn

From: Karl Fogel <kfogel_at_collab.net>
Date: 2001-04-16 20:54:09 CEST

Greg Stein <gstein@lyra.org> writes:
> At the moment, I believe the timestamp is merely a property on the revision
> (there isn't an official timestamp now, and when asked a while back, jimb
> said "it'll be a property"). With that in mind, we ought to be able to set
> it arbitrarily.

That's what I've been thinking, too. (Verbose way of saying "+1" :-) ).

> *blink* A week? Oh geez, no. I've been hoping to measure the thing using
> hours for large repositories. I'd like to be I/O bound on the process,
> and we're looking at a couple scans over the CVS repository, some log
> writing, and writing the new repository.
> No idea what the time on that might be, though, but I'm hoping for "hours".
> Small and medium repositories ought to be just a few minutes.

Yeah. My instinct is that even doing the huge SourceForge repos
shouldn't take more than a day, if that long.

> I believe the sorting of individual revisions into groups of commits will be
> the slowest part. I'm sure they've optimized GNU sort quite a bit, but I've
> got to believe it will shudder when fed a file hundreds of megabytes in
> length. However, the primary key for that is a (hash, userid, time) tuple.
> We can do a preliminary bin-sort on the hash, using an arbitrary number of
> digits from it. For large repositories, you could end up dividing the
> average log size using three hex digits, which maps to 4096 bins. Your
> 400meg log file is now just a bunch of 100k files. Pump each through
> sort(1). The log scan process can then, effectively, do an insertion sort as
> it reads the N log files for processing.

Grewvy plan.
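
For concreteness, here is a rough Python sketch of the bin-sort idea Greg
describes. It is only an illustration under stated assumptions, not code from
cvs2svn: the record layout (hash, userid, time, path), the tab-separated bin
files, the use of SHA-1, and the names log_hash, write_bins and read_sorted
are all made up for the example. Each bin is sorted in memory here, standing
in for the sort(1) pass. Because the bin name is the leading digits of the
hash, reading the bins in name order yields records in (hash, userid, time)
order overall.

```python
import hashlib
import os


def log_hash(message):
    # Hex digest of the log message; SHA-1 is an arbitrary choice for this sketch.
    return hashlib.sha1(message.encode("utf-8")).hexdigest()


def write_bins(records, workdir, digits=3):
    """Append each (hash, userid, time, path) record to the bin file named
    after the first `digits` hex digits of its hash (3 digits -> 4096 bins).

    Reopening the file for every record is slow but avoids holding
    thousands of file handles open at once.
    """
    prefixes = set()
    for digest, userid, when, path in records:
        prefix = digest[:digits]
        prefixes.add(prefix)
        with open(os.path.join(workdir, prefix + ".log"), "a") as fh:
            fh.write(f"{digest}\t{userid}\t{when}\t{path}\n")
    return sorted(prefixes)


def read_sorted(workdir, prefixes):
    """Yield records in (hash, userid, time) order, one small bin at a time."""
    for prefix in prefixes:
        with open(os.path.join(workdir, prefix + ".log")) as fh:
            rows = [line.rstrip("\n").split("\t", 3) for line in fh]
        # Each bin is small enough to sort in memory (or with sort(1)).
        rows.sort(key=lambda r: (r[0], r[1], int(r[2])))
        for digest, userid, when, path in rows:
            yield digest, userid, int(when), path
```

A consumer can then walk the output of read_sorted and start a new commit
group whenever the hash or userid changes, or the time gap grows too large.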

> [ the bin corresponding to the hash of "no message" will be large :-) ]
> *shrug* ... We'll see. The process will be decomposable into a number of
> steps (using disk-based logs as the saved state), so we can work on them
> individually until each step "feels right" in terms of algorithmic
> complexity (read: maintainability for the conversion script) vs. acceptable
> run time.

Yeah. It's also useful to have that metadata saved in case the
conversion gets interrupted for some reason...
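
A minimal Python sketch of that "disk-based logs as saved state" idea,
assuming each step writes exactly one log file. The run_step helper and the
file naming are hypothetical, not part of any existing script. A step writes
to a temporary file and renames it only on success, so a finished log on disk
means the step can be skipped when the conversion is restarted.

```python
import os


def run_step(name, workdir, func):
    """Run func(tmp_path) unless workdir/<name>.log already exists.

    Returns the path of the finished log either way.
    """
    final = os.path.join(workdir, name + ".log")
    if os.path.exists(final):
        return final  # finished in an earlier run; reuse the saved state
    tmp = final + ".tmp"
    func(tmp)  # the step writes its whole output to the temporary file
    os.replace(tmp, final)  # the rename marks the step as complete
    return final
```

An interrupted step leaves only the .tmp file behind, which the next run
simply overwrites.
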
Received on Sat Oct 21 14:36:28 2006

This is an archived mail posted to the Subversion Dev mailing list.