Greg Stein wrote:
> Absolutely. First version will just produce a monster log. If gsort can't
> handle it, then I'll do the bins. Mostly, that was me thinking aloud, "wow.
> do we have a solution in case gsort takes five days to sort that file?" My
> thought experiment leads me to answer, "yes." The best that gsort can do is
> N log N, but we can set things up to reduce N for people, so it would become
> M * (N/M log N/M).
The sort y'all are talking about is necessary for grouping checkins
into identifiable changesets, right? In which case, couldn't you
divide up the search space by time (that is, order your bins by time)?
e.g. you'd have the 2001-Jan bin, the 2001-Feb bin, etc., and look for
groupings (i.e. sort) only among the checkins in each time slice.
Maybe a little fuzz on either side to detect a checkin that spans two
time slices.
Maybe everyone had already understood that and I'm restating the obvious. :)
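For what it's worth, here is a rough Python sketch of the time-slice
idea, not anything from cvs2svn itself. The Checkin record, the bin
width, the five-minute fuzz, and the rule "same author and log message
means same changeset" are all assumptions made up for illustration.
Each bin is sorted on its own, and the grouping state is carried
across bin boundaries so a checkin that spans two slices still lands
in one changeset.

```python
from collections import defaultdict
from typing import NamedTuple


class Checkin(NamedTuple):
    """Hypothetical per-file commit record; time comes first so tuples sort by time."""
    time: int      # seconds since the epoch
    author: str
    log: str
    path: str


BIN_SECONDS = 30 * 24 * 3600   # assumed bin width: roughly one month
FUZZ_SECONDS = 300             # assumed max gap between checkins of one changeset


def bin_by_time(checkins):
    """Distribute checkins into time-ordered bins without sorting them."""
    bins = defaultdict(list)
    for c in checkins:
        bins[c.time // BIN_SECONDS].append(c)
    return bins


def group_changesets(checkins):
    """Sort each bin separately, then group by (author, log) within the fuzz window."""
    bins = bin_by_time(checkins)
    changesets = []
    latest = {}   # (author, log) -> most recent changeset with that key
    for bin_key in sorted(bins):
        # Only this bin is sorted; earlier bins are already behind us in time.
        for c in sorted(bins[bin_key]):
            key = (c.author, c.log)
            current = latest.get(key)
            if current is not None and c.time - current[-1].time <= FUZZ_SECONDS:
                current.append(c)          # close enough in time: same changeset
            else:
                current = [c]              # too far apart (or first seen): new changeset
                latest[key] = current
                changesets.append(current)
    return changesets
```

Because `latest` survives from one bin to the next, two checkins ten
seconds either side of a bin boundary are still grouped together,
which is the "little fuzz on either side" case.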
The sort shouldn't be too bad, but coming up with that list of changes
for an entire repository is going to be a bear.
It may be useful to look at the Perforce import tool. They have a
list of problems/limitations with their Bourne shell/awk importer
script here:
http://www.perforce.com/perforce/technotes/note031.html
Their script is on their ftp site, but with no copyright notices.
You can find a link to it here:
http://www.perforce.com/perforce/loadsupp.html#conv
I would hesitate to look at the source, though, if you intend to
work on cvs2svn (just my own paranoid opinion). I thought there
was a Perl import script as well, but I didn't find it after a few
minutes of looking around.
As an aside, if you want to try to import SourceForge or RH's devo,
I think it'd be good to do something useful in the face of a corrupt
RCS file. This isn't important for the Subversion cvs2svn, but
larger, longer-lived repositories are almost guaranteed to have
some file corruption in them.
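One way to "do something useful" might be to record the bad files and
keep going instead of aborting the whole conversion. The sketch below
is only an illustration: check_rcs_file and scan_repository are
made-up names, and the check itself (an RCS file should begin with the
"head" keyword) is a cheap stand-in for a real parse.

```python
import os


class CorruptRCSFile(Exception):
    """Raised by the hypothetical checker when a ,v file looks damaged."""


def check_rcs_file(path):
    """Cheap sanity check (an assumption, not a full parse):
    the admin section of an RCS file starts with the 'head' keyword."""
    with open(path, "rb") as f:
        start = f.read(4)
    if start != b"head":
        raise CorruptRCSFile("%s: does not start with 'head'" % path)


def scan_repository(root, check=check_rcs_file):
    """Walk a CVS repository tree, returning (good_paths, bad_entries).

    Corrupt or unreadable files are collected with their error message
    so the conversion can continue and report them at the end.
    """
    good, bad = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(",v"):
                continue               # not an RCS file
            path = os.path.join(dirpath, name)
            try:
                check(path)
            except (CorruptRCSFile, OSError) as err:
                bad.append((path, str(err)))
            else:
                good.append(path)
    return good, bad
```

A real converter would presumably pass its own parser in as `check`
and print the `bad` list as a report once the run finishes.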
Jason
Received on Sat Oct 21 14:36:28 2006