On Mon, Apr 16, 2001 at 02:13:50PM -0700, Jason Molenda wrote:
> The sort y'all are talking about is necessary for grouping checkins
> into identifiable changesets, right?
> In which case, couldn't you
> divide up the search space by time (that is, order your bins by time)?
> e.g. you'd have the 2001-Jan bin, the 2001-Feb bin, etc., and look for
> groupings (i.e. sort) just checkins in those individual time slices.
> Maybe a little fuzz on either side to detect a checkin that spans two
> time slices.
When you're sorting the information into bins, you have no concept of a
group at that point, so you cannot see something spanning bins. You would
need to detect the case in the next step, where you're identifying groups.
But then you have the nasty situation that your changeset is in two bins.
Much nicer to sort into bins such that a particular changeset is known to be
within a single bin.
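
A minimal sketch of that kind of binning, in Python. It assumes a changeset can be keyed by author plus log message; that key is an assumption for illustration, and the names bin_for and sort_into_bins are hypothetical. Because the bin is chosen from the key rather than from the timestamp, every checkin of a given changeset lands in the same bin, and each bin can then be sorted on its own.

```python
import hashlib
from collections import defaultdict


def bin_for(author, log_message, num_bins):
    """Pick a bin from the (assumed) changeset key, not from the time."""
    key = (author + "\0" + log_message).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_bins


def sort_into_bins(checkins, num_bins=16):
    """checkins: iterable of (author, log_message, timestamp, path) tuples."""
    bins = defaultdict(list)
    for checkin in checkins:
        author, log_message, _timestamp, _path = checkin
        bins[bin_for(author, log_message, num_bins)].append(checkin)
    # Each bin is small enough to sort independently; sorting by key and
    # then time puts the checkins of one changeset next to each other.
    for members in bins.values():
        members.sort(key=lambda c: (c[0], c[1], c[2]))
    return bins
```

Identifying groups is then a pass over each sorted bin, with no need to look across bins.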
But all of this is probably moot. It is premised on gsort rolling over on a
file hundreds of megabytes in size. 1) we don't know what final log sizes
will be for large repositories, 2) we don't know if gsort truly barfs (in
fact, jimb just posted that he doesn't think it will).
Of all the points to discuss, this is probably the last one :-) (certainly,
the one to talk about after we get timing info on the first draft of the
> As an aside, if you want to try to import SourceForge or RH's devo,
> I think it'd be good to do something useful in the face of a corrupt
> RCS file. This isn't important for the subversion cvs2svn, but
> larger, longer-lived repositories are almost guaranteed to have
> some file corruption in them.
Absolutely. I'm thinking that we could note the problem and skip the file.
Once the initial scan is complete, the program would stop. The user could
then go back and correct errors and run the scan on that one file. Then you
go ahead and complete the rest of the conversion process.
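
Roughly, the note-and-skip scan could look like the sketch below. Everything named here is hypothetical: parse_rcs_file stands in for whatever the RCS parser ends up being, and RCSParseError is an assumed exception type, not something ViewCVS defines today.

```python
import os


class RCSParseError(Exception):
    """Assumed exception raised by the parser for a corrupt ,v file."""


def scan_repository(root, parse_rcs_file):
    """Parse every ,v file under root, noting failures instead of aborting.

    Returns (good, bad): good is a list of (path, parsed_result) pairs,
    bad is a list of (path, error_message) pairs for files that were skipped.
    """
    good, bad = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(",v"):
                continue
            path = os.path.join(dirpath, name)
            try:
                good.append((path, parse_rcs_file(path)))
            except RCSParseError as exc:
                # Note the problem and skip the file; keep scanning.
                bad.append((path, str(exc)))
    return good, bad
```

If bad is non-empty once the scan finishes, the program would report those paths and stop before the later conversion steps; after fixing a file, only that path needs to go through parse_rcs_file again.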
I'm also hoping to characterize some "typical" corruption using the
SourceForge repositories. I can then bullet-proof against those. As I
mentioned, many of those have been moved from elsewhere, so they could have
many years of info in them (and, thus, lots of time to have become
corrupted). The RedHat trees are also old, so there is potential there.
Using the "gnuplot" tree from SF, I already found some weird ,v files. I
haven't updated the RCS file parser yet in ViewCVS, but I did apply some
patches to it to watch out for certain types of warnings from "co" and
"rlog". The parser *does* need to be updated, and it is also what we'll use
for cvs2svn, so everything is all Happy.
Greg Stein, http://www.lyra.org/
Received on Sat Oct 21 14:36:28 2006