Greg Stein wrote:
> I believe the sorting of individual revisions into groups of commits will be
> the slowest part. I'm sure they've optimized GNU sort quite a bit, but I've
> got to believe it will shudder when fed a file hundreds of megabytes in
> length. However, the primary key for that is a (hash, userid, time) tuple.
> We can do a preliminary bin-sort on the hash, using an arbitrary number of
> digits from it. For large repositories, you could end up dividing the
> average log size using three hex digits, which maps to 4096 bins. Your
> 400meg log file is now just a bunch of 100k files. Pump each through
> sort(1). The log scan process can then, effectively, do an insertion sort as
> it reads the N log files for processing.
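The bucketing scheme Greg describes can be sketched in Python. This is a minimal, hypothetical illustration (record layout and bin width are assumptions, not cvs2svn's actual format): records keyed by a (hash, userid, time) tuple are binned on the first three hex digits of the hash, each small bin is sorted on its own, and the sorted bins are then merged the way the log-scan pass would read the N temporary files. In-memory lists stand in for the per-bin temp files that would be pumped through sort(1).

```python
import heapq
from collections import defaultdict

def bin_key(record):
    # First three hex digits of the log-message hash -> up to 4096 bins.
    return record[0][:3]

def external_sort(records):
    """Bin records by a hash prefix, sort each bin, then merge.

    Each record is a (hash, userid, time) tuple -- the primary key
    described above.  Because the bin key is a prefix of the sort key,
    sorted bins read back in bin-key order come out globally sorted;
    heapq.merge plays the role of the merge over the N sorted files.
    """
    bins = defaultdict(list)
    for rec in records:
        bins[bin_key(rec)].append(rec)
    for b in bins.values():
        b.sort()  # stand-in for sort(1) on one ~100k temp file
    return list(heapq.merge(*(bins[k] for k in sorted(bins))))
```

Note that with prefix binning a plain concatenation of the sorted bins would also suffice; the merge is only needed if the bins were keyed on something other than a prefix of the full sort key.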
Sort with time as the primary key. You want to build the SVN repository in
chronological order anyway. As you traverse the sequence of CVS
revisions in chronological order, group those that match the grouping
heuristic into a single SVN commit.
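A sketch of that chronological grouping pass, under assumed details: the grouping heuristic here (same author, same log message, no gap over `fuzz` seconds between neighboring revisions) is one plausible choice, not necessarily the one cvs2svn uses, and the tuple layout is invented for the example.

```python
def group_commits(revisions, fuzz=300):
    """Group time-sorted CVS revisions into candidate SVN commits.

    revisions: iterable of (time, author, log_hash, path) tuples,
    already sorted chronologically.  A new group starts whenever the
    author or log message changes, or the gap since the previous
    revision exceeds `fuzz` seconds (a hypothetical threshold).
    """
    groups, current = [], []
    for rev in revisions:
        if current and (rev[1] != current[-1][1]      # author changed
                        or rev[2] != current[-1][2]   # log message changed
                        or rev[0] - current[-1][0] > fuzz):
            groups.append(current)
            current = []
        current.append(rev)
    if current:
        groups.append(current)
    return groups
```

Because the input is already in chronological order, each emitted group is a contiguous run, so the SVN commits can be replayed in the order the groups come out.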
Bob Miller K<bob>
kbobsoft software consulting
Received on Sat Oct 21 14:36:28 2006