I remember that RCS was "librarified" for use in CVS. Perhaps you
could use that library to parse the RCS files. I don't know how rich
its interfaces are, though.
It seems to me that the process should be something like:
1) Extract the log from every ,v file in every directory in the CVS
repository.
2) Sort all those entries by commit time, preserving the filename,
revision, and log entry.
There's no way to avoid building this huge sorted list if you want
to be able to recognize commits made across several directories,
but it'll be big. If you don't want to keep it all in memory, you
could put the entries in any database that supports in-order
traversal. Berkeley DB does, and it has a Perl interface.
3) Working your way from oldest to youngest, look for entries that
occur at approximately the same time and have approximately the
same log message; each such group constitutes a single commit.
Figuring out exactly what "approximately" means will be an
interesting challenge. I think a time fuzz of at least twenty
minutes would be good, or however long a commit can take. Your log
entry fuzz should refuse to draw any comparison between trivial log
entries (empty or very short), to avoid grouping things into
commits that don't belong together. It should probably ignore
whitespace differences and the like. The cvs2cl script has logic
for this that people like.
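
To make steps 2 and 3 concrete, here is a rough sketch in Python. It
is only an illustration under some assumptions of mine: the Entry
record, the names TIME_FUZZ and MIN_LOG_LEN, and the ten-character
cutoff for "trivial" logs are all made up for the example. It also
treats "approximately the same log message" as "identical after
collapsing whitespace", which is cruder than what cvs2cl does, and it
assumes a file never appears twice in one commit.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    time: int       # commit time, seconds since the epoch
    path: str       # path of the ,v file the entry came from
    revision: str   # RCS revision number, e.g. "1.4"
    log: str        # log message for that revision


TIME_FUZZ = 20 * 60   # seconds; the twenty-minute figure suggested above
MIN_LOG_LEN = 10      # hypothetical cutoff below which a log is "trivial"


def normalize(log):
    """Collapse runs of whitespace so formatting differences are ignored."""
    return " ".join(log.split())


def group_commits(entries):
    """Group per-file log entries into commits, oldest commit first."""
    groups = []
    latest = {}   # normalized log -> most recent group carrying that log
    for e in sorted(entries, key=lambda e: e.time):
        key = normalize(e.log)
        trivial = len(key) < MIN_LOG_LEN
        g = latest.get(key)
        if (g is not None and not trivial
                # the fuzz window slides: compare with the group's last entry
                and e.time - g[-1].time <= TIME_FUZZ
                # assumption: one commit touches a given file at most once
                and all(x.path != e.path for x in g)):
            g.append(e)
        else:
            g = [e]
            groups.append(g)
            if not trivial:
                latest[key] = g   # trivial logs never attract later entries
    return groups
```

Measuring the fuzz from the last entry in a group, rather than the
first, lets a slow commit keep extending its own window. Whether that
is the right choice is part of the "interesting challenge".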
I'd suggest operating directly on the Subversion repository, using the
FS library. It'll be faster, and you'll have fewer components to
introduce extraneous errors.
You'll need to recognize branches by comparing branch tags.
Challenging.
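
As a starting point for the branch part, here is a small Python sketch
of mapping a revision to its branch tag within one file. It relies on
the way CVS stores a branch tag as a "magic" revision number with a 0
in the second-to-last position (1.4.0.2 names branch 1.4.2), and on
vendor branches being tagged with a plain odd-length number such as
1.1.1. The function names and the symbols dictionary (tag name ->
revision, as read from one ,v file's symbol list) are my own
inventions for the example.

```python
def branch_of_magic(rev):
    """Return the branch number named by a CVS magic revision, else None.

    "1.4.0.2" -> "1.4.2"; ordinary revisions such as "1.4" give None.
    """
    parts = rev.split(".")
    if len(parts) >= 4 and len(parts) % 2 == 0 and parts[-2] == "0":
        return ".".join(parts[:-2] + parts[-1:])
    return None


def branch_name(revision, symbols):
    """Name the branch a revision lives on, or None for trunk or unknown.

    `symbols` maps tag name -> revision number for a single ,v file.
    """
    parts = revision.split(".")
    if len(parts) < 4:
        return None                   # trunk revision such as "1.7"
    prefix = ".".join(parts[:-1])     # "1.4.2.3" sits on branch "1.4.2"
    for tag, rev in symbols.items():
        # match either a magic branch tag or a vendor-style tag ("1.1.1")
        if branch_of_magic(rev) == prefix or rev == prefix:
            return tag
    return None
```

For example, with symbols = {"stable-branch": "1.4.0.2"},
branch_name("1.4.2.3", symbols) gives "stable-branch". The branch
numbers differ from file to file, so it is the tag names, not the
numbers, that you would compare across files to decide that two
revisions belong to the same branch.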
Received on Sat Oct 21 14:36:08 2006