And note that the license on cvs2svn doesn't have to be the same as
the license on SVN itself, because it's an independent program on
which the rest of Subversion does not depend. If it helps a lot to
use GPL'd code, such as the librarified RCS in CVS, then that should
be okay.
-K
Jim Blandy <jimb@savonarola.red-bean.com> writes:
> I remember that RCS was "librarified" for use in CVS. Perhaps you
> could use that library to parse the RCS files. I don't know how rich
> its interfaces are, though.
>
> It seems to me that the process should be something like:
>
> 1) Extract the log from every ,v file in every directory in the CVS
> repository.
>
> 2) Sort all those entries by commit time, preserving the filename,
> revision, and log entry.
>
> There's no way to avoid building this huge sorted list, if you want
> to be able to recognize commits made across several directories.
> But it'll be big. If you don't want to keep it all in memory, you
> could certainly put them in any database that supports in-order
> traversal. Berkeley DB does, and it has a Perl interface.
>
> 3) Working your way from oldest to youngest, look at commits that
> occur at approximately the same time that have approximately the
> same log message --- each such group constitutes a single commit.
>
> Figuring out exactly what "approximately" means will be an
> interesting challenge. I think a time fuzz of at least twenty
> minutes would be good, or however long a commit can take. Your log
> entry fuzz should refuse to draw any comparison between trivial log
> entries (empty or very short), to avoid grouping things into
> commits that don't belong together. It should probably ignore
> whitespace differences, etc. Cvs2cl has logic for this that people
> like.
>
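For what it's worth, steps 2 and 3 could be sketched roughly like this.
It is only an illustration, not a proposed implementation: the Entry
record, the MIN_LOG_LEN cutoff, and the rule that a file may appear only
once per commit are my own assumptions, and everything is held in memory
rather than in a database.

```python
from dataclasses import dataclass

TIME_FUZZ = 20 * 60   # seconds; the "at least twenty minutes" suggested above
MIN_LOG_LEN = 10      # hypothetical cutoff for "trivial" log messages


@dataclass
class Entry:
    """One per-file revision, as extracted in step 1 (hypothetical record)."""
    path: str       # file the revision belongs to
    revision: str   # RCS revision number, e.g. "1.4"
    time: int       # commit time, seconds since the epoch
    log: str        # log message as stored in the ,v file


def normalize(log):
    """Collapse whitespace runs so whitespace differences are ignored."""
    return " ".join(log.split())


def group_commits(entries):
    """Group per-file revisions into probable multi-file commits."""
    groups = []
    open_groups = {}  # normalized log -> most recent group with that log
    for e in sorted(entries, key=lambda e: e.time):
        key = normalize(e.log)
        group = open_groups.get(key)
        trivial = len(key) < MIN_LOG_LEN  # never match on empty/short logs
        if (group is not None and not trivial
                # Assumption: measure the fuzz from the group's latest entry.
                and e.time - group[-1].time <= TIME_FUZZ
                # Assumption: one commit touches each file at most once.
                and all(g.path != e.path for g in group)):
            group.append(e)
        else:
            group = [e]
            groups.append(group)
            open_groups[key] = group
    return groups
```

Measuring the fuzz from the latest entry in a group lets a slow commit
stretch past twenty minutes in total, which may or may not be what you
want; measuring from the first entry is the stricter alternative.
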
> I'd suggest operating directly on the Subversion repository, using the
> FS library. It'll be faster, and there will be fewer components that
> can introduce extraneous errors.
>
> You'll need to recognize branches by comparing branch tags.
> Challenging.
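
On the branch question, a first step might be telling branch tags apart
from ordinary tags in each file's RCS symbol list. The sketch below
assumes the symbols have already been parsed into a dict per file (that
input shape and both function names are hypothetical). It relies on
CVS's "magic branch numbers", where a branch tag is stored as something
like 1.2.0.4, and on branch numbers having an odd number of components,
as with the vendor branch 1.1.1.

```python
def classify_symbol(revision):
    """Return "branch" or "tag" for the revision number a symbol points at."""
    parts = revision.split(".")
    if len(parts) % 2 == 1:
        return "branch"  # odd component count, e.g. vendor branch 1.1.1
    if len(parts) >= 4 and parts[-2] == "0":
        return "branch"  # magic branch number, e.g. 1.2.0.4
    return "tag"         # plain revision such as 1.4 or 1.2.4.3


def branch_names(symbols_by_file):
    """Map each branch tag name to the set of files that carry it.

    symbols_by_file: path -> {symbol name: revision number}
    (a hypothetical shape for already-parsed ,v symbol lists).
    """
    branches = {}
    for path, symbols in symbols_by_file.items():
        for name, rev in symbols.items():
            if classify_symbol(rev) == "branch":
                branches.setdefault(name, set()).add(path)
    return branches
```

Files sharing a branch tag name would then be candidates for the same
Subversion branch; deciding where each one actually forked is the part
that stays challenging.
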
Received on Sat Oct 21 14:36:09 2006