
Re: Diffs to cvs2svn.py

From: Daniel Berlin <dan_at_dberlin.org>
Date: 2002-02-03 07:39:28 CET

On Sat, 2 Feb 2002, Daniel Berlin wrote:

> One new file (the LRU cache module, which isn't mine and, I just
> noticed, has no author name in it; I'll add proper attribution later),
> and a diff.
>
> The bindings diffs are coming up next.
>
> For those interested in speed (which is, of course, probably the top
> concern), passes 1, 2, and 3 could be sped up by a faster RCS parser.
>
> Pass 4 can be sped up by using the faster RCS parser to get at the
> revision log messages (though this is only a constant-factor speedup,
> since the CVS parser it's using currently ignores the deltatext
> entirely, and the cache hit rate is very high unless you make
> completely random changes to your repository :P), and probably more so
> by not forking to co, which would mean rewriting a revision extractor
> in C and SWIG-wrapping it, and/or using the reverse application I
> suggested so we don't get O(n^2) behavior. Forking to co still gives
> us O(n^2) behavior; it just makes the constant a whole lot lower.
> You'd have to reverse-apply the diff commands to get O(n) behavior. I
> don't know if the constant of doing this in Python would be low enough
> to beat out co; I'll investigate.
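
To put rough numbers on the O(n^2) point above: if each of n trunk revisions is extracted independently by walking reverse deltas down from the head, the total number of delta applications grows quadratically in n. A small sketch of that cost model (the model is an illustration, not code from cvs2svn.py):

```python
def delta_applications(n):
    """Total delta applications needed to extract all n trunk revisions,
    each one independently from the head fulltext: revision k (1-based)
    needs n - k applications, so the total is n*(n-1)/2."""
    return sum(n - k for k in range(1, n + 1))

# For the 13000-revision ChangeLog example this is about 84.5 million
# delta applications, versus 12999 for a single O(n) reverse pass.
print(delta_applications(13000))
```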

Of course, it now occurs to me that they chose an algorithm that can't be
reversed (i.e., you can't take the base text of 1.1, apply the diff for
1.2 to it, and get 1.2), except on branches, where they already use the
branchpoint as the base text and forward-apply diffs.

So the best you can do in terms of speed, without decoding every revision
ahead of time in reverse order, which is completely infeasible (take my
ChangeLog example with 13000 revisions: storing the 300k fulltext 13000
times is roughly 3.9 gig of memory or disk space), is probably to choose
smarter starting points. That is, rather than applying every diff from the
head down to 1.1 to get 1.1, then doing it all again to get 1.2, and so
on, you start by fully decoding one tenth of the revisions in O(n) time
(since you are working in the right order to get this behavior), then, for
each revision you want, choose the decoded text closest to it and apply
diffs normally until you reach that revision. When you get past a fully
decoded text's revision, delete it, since we'll never need it again.
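
The checkpoint scheme above can be sketched as follows. Every name here is hypothetical rather than taken from cvs2svn.py, and to keep the example self-contained a "delta" is simply the full target text, so apply_delta is a trivial replacement instead of a real RCS edit-script applier:

```python
def apply_delta(text, delta):
    # Toy stand-in: a real implementation would apply RCS edit commands;
    # here a "delta" is just the full target text.
    return delta

def build_checkpoints(head_text, deltas, stride=10):
    """Fully decode every `stride`-th revision in one O(n) pass.

    Revisions are numbered 0..len(deltas); revision len(deltas) is the
    head, and deltas[i] recovers revision i's text from revision i+1's
    (reverse deltas, as on an RCS trunk).
    """
    checkpoints = {len(deltas): head_text}
    text = head_text
    for i in range(len(deltas) - 1, -1, -1):
        text = apply_delta(text, deltas[i])
        if i % stride == 0:
            checkpoints[i] = text
    return checkpoints

def extract(rev, checkpoints, deltas):
    """Rebuild revision `rev` from the nearest checkpoint at or above it,
    applying at most `stride` deltas instead of starting from the head.
    Once a sweep over revisions moves below a checkpoint, that entry can
    be deleted to bound memory, as described above."""
    base = min(i for i in checkpoints if i >= rev)
    text = checkpoints[base]
    for i in range(base - 1, rev - 1, -1):
        text = apply_delta(text, deltas[i])
    return text
```

With n revisions and a stride of n/10 checkpoints, each extraction touches at most stride deltas, trading the 13000-fulltext memory blowup for a tenth of it.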

Implementing this in Python makes it about as fast as co. Implementing it
in a C extension would probably beat co by as much as co beats the
original Python approach on files with large numbers of revisions.

 
--Dan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:37:03 2006

This is an archived mail posted to the Subversion Dev mailing list.
