On Friday, June 27, 2003, at 4:49 PM, Greg Stein wrote:
> On Fri, Jun 27, 2003 at 01:43:26PM -0400, Daniel Berlin wrote:
>> On Friday, June 27, 2003, at 12:41 PM, mark benedetto king wrote:
>>> On Fri, Jun 27, 2003 at 10:14:16AM -0500, kfogel@collab.net wrote:
>>>> which I'm pretty sure does not support the extended RCS file
>>>> definition.
>
> rcsparse does not support the "extended" RCS definition. I prefer
> standards over "extensions". My standard knee-jerk is to say "screw
> the CVSNT guys. go get your tools from them if they're going to
> monkey the format." But the more reasonable side of me says to temper
> that :-)
:P
>
>>>> Maybe it should, but then we'd probably also have to get
>>>> an improved version of RCS 'co', or find some other means to obtain
>>>> specific revisions from a ,v file...
>
> We've already got Python code to fetch fulltexts. But...
>
>> ...
>> The reason we use it is because it was *much* faster than doing the
>> delta application in python the inefficient way (applying deltas one
>> by one), which is what cvs2svn did originally.
>
> Right. I had that in there, and Dan changed it over to 'co' :-) (about
> a year ago, when he provided the monster patch to take cvs2svn the
> last mile to building a repository)
This was because it took *minutes* to apply the deltas for some of the
gcc CVS files, whereas co took 2 seconds, IIRC.
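For context, fetching one revision that way is just a pipe to the RCS
tool. A minimal sketch (not the actual cvs2svn code):

    import os

    def fulltext_via_co(rcs_file, rev):
        # 'co -q -p<rev>' prints the requested revision of the ,v file
        # on stdout
        pipe = os.popen("co -q -p%s %s" % (rev, rcs_file))
        text = pipe.read()
        pipe.close()
        return text

    # e.g. fulltext_via_co('foo.c,v', '1.5')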
>
>> ...
>>> Instead, I used a slightly modified rcsparse to extract not only the
>>> change metadata, but the deltas themselves, and the fulltext of the
>>> HEAD.
>>>
>>> I took HEAD and picked up the deltas in reverse order, reconstructing
>>> all of the fulltexts in N passes (there were no branches in these
>>> rcs files).
>
> I'll note that one of Daniel's patches did this -- caching the
> fulltexts as they were built. But I didn't fold in that part.
> Eventually, the switch to 'co' was made, and we just stopped worrying
> about the caching.
Right.
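For reference, the reverse-application-with-caching approach has
roughly this shape. apply_rcs_delta here is my own stand-in for the
delta code, not rcsparse's or cvs2svn's actual API, and head_lines /
trunk_deltas are made-up names:

    def apply_rcs_delta(base_lines, delta_text):
        # base_lines: list of lines; delta_text: one RCS delta in
        # "diff -n" form (a<line> <count> / d<line> <count> commands)
        lines = base_lines[:]
        offset = 0                        # net shift from earlier edits
        dlines = delta_text.splitlines(True)
        i = 0
        while i < len(dlines):
            cmd = dlines[i].split()
            i = i + 1
            op = cmd[0][0]
            start, count = int(cmd[0][1:]), int(cmd[1])
            if op == 'd':                 # delete count lines at start
                del lines[start - 1 + offset : start - 1 + offset + count]
                offset = offset - count
            else:                         # 'a': add count lines after start
                lines[start + offset : start + offset] = dlines[i:i + count]
                offset = offset + count
                i = i + count
        return lines

    # Walk down the trunk from HEAD, caching each fulltext as built:
    fulltexts = {head_rev: head_lines}
    prev = head_lines
    for rev, delta in trunk_deltas:       # revisions just below HEAD, downward
        prev = apply_rcs_delta(prev, delta)
        fulltexts[rev] = prev

That way each fulltext only has to be built once, instead of
re-applying the whole chain of deltas for every revision you want.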
>
>>> This gave me a tremendous speedup, but wouldn't it also allow us to
>>> remove the requirement for "co"?
>>
>> Only if it's faster than co.
>
> Right. I know from tests that 'co' is *very* much faster than any
> algorithm based around rcsparse. Even with the 'tparse' parser plugged
> under rcsparse, we can't assemble the fulltexts as fast as co.
I actually came to a conclusion about this, if you remember: it was so
slow because of the string copying forced by Python's immutable
strings. The profiler pegged the lines doing the actual subtract and
add work as the time hogs.
Recently, I had the same type of fun when I needed to fart around with
large bug reports that were stored as text files. (I wrote the Python
script to convert gcc from GNATS to Bugzilla, after rewriting the Perl
script in Bugzilla's contrib dir and deciding it was horrific to
maintain that way, because of the number of changes I had to make.)
Because of the way GNATS fields work, you end up doing a line-by-line
walkthrough of the file and, for multiline fields, appending to the
current piece of text until you hit a new field delimiter.
The relevance is that this is the same type of operation we were doing
when applying the deltas, i.e. string appending.
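In code, that pattern looks roughly like this (the field handling is
illustrative and from memory, not the actual conversion script):

    fields = {}
    current = None
    for line in open('some-pr.txt'):
        if line.startswith('>'):       # GNATS fields look like ">Description:"
            current = line[1:].split(':', 1)[0]
            fields[current] = ''
        elif current is not None:
            fields[current] += line    # += copies the whole field every time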
Using Python strings, it took *forever*, because += makes a copy every
time. However, I discovered (though it's actually a well-known trick
on the Python mailing lists) that you can use the array module to get
mutable strings that are *very* fast.
The time to parse the fields of one large bug report went from 45
seconds to 0.5 seconds.
Thus, by using the array module instead of regular Python strings, it
may be feasible to do the fulltext assembly and delta application in
Python.
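For anyone curious, the trick looks roughly like this (Python 2 of the
era; the names are made up, not code from either script):

    from array import array

    buf = array('c')            # mutable buffer of chars
    for chunk in chunks:        # e.g. the lines of a multiline field
        buf.fromstring(chunk)   # appends in place, no copy of the buffer
    text = buf.tostring()       # one copy at the end, when the text is needed

This avoids the wholesale copy that += does on every append, which is
where the 45-second version was spending its time.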
--Dan