On Mon, Mar 22, 2010 at 11:51 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Sun, 2010-03-21, Johan Corveleyn wrote:
[ ... ]
>> My idea for client-side improvement:
>> The client, while receiving binary diffs for successive revisions, could
>> calculate a byte-wise annotation of the file. (It probably would not have to
>> reconstruct full texts to do that.) Then, at the end, it could split the
>> byte-wise annotation at the file's line boundaries and report the highest
>> revision number within each line.
> Hi Johan.
> Please accept my sincere apologies for not having replied to your
> private email about this. I remember discussing this with you but have
> not thought more about it recently.
No problem (I wasn't sure you'd pick up my private email, so mailing
the list was plan B anyway :-)). It's probably better this way: all
the info is directly on the list. Thanks a bunch for your explanation of
the algorithm. It makes a lot of sense. Now I just need to implement it.
Small disclaimer: I hope I can get this rolling, but progress might be
slow at times (and my email responses as well). That's because I'll
have to work on this after I've put my kids to bed, and before my
brain starts to shut down :-) ...
[ ... snipped the actual algorithm description ... first have to chew
a little bit on it ... ]
>> I might also take a look at Bert's suggestion
>> (reversing the algorithm).
>> If anyone could help me get started with this (more specifics about
>> the above idea(s), or other ideas that could give a significant boost;
>> any pointers, parts of the code where I could get started, ...), that
>> would be very much appreciated.
> The important first step is to do some profiling to find out how much
> time is being spent -
> - on the server, calculating full texts from the stored diffs?
> - transmitting data over the network?
> - on the client, producing diffs?
> - on the client, producing blame info from diffs?
> I've just re-read your message from last year
> <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2079688> in which you report it's mainly client-side CPU bound. That's good to know. (That thread is linked from the issue.) So next, you would be well advised to profile the execution of the client code and see whether the textual diffs or some other part of the processing are taking the most time.
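One hypothetical way to get a first-order breakdown along the lines Julian lists, assuming the stages can be timed separately (the stage names and sleep workloads below are stand-ins, not real Subversion calls):

```python
# Sketch: time each stage of a blame run and print a percentage breakdown,
# to see whether fetching, diffing, or blame assembly dominates.
import time

def profile_stages(stages):
    """stages: list of (label, zero-arg callable); returns {label: seconds}."""
    timings = {}
    for label, fn in stages:
        start = time.perf_counter()
        fn()
        timings[label] = time.perf_counter() - start
    total = sum(timings.values()) or 1.0
    for label, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{label:<15} {secs:8.3f}s  {100 * secs / total:5.1f}%")
    return timings

# Stand-in workloads for the stages in question:
profile_stages([
    ("fetch deltas", lambda: time.sleep(0.02)),   # server / network side
    ("produce diffs", lambda: time.sleep(0.05)),  # client-side diffing
    ("build blame", lambda: time.sleep(0.01)),    # merging diffs into blame info
])
```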
I'll certainly do some more profiling. I'm not too sure anymore about
my findings from last year ("mainly client-side cpu bound"). The
client is certainly a big bottleneck (the client CPU is working its ass
off most of the time), but I think there is more server-boundedness
than I thought at first.
> The clear advantage of the binary diff algorithm is that on the client
> it neither calculates diffs nor even looks at the older revisions of the
> file's text, so that might be a big win if it works at all.
> As regards reversing the current algorithm, first you might want to find
> out what is the oldest revision number that shows up in the blame of
> your typical file, and compare that with all the changes in the file's
> history. Is it only reporting the most recent 1% of the file's changes,
> in which case the reversed algorithm could achieve a 100-fold speed-up,
> or 20%, or 80%?
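A quick back-of-the-envelope check of that question could look like this (hypothetical Python; in practice the two inputs would come from parsing `svn blame` and `svn log` output):

```python
# How far back does the oldest blamed revision reach, relative to the
# file's full change history? The answer bounds what a reversed,
# stop-early blame algorithm could skip.

def reversed_blame_saving(line_revs, changed_revs):
    """line_revs: blamed revision per line; changed_revs: all revisions
    (ascending) in which the file changed. Returns the fraction of
    changes a reversed algorithm would never need to process."""
    oldest_needed = min(line_revs)
    skippable = [r for r in changed_revs if r < oldest_needed]
    return len(skippable) / len(changed_revs)

# Example: file changed in 100 revisions, every line last touched at r90
# or later -> 89% of the history need not be processed.
changed = list(range(1, 101))
lines = [90, 95, 100, 92]
print(f"{reversed_blame_saving(lines, changed):.0%}")  # prints "89%"
```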
Thinking more about it, I don't think that reversing the algorithm and
stopping early will buy us much. I believe that in most cases
(certainly if you ignore whitespace changes, which always made the
most sense to me), you'll often have to go back to the very earliest
revisions, even if only for that one single header line that has
never changed since day 1.
In the case of our XML file, I'm sure that the "xml declaration" (the
line with "<?xml version="1.0"?>"), and other things like the root tag
have not changed since the first addition of the file. I think the
same may be true for traditional source files: for Java classes, the
"public class Bla" line might still be the original one (if the class
wasn't renamed); then again, if someone adds "extends" or "implements"
it might not, but the last "}" line probably will be.
Anyway, I'll first try to dig into the binary blaming. If I get that
far, I'll see if it's still worth it to go for other optimizations ...
Received on 2010-03-22 21:59:16 CET