On Fri, Aug 20, 2010 at 9:11 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>> After eliminating antivirus, and running with a release build instead
>> of a debug build, svn diff is just about on par with GNU diff. So this
>> eliminates the option of optimizing diff ...
> Unless ...
> For every diff during blame calculation, tokens (lines) are extracted
> and processed each time for the "original" and the "modified". This
> takes up most of the time of svn diff. However, the file that is
> playing the "modified" role in one step, will be the "original" in the
> next step of blame processing. So in theory we already have those
> tokens from the previous step, and don't have to extract/compare/...
> them again.
> If one could find a way to recycle the tokens from the "modified" of
> the previous diff into the tokens of the "original" of the next diff,
> that would probably make the diff part of the blame operation almost
> twice as fast. And since diffing still accounts for ~90% of blame time
> on the client, that should speed it up considerably.
> Sounds like a plan?
> I'll try to write some sort of POC for this idea soon, unless someone
> tells me it's a stupid idea :-).
Hm, after several attempts to get this working, it seems to be a dead
end. Whatever I do, I can't seem to avoid lifetime issues, dangling
pointers, tokens referring to files which are no longer there, ...
I first tried just to pass through the position_list (position_list
from the modified file) to the next diff run as the new
position_list. I did this by returning the position_list in an
output parameter, up to the call in blame.c!add_file_blame, and then
reusing that as input parameter for the next diff call (by stashing it
away in the file_rev_baton, until the next run). That works, but it
doesn't help, because the real power of the algorithm is in the tree,
in which all the tokens from both original and modified are sorted and
made unique (i.e. each unique token is represented only once in the
tree). After the tree is correctly set up, the rest of the algo is
very fast, because it only needs to compare token references to find
the longest common subsequence.
At the end of each run the tree is filled with a combination of the
tokens of original and modified, so I can't just reuse/recycle that,
because that would lead to a major memory leak (building up the tree
with all the tokens from all revisions of the entire blame operation).
And refilling a clean tree with the tokens already present in the
passed-on-position_list also doesn't work, because the token content
really needs to still be accessible (either in memory, or readable
from the original file), so that token_compare can work when adding
the new tokens from the new modified file ...
Anyway, I don't know if this still makes sense. I'm just noting it
down to order my thoughts, and maybe help someone else thinking about
this. I'm gonna let it rest now, because I seem to have hit a brick
wall with this.
Will focus my efforts on other approaches to speed up blame, such as
the fundamental changes of binary blaming, avoiding diffs, ... but I'm
not sure if I'll be able to (have the time to) climb the learning
curve for that.
Received on 2010-08-24 10:19:18 CEST