[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 12 Oct 2010 00:31:25 +0200

On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> > So I wrote a patch - attached - that refactors this into an array of 4
>> > sub-structures, and simplifies all the code that uses them.
> [...]
>> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> So please go ahead and commit the refactoring.
>
> OK, committed in r1021282.

Thanks, looks much more manageable now.

>> Also, maybe the last_chunk number could be included in the file_info
>> struct? Now it's calculated in several places: "last_chunk0 =
>> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> every time as an extra argument. Seems like the sort of info that
>> could be part of file_info.
>
> It looks to me like you only need to calculate it in exactly one place:
> in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> for efficiency, as it's a trivial calculation.

Hm, I don't know. That's recalculating it for every byte in the
prefix. In the files I'm testing there could be 1 Mb or more of
prefix, so 1 million times this calculation (and that for every diff
in say 2000 diffs for a blame operation). Unless the compiler can
optimize that to a single calculation, because it never changes (I
don't know enough about compilers to know if this is possible), it
might have an impact, even if it's a trivial calculation.

OTOH, I might be in premature-optimization-land here, the impact might
be negligible. If I find some time I'll test it.

> And in a later email you wrote:
>> [...] Maybe I can give it a go,
>> first suffix then prefix, and see if I can find real-life problems ...
>
> There's no need to re-visit that.  I think it's fine as is it, using a
> separate pair of buffers.

I've been thinking some more about that. For relatively small files,
smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
Kb, it all fits in one chunk, yet I still read it twice, once for
suffix and once for prefix (and further token processing). Ok, caching
by the OS might make this insignificant, but still.

Since most common source files in repositories are probably smaller
than 128 Kb I might be hurting the common case, in cases where it's
not compensated by performance gain from identical prefix/suffix.

Then again, maybe this can be solved with a specific optimization,
without changing the current code too much: if file size is smaller
than chunk size, I point the suffix buffer to the prefix buffer and
I'm done.

Cheers,

-- 
Johan
Received on 2010-10-12 00:32:02 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.