[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Tue, 12 Oct 2010 11:10:53 +0100

On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> > sub-structures, and simplifies all the code that uses them.
> > [...]
> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> So please go ahead and commit the refactoring.
> >
> > OK, committed in r1021282.
>
> Thanks, looks much more manageable now.

I'd like to see a simplified version of your last patch, taking
advantage of that, before you go exploring other options.

> >> Also, maybe the last_chunk number could be included in the file_info
> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> every time as an extra argument. Seems like the sort of info that
> >> could be part of file_info.
> >
> > It looks to me like you only need to calculate it in exactly one place:
> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating
> > for efficiency, as it's a trivial calculation.
>
> Hm, I don't know. That's recalculating it for every byte in the
> prefix. In the files I'm testing there could be 1 Mb or more of

No, no, not every byte - only 1 in 131072 of the calls need to calculate
it - only when it reaches the end of a block.

May I ask whether you've tested the code that crosses from one chunk to
another? I noticed the following, just now:

> + *at_least_one_end_reached =
> + file_baton->curp[idx0] == file_baton->endp[idx0]
> + || file_baton->curp[idx1] == file_baton->endp[idx1];
> + }
> +
> + /* If both files reached their end (i.e. are fully identical),
> we're done */
> + if (file_baton->curp[idx0] == file_baton->endp[idx0]
> + && file_baton->curp[idx1] == file_baton->endp[idx1])

Those tests seem to be saying "if we're at the end of the current chunk
then we're at the end of the file". That looks wrong. What do you
think?

In order to test this more easily than creating huge files, I wonder if
you can change CHUNK_SIZE and CHUNK_SHIFT to something much smaller such
as 8 bytes. I'm trying this now - not with your patch, first, but with
just a plain trunk build, to see if it works.

- Julian

> > And in a later email you wrote:
> >> [...] Maybe I can give it a go,
> >> first suffix then prefix, and see if I can find real-life problems ...
> >
> > There's no need to re-visit that. I think it's fine as is it, using a
> > separate pair of buffers.
>
> I've been thinking some more about that. For relatively small files,
> smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
> Kb, it all fits in one chunk, yet I still read it twice, once for
> suffix and once for prefix (and further token processing). Ok, caching
> by the OS might make this insignificant, but still.
>
> Since most common source files in repositories are probably smaller
> than 128 Kb I might be hurting the common case, in cases where it's
> not compensated by performance gain from identical prefix/suffix.
>
> Then again, maybe this can be solved with a specific optimization,
> without changing the current code too much: if file size is smaller
> than chunk size, I point the suffix buffer to the prefix buffer and
> I'm done.
Received on 2010-10-12 12:11:38 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.