[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Tue, 12 Oct 2010 14:05:46 +0100

On Tue, 2010-10-12, Johan Corveleyn wrote:
> On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> >> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> >> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> >> > sub-structures, and simplifies all the code that uses them.
> >> > [...]
> >> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> >> So please go ahead and commit the refactoring.
> >> >
> >> > OK, committed in r1021282.
> >>
> >> Thanks, looks much more manageable now.
> >
> > I'd like to see a simplified version of your last patch, taking
> > advantage of that, before you go exploring other options.
>
> Yes, I'll certainly do that in the coming days.
>
> >> >> Also, maybe the last_chunk number could be included in the file_info
> >> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> >> every time as an extra argument. Seems like the sort of info that
> >> >> could be part of file_info.
> >> >
> >> > It looks to me like you only need to calculate it in exactly one place:
> >> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating
> >> > for efficiency, as it's a trivial calculation.
> >>
> >> Hm, I don't know. That's recalculating it for every byte in the
> >> prefix. In the files I'm testing there could be 1 Mb or more of
> >
> > No, no, not every byte - only 1 in 131072 of the calls need to calculate
> > it - only when it reaches the end of a block.
>
> Oh right. I guess it's ok then.
>
> > May I ask whether you've tested the code that crosses from one chunk to
> > another? I noticed the following, just now:
> >
> >> + *at_least_one_end_reached =
> >> + file_baton->curp[idx0] == file_baton->endp[idx0]
> >> + || file_baton->curp[idx1] == file_baton->endp[idx1];
> >> + }
> >> +
> >> + /* If both files reached their end (i.e. are fully identical),
> >> we're done */
> >> + if (file_baton->curp[idx0] == file_baton->endp[idx0]
> >> + && file_baton->curp[idx1] == file_baton->endp[idx1])
> >
> > Those tests seem to be saying "if we're at the end of the current chunk
> > then we're at the end of the file". That looks wrong. What do you
> > think?
>
> Well, it does work :-). The code is correct, but it's kind of
> difficult to see, so that's my fault.
>
> The reason is that increment_pointer_or_chunk takes care that curp
> never equals endp (if curp==endp-1 it reads the next chunk), unless it
> reaches end of file. See this snippet of increment_pointer_or_chunk:
> + if (*chunk_number == last_chunk_number)
> + (*curp)++; /* *curp == *endp with last chunk signals end of file */
>
> If you think this is too obscure to handle this, I agree, but I
> couldn't think of a better/easier way at that time. If you have any
> suggestions, feel free :-). But maybe I should refactor the patch
> first, so it's easier to see how this would work out best.
>
> The same "obscurity" also applies to reaching "beginning of file"
> (which can happen in decrement_pointer_or_chunk, either during prefix
> scanning (rolling back the first line) or during suffix scanning (if
> there was no prefix, but still a difference in the first line)). There
> it's done by setting the chunk number to -1, which is checked for in
> find_identical_prefix and find_identical_suffix:
> + if (*chunk_number == 0)
> + (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */
>
> Not ideal, but it works ...

Oh OK, thanks for explaining. No further comment, m'lord :-)

- Julian
Received on 2010-10-12 15:06:29 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.