Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Mon, 11 Oct 2010 10:53:53 +0100

On Sat, 2010-10-09, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> Ok, third iteration of the patch in attachment. It passes make check.
> >>
> >> As discussed in [1], this version keeps 50 lines of the identical
> >> suffix around, to give the algorithm a good chance to generate a diff
> >> output of good quality (in all but the most extreme cases, this will
> >> be the same as with the original svn_diff algorithm).
> >>
> >> That's about the only difference with the previous iteration. So for
> >> now, I'm submitting this for review. Any feedback is very welcome :-).
> >
> > Hi Johan.
>
> Hi Julian,
>
> Thanks for taking a look.
>
> > I haven't reviewed it, but after seeing today's discussion I had just
> > scrolled quickly through the previous version of this patch. I noticed
> > that the two main functions - find_identical_suffix and
> > find_identical_suffix - are both quite similar (but not quite similar
> > enough to make them easily share implementation) and both quite long,
> > and I noticed you wrote in an earlier email that you were finding it
> > hard to make the code readable. I have a suggestion that may help.
[...]
> > So I wrote a patch - attached - that refactors this into an array of 4
> > sub-structures, and simplifies all the code that uses them.
[...]
> Yes, great idea! That would indeed vastly simplify a lot of the code.
> So please go ahead and commit the refactoring.

OK, committed in r1021282.

> Also, maybe the last_chunk number could be included in the file_info
> struct? Now it's calculated in several places: "last_chunk0 =
> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> every time as an extra argument. Seems like the sort of info that
> could be part of file_info.

It looks to me like you only need to calculate it in exactly one place:
in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating
for efficiency, as it's a trivial calculation.

> One more thing: you might have noticed that for find_identical_suffix
> I use other buffers, chunks, curp's, endp's, ... than for the prefix.
> For prefix scanning I can just use the stuff from the diff_baton,
> because after prefix scanning has finished, everything is buffered and
> pointing correctly for the normal algorithm to continue (i.e.
> everything points at the first byte of the first non-identical line).
> For suffix scanning I need to use other structures (newly alloc'd
> buffer etc), so as to not mess with those pointers/buffers from the
> diff_baton.

> Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

> So: I think I'll need the file_info struct to be available out of the
> diff_baton_t struct as well, so I can use this in suffix scanning
> also.

Yes; I gave the struct type a name - "struct file_info" - so you can use
it in other places.

> (side-note: I considered first doing suffix scanning, then prefix
> scanning, so I could reuse the buffers/pointers from diff_baton all
> the time, and still have everything pointing correctly after
> eliminating prefix/suffix. But that could give vastly different
> results in some cases, for instance when original file is entirely
> identical to both the prefix and the suffix of the modified file. So I
> decided it's best to stick with first prefix, then suffix).

And in a later email you wrote:
> [...] Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

There's no need to re-visit that. I think it's fine as is it, using a
separate pair of buffers.

> > Responding to some of the other points you mentioned in a much earlier
> > mail:
> >
> >> 3) It's only implemented for 2 files. I'd like to generalize this for
> >> an array of datasources, so it can also be used for diff3 and diff4.
> >>
> >> 4) As a small hack, I had to add a flag "datasource_opened" to
> >> token.c#svn_diff__get_tokens, so I could have different behavior for
> >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> >> no longer needed.
> >
> > Yes, I'd like to see 3), and so hack 4) will go away.
>
> I'm wondering though how I should represent the datasources to pass
> into datasources_open. An array combined with a length parameter?
> Something like:
>
> static svn_error_t *
> datasources_open(void *baton, apr_off_t *prefix_lines,
> svn_diff_datasource_e[] datasources, int datasources_len)
>
> ? And then use for loops everywhere I now do things twice for the two
> datasources?

I haven't looked at how datasources_open() API is used or what form of
API would be best.

For the internal functions, you can now pass around an array of
file_info pointers rather than indexes.

> >> 5) I've struggled with making the code readable/understandable. It's
> >> likely that there is still a lot of room for improvement. I also
> >> probably need to document it some more.
> >
> > You need to write a full doc string for datasources_open(), at least.
> > It needs especially to say how it relates to datasource_open() - why
> > should the caller call this function rather than that function, and are
> > they both necessary - and how the caller is supposed to use the
> > 'prefix_lines' parameter. And doc strings for
> > increment/decrement_...().
> >
> > But this makes me think, it looks to me like this whole
> > prefix-suffix-skipping functionality would fit better inside the
> > lower-level diff algorithm rather than inside the "datasources_open"
> > function. Do you agree?
> >
> > It works as it is, but you were talking about wanting it to obey the
> > standard token-parsing rules such as "ignore white space", and so on.
> > It seems that things like this would be much better one level down.
>
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).

I expect you're right. You've studied this; I haven't.

> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

My gut feeling is that's probably not the way to go.

I'll respond separately to your follow-up email about this.

> >> 6) I've not paid too much attention to low level optimizations, so
> >> here also there's probably room for improvement, which may be
> >> significant for the critical loops.
> >
> > Maybe. Not important at this stage.
> >
> >> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> >> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> >> should probably make all "chunks" of type apr_off_t instead of int
> >> (but I have not done that yet, because the original implementation
> >> also used int for the chunk in the svn_diff__file_baton_t struct).
> >> Should I do this (also in the baton struct)? Or should I use an
> >> explicit cast somewhere?
> >
> > I recommend using an explicit cast where needed, in this patch.
> > Changing the 'chunk' data type everywhere could be done in a separate
> > patch, either before or after this patch.
>
> Ok, will do.
>
> One more thing I was wondering about: maybe increment_... and
> decrement_... should (eventually) be implemented as macro's? In fact,
> they are mostly just replacements for curp++ and curp-- but taking
> into account that we have chunks. Just a thought ...

If profiling shows that the function-call overghead is too high, then I
suggest that you make just the simple part of it a macro, like this: "if
((f)->curp < (f)->endp - 1) (f)->curp++; else function_call(...);".

> Thanks again for your input.

My pleasure.

- Julian
Received on 2010-10-11 11:54:35 CEST

This message: [ Message body ]
Next message: Julian Foad: "Re: Problem with external file"
Previous message: Bert Huijben: "RE: svn commit: r1004286 - in /subversion/trunk: ./ subversion/libsvn_subr/io.c"
In reply to: Johan Corveleyn: "Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster"
Next in thread: Johan Corveleyn: "Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster"
Reply: Johan Corveleyn: "Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]