[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sat, 9 Oct 2010 14:21:09 +0200

On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> Ok, third iteration of the patch in attachment. It passes make check.
>>
>> As discussed in [1], this version keeps 50 lines of the identical
>> suffix around, to give the algorithm a good chance to generate a diff
>> output of good quality (in all but the most extreme cases, this will
>> be the same as with the original svn_diff algorithm).
>>
>> That's about the only difference with the previous iteration. So for
>> now, I'm submitting this for review. Any feedback is very welcome :-).
>
> Hi Johan.

Hi Julian,

Thanks for taking a look.

> I haven't reviewed it, but after seeing today's discussion I had just
> scrolled quickly through the previous version of this patch.  I noticed
> that the two main functions - find_identical_suffix and
> find_identical_suffix - are both quite similar (but not quite similar
> enough to make them easily share implementation) and both quite long,
> and I noticed you wrote in an earlier email that you were finding it
> hard to make the code readable.  I have a suggestion that may help.
>
> I think the existing structure of the svn_diff__file_baton_t is
> unhelpful:
> {
>  const svn_diff_file_options_t *options;
>  const char *path[4];
>
>  apr_file_t *file[4];
>  apr_off_t size[4];
>
>  int chunk[4];
>  char *buffer[4];
>  char *curp[4];
>  char *endp[4];
>
>  /* List of free tokens that may be reused. */
>  svn_diff__file_token_t *tokens;
>
>  svn_diff__normalize_state_t normalize_state[4];
>
>  apr_pool_t *pool;
> } svn_diff__file_baton_t;
>
> All those array[4] fields are logically related, but this layout forces
> the programmer to address them individually.
>
> So I wrote a patch - attached - that refactors this into an array of 4
> sub-structures, and simplifies all the code that uses them.
>
> I think this will help you to get better code clarity because then your
> increment_pointer_or_chunk() for example will be able to take a single
> pointer to a file_info structure instead of a lot of pointers to
> individual members of the same.
>
> Would you take a look and let me know if you agree.  If so, I can commit
> the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code.
So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info
struct? Now it's calculated in several places: "last_chunk0 =
offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
every time as an extra argument. Seems like the sort of info that
could be part of file_info.

One more thing: you might have noticed that for find_identical_suffix
I use other buffers, chunks, curp's, endp's, ... than for the prefix.
For prefix scanning I can just use the stuff from the diff_baton,
because after prefix scanning has finished, everything is buffered and
pointing correctly for the normal algorithm to continue (i.e.
everything points at the first byte of the first non-identical line).
For suffix scanning I need to use other structures (newly alloc'd
buffer etc), so as to not mess with those pointers/buffers from the
diff_baton.

So: I think I'll need the file_info struct to be available out of the
diff_baton_t struct as well, so I can use this in suffix scanning
also.

(side-note: I considered first doing suffix scanning, then prefix
scanning, so I could reuse the buffers/pointers from diff_baton all
the time, and still have everything pointing correctly after
eliminating prefix/suffix. But that could give vastly different
results in some cases, for instance when original file is entirely
identical to both the prefix and the suffix of the modified file. So I
decided it's best to stick with first prefix, then suffix).

> Responding to some of the other points you mentioned in a much earlier
> mail:
>
>> 3) It's only implemented for 2 files. I'd like to generalize this for
>> an array of datasources, so it can also be used for diff3 and diff4.
>>
>> 4) As a small hack, I had to add a flag "datasource_opened" to
>> token.c#svn_diff__get_tokens, so I could have different behavior for
>> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
>> no longer needed.
>
> Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass
into datasources_open. An array combined with a length parameter?
Something like:

    static svn_error_t *
    datasources_open(void *baton, apr_off_t *prefix_lines,
                     svn_diff_datasource_e[] datasources, int datasources_len)

? And then use for loops everywhere I now do things twice for the two
datasources?

>> 5) I've struggled with making the code readable/understandable. It's
>> likely that there is still a lot of room for improvement. I also
>> probably need to document it some more.
>
> You need to write a full doc string for datasources_open(), at least.
> It needs especially to say how it relates to datasource_open() - why
> should the caller call this function rather than that function, and are
> they both necessary - and how the caller is supposed to use the
> 'prefix_lines' parameter.  And doc strings for
> increment/decrement_...().
>
> But this makes me think, it looks to me like this whole
> prefix-suffix-skipping functionality would fit better inside the
> lower-level diff algorithm rather than inside the "datasources_open"
> function.  Do you agree?
>
> It works as it is, but you were talking about wanting it to obey the
> standard token-parsing rules such as "ignore white space", and so on.
> It seems that things like this would be much better one level down.

Yes, I've been struggling with this. But I can't easily see it fit in
the lower levels right now. Problem is that everything in those lower
levels always acts on 1 datasource at a time (getting all the tokens,
inserting them into the token tree, ... then the same for the next
datasource). The datasource_open seemed to me to be the easiest place
to combine datasources to do things for both of them "concurrently"
(with least impact on the rest of the code).

Maybe those lower-level functions could also be made
"multi-datasource", but I have to think a little bit more about that.

>> 6) I've not paid too much attention to low level optimizations, so
>> here also there's probably room for improvement, which may be
>> significant for the critical loops.
>
> Maybe. Not important at this stage.
>
>> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
>> in diff_file.c#find_identical_suffix. To completely eliminate this, I
>> should probably make all "chunks" of type apr_off_t instead of int
>> (but I have not done that yet, because the original implementation
>> also used int for the chunk in the svn_diff__file_baton_t struct).
>> Should I do this (also in the baton struct)? Or should I use an
>> explicit cast somewhere?
>
> I recommend using an explicit cast where needed, in this patch.
> Changing the 'chunk' data type everywhere could be done in a separate
> patch, either before or after this patch.

Ok, will do.

One more thing I was wondering about: maybe increment_... and
decrement_... should (eventually) be implemented as macro's? In fact,
they are mostly just replacements for curp++ and curp-- but taking
into account that we have chunks. Just a thought ...

Thanks again for your input.

Cheers,

-- 
Johan
Received on 2010-10-09 14:21:51 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.