Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Sat, 09 Oct 2010 01:57:19 +0100

On Sat, 2010-10-09, Johan Corveleyn wrote:
> Ok, third iteration of the patch in attachment. It passes make check.
>
> As discussed in [1], this version keeps 50 lines of the identical
> suffix around, to give the algorithm a good chance to generate a diff
> output of good quality (in all but the most extreme cases, this will
> be the same as with the original svn_diff algorithm).
>
> That's about the only difference with the previous iteration. So for
> now, I'm submitting this for review. Any feedback is very welcome :-).

Hi Johan.

I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch. I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable. I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
const svn_diff_file_options_t *options;
const char *path[4];

apr_file_t *file[4];
apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

/* List of free tokens that may be reused. */
svn_diff__file_token_t *tokens;

svn_diff__normalize_state_t normalize_state[4];

apr_pool_t *pool;
} svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.

So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.

I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.

Would you take a look and let me know if you agree. If so, I can commit
the refactoring straight away.

Responding to some of the other points you mentioned in a much earlier
mail:

> 3) It's only implemented for 2 files. I'd like to generalize this for
> an array of datasources, so it can also be used for diff3 and diff4.
>
> 4) As a small hack, I had to add a flag "datasource_opened" to
> token.c#svn_diff__get_tokens, so I could have different behavior for
> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

> 5) I've struggled with making the code readable/understandable. It's
> likely that there is still a lot of room for improvement. I also
> probably need to document it some more.

You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter. And doc strings for
increment/decrement_...().

But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the "datasources_open"
function. Do you agree?

It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as "ignore white space", and so on.
It seems that things like this would be much better one level down.

> 6) I've not paid too much attention to low level optimizations, so
> here also there's probably room for improvement, which may be
> significant for the critical loops.

Maybe. Not important at this stage.

> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> should probably make all "chunks" of type apr_off_t instead of int
> (but I have not done that yet, because the original implementation
> also used int for the chunk in the svn_diff__file_baton_t struct).
> Should I do this (also in the baton struct)? Or should I use an
> explicit cast somewhere?

I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.

- Julian

> I still consider this a WIP, because of the following remaining todo's
> (which may have a lot of impact on the current implementation):
> - Generalize for more than 2 datasources (for diff3 and diff4).
> - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
> - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
> switch the implementation to read out entire lines before comparing
> (like datasources_get_next_token does).
>
> Log message:
> [[[
> Make svn_diff_diff skip identical prefix and suffix to make diff and blame
> faster.
>
> * subversion/include/svn_diff.h
> (svn_diff_fns_t): Added new function types datasources_open and
> get_prefix_lines to the vtable.
>
> * subversion/libsvn_diff/diff_memory.c
> (datasources_open): New function (does nothing).
> (get_prefix_lines): New function (does nothing).
> (svn_diff__mem_vtable): Added new functions datasources_open and
> get_prefix_lines.
>
> * subversion/libsvn_diff/diff_file.c
> (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
> and suffix_offset_in_chunk.
> (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
> (find_identical_prefix, find_identical_suffix): New functions.
> (datasources_open): New function, to open both datasources and find their
> identical prefix and suffix. From the identical suffix, 50 lines are kept to
> help the diff algorithm find the nicest possible diff representation
> in case of ambiguity.
> (get_prefix_lines): New function.
> (datasource_get_next_token): Stop at start of identical suffix.
> (svn_diff__file_vtable): Added new functions datasources_open and
> get_prefix_lines.
>
> * subversion/libsvn_diff/diff.h
> (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
> the datasource was already opened.
>
> * subversion/libsvn_diff/token.c
> (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
> datasource if datasource_opened is FALSE. Set the starting offset of the
> position list to the number of prefix lines.
>
> * subversion/libsvn_diff/lcs.c
> (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
> the offset of the sentinel position for EOF, even if one of the files
> became empty after eliminating the identical prefix.
>
> * subversion/libsvn_diff/diff.c
> (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
> (svn_diff_diff): Use new function datasources_open, to open original and
> modified at once, and find their identical prefix and suffix. Pass
> prefix_lines to svn_diff__lcs and to svn_diff__diff.
>
> * subversion/libsvn_diff/diff3.c
> (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
> Pass prefix_lines = 0 to svn_diff__lcs.
>
> * subversion/libsvn_diff/diff4.c
> (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
> Pass prefix_lines = 0 to svn_diff__lcs.
> ]]]
>
>
> Cheers,

text/x-patch attachment: diff-refactor-vars.patch

Received on 2010-10-09 02:58:07 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]