[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 20 Oct 2010 00:41:24 +0200

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it -> that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.

Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,

-- 
Johan

Received on 2010-10-20 00:42:26 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.