[WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 27 Sep 2010 04:24:29 +0200

Hi devs,

As discussed in [1], here is a patch that makes svn_diff_diff
(libsvn_diff/diff.c) skip the identical prefix and suffix of the
original and modified files, before starting the LCS (longest common
subsequence) algorithm on the "non-matching" part.

This makes diff a lot faster (especially for large files), thereby
also speeding up blame in environments where the client is the
bottleneck (with a fast enough server and network, blame is
constrained by the speed of diff on the client side).

This patch still has some rough edges (see below), hence the WIP
marker. Nevertheless, it works "most of the time", and it can
demonstrate the speed improvement that can be expected. I will
continue working on the "rough edges", but in the meantime any
feedback/review/thoughts are very much appreciated.

The rough edges:
1) While scanning through the identical prefix, I have to count the
number of lines (to have correct line offsets for the rest of the
algorithm). To do that, I currently count the number of \n's, so it
won't work for files with \r eol-style. Not so difficult to fix
(similar to how diff_file.c#datasource_get_next_token does it), but I
haven't done it yet.

2) Two tests still fail:
- blame_tests.py 2: annotate a binary file
- blame_tests.py 7: blame with different eol styles
Maybe this is because of 1), I'll have to investigate.

3) It's only implemented for 2 files. I'd like to generalize this for
an array of datasources, so it can also be used for diff3 and diff4.

4) As a small hack, I had to add a flag "datasource_opened" to
token.c#svn_diff__get_tokens, so I could have different behavior for
regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
no longer needed.

5) I've struggled with making the code readable/understandable. It's
likely that there is still a lot of room for improvement. I also
probably need to document it some more.

6) I've not paid too much attention to low level optimizations, so
here also there's probably room for improvement, which may be
significant for the critical loops.

7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
in diff_file.c#find_identical_suffix. To completely eliminate this, I
should probably make all "chunks" of type apr_off_t instead of int
(but I have not done that yet, because the original implementation
also used int for the chunk in the svn_diff__file_baton_t struct).
Should I do this (also in the baton struct)? Or should I use an
explicit cast somewhere?

8) A bigger problem: the output of diff/blame is sometimes different
from the original implementation. It's always syntactically correct,
but sometimes the "less unique lines" are taken differently (like when
a new function is added, and diff thinks the new block starts from the
closing brace of the previous function, ... stuff like that). This
happens only because of the identical-suffix-scanning (it doesn't
happen if I only enable the identical-prefix-scanning).

I think I know the cause of this change in behavior: I completely
eliminate the suffix from the LCS-building algorithm. And that
probably causes it to sometimes come up with another "longest common
subsequence". Right now I don't know how to solve this completely. A
workaround might be to add a certain number of suffix-lines to the
"non-matching-block", so they can be part of the LCS-searching. But
probably one can always come up with examples where the number of
suffix-lines is not enough. I'll have to look into this some more.
Ideas welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
(svn_diff_fns_t): Added new function types datasources_open and
get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New fuctions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
(svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix.

* subversion/libsvn_diff/diff3.c
(svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.

* subversion/libsvn_diff/diff4.c
(svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
]]]

Cheers,

-- 
Johan
[1] http://svn.haxx.se/dev/archive-2010-09/0284.shtml

text/plain attachment: patch_diff_identical_prefix_suffix.txt

Received on 2010-09-27 04:25:08 CEST

This message: [ Message body ]
Next message: Ramkumar Ramachandra: "Re: place of svnrdump"
Previous message: Neels J Hofmeyr: "Re: place of svnrdump"
Next in thread: Daniel Shahaf: "Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster"
Reply: Daniel Shahaf: "Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]