
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sun, 3 Oct 2010 01:46:08 +0200

Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n were already OK).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the "EOF" marker
correctly, in case one of the two files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.
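To illustrate the eol-style point above, here is a minimal sketch of
counting the complete lines in an identical prefix while treating \n,
\r\n and a lone \r as line endings. It is not the code from the patch:
the names count_identical_prefix_lines and is_lone_cr are hypothetical,
and it works on two in-memory buffers rather than on chunked files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if the '\r' at index I of BUF is a line
   ending on its own, i.e. it is not the first half of "\r\n". */
int
is_lone_cr(const char *buf, size_t len, size_t i)
{
  return i + 1 >= len || buf[i + 1] != '\n';
}

/* Hypothetical helper: count the complete lines inside the identical
   byte prefix of A and B.  A line ends at "\n" (which also covers
   "\r\n") or at a "\r" that is not followed by "\n" in either buffer. */
size_t
count_identical_prefix_lines(const char *a, size_t a_len,
                             const char *b, size_t b_len)
{
  size_t max = a_len < b_len ? a_len : b_len;
  size_t lines = 0;
  size_t i;

  assert(a != NULL && b != NULL);

  for (i = 0; i < max && a[i] == b[i]; i++)
    {
      if (a[i] == '\n')
        lines++;
      else if (a[i] == '\r'
               && is_lone_cr(a, a_len, i) && is_lone_cr(b, b_len, i))
        lines++;
    }

  return lines;
}
```

The sketch is deliberately conservative: if one buffer has "\r" where
the other has "\r\n", that line is not counted as part of the prefix.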

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of losing the
overview of the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of the
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- Rev svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]
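
For readers who want the general idea without reading the patch, the
prefix/suffix skipping described in the log message can be sketched
roughly as follows. This is a simplified illustration, not the patch
itself: it assumes both files are already split into arrays of lines,
and trim_result_t and trim_identical_ends are hypothetical names (the
real code works on file chunks in diff_file.c).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch (not part of svn_diff). */
typedef struct trim_result_t
{
  size_t prefix_lines;  /* identical lines at the start of both files */
  size_t suffix_lines;  /* identical lines at the end, past the prefix */
} trim_result_t;

/* Hypothetical helper: find the identical prefix and suffix of two
   files given as arrays of NUL-terminated lines. */
trim_result_t
trim_identical_ends(const char *const *orig, size_t orig_len,
                    const char *const *mod, size_t mod_len)
{
  trim_result_t r = { 0, 0 };
  size_t min_len = orig_len < mod_len ? orig_len : mod_len;

  assert(orig != NULL && mod != NULL);

  /* Scan forward while the lines are equal. */
  while (r.prefix_lines < min_len
         && strcmp(orig[r.prefix_lines], mod[r.prefix_lines]) == 0)
    r.prefix_lines++;

  /* Scan backward, but never into the lines the prefix already took. */
  while (r.suffix_lines < min_len - r.prefix_lines
         && strcmp(orig[orig_len - 1 - r.suffix_lines],
                   mod[mod_len - 1 - r.suffix_lines]) == 0)
    r.suffix_lines++;

  return r;
}
```

Only the lines between the prefix and the suffix would then need to go
through the LCS step. With original {"x", "a"} and modified
{"x", "a", "a"}, the prefix swallows the whole original, which is the
"one file became empty after skipping the prefix" case the EOF sentinel
offset has to handle.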

Cheers,

-- 
Johan

Received on 2010-10-03 01:46:47 CEST

This is an archived mail posted to the Subversion Dev mailing list.
