[PATCH] Speed-up of libsvn_diff using token counts

From: Morten Kloster <morklo_at_gmail.com>
Date: Wed, 18 May 2011 01:56:29 +0200

Log message:

Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_token_count): New method.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses index, not pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
(svn_diff__position_t): Changed node member from being
a pointer to the node to an integer index for the node.
Updated method declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.

Comments:
This patch can reduce the time required for a diff by a large factor
when comparing long files with large differences. As an extreme case,
splitting sqlite.c into two roughly equal parts and diffing them against
each other took about a minute on my laptop with the original code,
and 7-8 seconds with my patch. The speed-up is gained only if a
large fraction of the changed tokens are unique to one source or the
other. I can not see that the change would cause a significant
performance degradation in any reasonable case.

Counting the number of times each token appears is also useful for
potential additional features in the future, such as identifying moved
blocks outside of the longest common subsequence.

I have tested only svn_diff_diff_2 (that it gives identical results as
before), but the changes to svn_diff_diff3 and svn_diff_diff4 are
completely analogous. Tested on a Windows 7 64-bit machine,
TortoiseSVN build script.

Morten Kloster (first patch)
Received on 2011-05-18 08:23:15 CEST

This message: [ Message body ]
Next message: Johan Corveleyn: "Re: svn commit: r1104309 - in /subversion/trunk/subversion: libsvn_repos/reporter.c tests/cmdline/lock_tests.py tests/cmdline/update_tests.py"
Previous message: Stefan Küng: "Re: enhancements for error reporting"
Next in thread: Johan Corveleyn: "Re: [PATCH] Speed-up of libsvn_diff using token counts"
Reply: Johan Corveleyn: "Re: [PATCH] Speed-up of libsvn_diff using token counts"
Reply: Mark Phippard: "Re: [PATCH] Speed-up of libsvn_diff using token counts"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]