On Wed, May 18, 2011 at 1:56 AM, Morten Kloster <morklo_at_gmail.com> wrote:
> Log message:
> Speed-up of libsvn_diff using token counts
> By using indices, not node pointers, to refer to tokens, and counting
> the number of each token, the longest common subsequence (lcs)
> algorithm at the heart of libsvn_diff becomes much faster in many
> situations by skipping tokens that are unique to one source or the other.
> * subversion/libsvn_diff/token.c
> (svn_diff__node_t): Added 'index' member.
> (svn_diff__tree_t): Added 'node_count' member.
> (svn_diff__get_token_count): New method.
> (tree_insert_token): Assigns indices to nodes.
> (svn_diff__get_tokens): Uses index, not pointer,
> to identify node in position struct.
> * subversion/libsvn_diff/lcs.c
> (svn_diff__snake): Takes token counts as additional argument.
> Skips tokens that are unique to one or the other file.
> (svn_diff__lcs): Takes token counts and number of different tokens
> as additional arguments. Calculates number of tokens unique to
> each source and uses this to compute effective length difference.
> * subversion/libsvn_diff/diff.h
> (svn_diff__position_t): Changed node member from being
> a pointer to the node to an integer index for the node.
> Updated method declarations.
> * subversion/libsvn_diff/diff.c
> * subversion/libsvn_diff/diff3.c
> * subversion/libsvn_diff/diff4.c
> (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
> number of times each token appears in each source.
> (svn_diff__resolve_conflict): Takes number of different tokens
> as additional argument. Counts the number of times
> each token appears in each (partial) source.
> This patch can reduce the time required for a diff by a large factor
> when comparing long files with large differences. As an extreme case,
> splitting sqlite.c into two roughly equal parts and diffing them against
> each other took about a minute on my laptop with the original code,
> and 7-8 seconds with my patch. The speed-up is gained only if a
> large fraction of the changed tokens are unique to one source or the
> other. I can not see that the change would cause a significant
> performance degradation in any reasonable case.
> Counting the number of times each token appears is also useful for
> potential additional features in the future, such as identifying moved
> blocks outside of the longest common subsequence.
> I have tested only svn_diff_diff_2 (that it gives identical results as
> before), but the changes to svn_diff_diff3 and svn_diff_diff4 are
> completely analogous. Tested on a Windows 7 64-bit machine,
> TortoiseSVN build script.
> Morten Kloster (first patch)
Interesting! The patch seems to be missing though. The list software
tends to strip off attachments if they don't have a text/plain
mime-type. Can you try sending it again with a .txt extension?
Received on 2011-05-18 08:34:47 CEST