[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 18 May 2011 08:33:51 +0200

On Wed, May 18, 2011 at 1:56 AM, Morten Kloster <morklo_at_gmail.com> wrote:
> Log message:
>
> Speed-up of libsvn_diff using token counts
> By using indices, not node pointers, to refer to tokens, and counting
> the number of each token, the longest common subsequence (lcs)
> algorithm at the heart of libsvn_diff becomes much faster in many
> situations by skipping tokens that are unique to one source or the other.
>
> * subversion/libsvn_diff/token.c
>  (svn_diff__node_t): Added 'index' member.
>  (svn_diff__tree_t): Added 'node_count' member.
>  (svn_diff__get_token_count): New method.
>  (tree_insert_token): Assigns indices to nodes.
>  (svn_diff__get_tokens): Uses index, not pointer,
>   to identify node in position struct.
>
> * subversion/libsvn_diff/lcs.c
>  (svn_diff__snake): Takes token counts as additional argument.
>   Skips tokens that are unique to one or the other file.
>  (svn_diff__lcs): Takes token counts and number of different tokens
>   as additional arguments. Calculates number of tokens unique to
>   each source and uses this to compute effective length difference.
>
> * subversion/libsvn_diff/diff.h
>  (svn_diff__position_t): Changed node member from being
>   a pointer to the node to an integer index for the node.
>  Updated method declarations.
>
> * subversion/libsvn_diff/diff.c
> * subversion/libsvn_diff/diff3.c
> * subversion/libsvn_diff/diff4.c
>  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
>   number of times each token appears in each source.
>  (svn_diff__resolve_conflict): Takes number of different tokens
>   as additional argument. Counts the number of times
>   each token appears in each (partial) source.
>
>
>
>
>
>
> Comments:
> This patch can reduce the time required for a diff by a large factor
> when comparing long files with large differences. As an extreme case,
> splitting sqlite.c into two roughly equal parts and diffing them against
> each other took about a minute on my laptop with the original code,
> and 7-8 seconds with my patch. The speed-up is gained only if a
> large fraction of the changed tokens are unique to one source or the
> other. I can not see that the change would cause a significant
> performance degradation in any reasonable case.
>
> Counting the number of times each token appears is also useful for
> potential additional features in the future, such as identifying moved
> blocks outside of the longest common subsequence.
>
> I have tested only svn_diff_diff_2 (that it gives identical results as
> before), but the changes to svn_diff_diff3 and svn_diff_diff4 are
> completely analogous. Tested on a Windows 7 64-bit machine,
> TortoiseSVN build script.
>
>
> Morten Kloster (first patch)

Interesting! The patch seems to be missing though. The list software
tends to strip off attachments if they don't have a text/plain
mime-type. Can you try sending it again with a .txt extension?

-- 
Johan
Received on 2011-05-18 08:34:47 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.