On Wed, May 18, 2011 at 8:33 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Wed, May 18, 2011 at 1:56 AM, Morten Kloster <morklo_at_gmail.com> wrote:
>> Log message:
>>
>> Speed-up of libsvn_diff using token counts
>> By using indices, not node pointers, to refer to tokens, and counting
>> the number of each token, the longest common subsequence (lcs)
>> algorithm at the heart of libsvn_diff becomes much faster in many
>> situations by skipping tokens that are unique to one source or the other.
>>
>> * subversion/libsvn_diff/token.c
>> (svn_diff__node_t): Added 'index' member.
>> (svn_diff__tree_t): Added 'node_count' member.
>> (svn_diff__get_token_count): New method.
>> (tree_insert_token): Assigns indices to nodes.
>> (svn_diff__get_tokens): Uses index, not pointer,
>> to identify node in position struct.
>>
>> * subversion/libsvn_diff/lcs.c
>> (svn_diff__snake): Takes token counts as additional argument.
>> Skips tokens that are unique to one or the other file.
>> (svn_diff__lcs): Takes token counts and number of different tokens
>> as additional arguments. Calculates number of tokens unique to
>> each source and uses this to compute effective length difference.
>>
>> * subversion/libsvn_diff/diff.h
>> (svn_diff__position_t): Changed node member from being
>> a pointer to the node to an integer index for the node.
>> Updated method declarations.
>>
>> * subversion/libsvn_diff/diff.c
>> * subversion/libsvn_diff/diff3.c
>> * subversion/libsvn_diff/diff4.c
>> (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
>> number of times each token appears in each source.
>> (svn_diff__resolve_conflict): Takes number of different tokens
>> as additional argument. Counts the number of times
>> each token appears in each (partial) source.
>>
>>
>>
>>
>>
>>
>> Comments:
>> This patch can reduce the time required for a diff by a large factor
>> when comparing long files with large differences. As an extreme case,
>> splitting sqlite.c into two roughly equal parts and diffing them against
>> each other took about a minute on my laptop with the original code,
>> and 7-8 seconds with my patch. The speed-up is gained only if a
>> large fraction of the changed tokens are unique to one source or the
>> other. I can not see that the change would cause a significant
>> performance degradation in any reasonable case.
>>
>> Counting the number of times each token appears is also useful for
>> potential additional features in the future, such as identifying moved
>> blocks outside of the longest common subsequence.
>>
>> I have tested only svn_diff_diff_2 (that it gives identical results as
>> before), but the changes to svn_diff_diff3 and svn_diff_diff4 are
>> completely analogous. Tested on a Windows 7 64-bit machine,
>> TortoiseSVN build script.
>>
>>
>> Morten Kloster (first patch)
>
> Interesting! The patch seems to be missing though. The list software
> tends to strip off attachments if they don't have a text/plain
> mime-type. Can you try sending it again with a .txt extension?
>
> --
> Johan
>
Sorry, here's the patch with .txt extension. I found a bug in the
first version anyway, so it was just as well. :)
And of course I meant sqlite3.c for the stress test.
Morten
Received on 2011-05-18 10:31:45 CEST