[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Morten Kloster <morklo_at_gmail.com>
Date: Sun, 29 May 2011 17:35:12 +0200

On Fri, May 27, 2011 at 7:57 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> Morten Kloster wrote:
>> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> > Morten Kloster wrote:
>> >> I haven't changed the index/count types yet. What's the right type
>> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>> >> on 64-bit machines?
>> >
>> > "int"?
>> Is int guaranteed to correspond to (or be larger than) single-process
>> addressable space?
> No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
> pointers.
> But do you really need to guarantee that svn's text-diff will cope with
> more than 2 billion lines on such a system?  Personally I don't think
> so.  If you think that is important, you can use "long int".
> - Julian

Here's an updated version which uses int. I have also moved the
token counting to a separate function (as it was repeated in
many places), resolved conflicts from the revision 1128857, and
renamed node to token_index in svn_diff__position_t.

I've checked that this new version passes all libsvn_diff tests
and gives identical results to current HEAD for all diffs
between any of the first 100 revisions of merge.c in libsvn_client.

To my knowledge, this version should be ready for commit. It will
need minor conflict resolution to fit with the fp argument patch.

Updated log message:
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_node_count): New method.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses token_index (int), not node pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
  (svn_diff__position_t): Replaced node pointer member
   by an integer token_index.
  Updated and added new method declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.
  (svn_diff__get_token_counts): New method.

Received on 2011-05-29 17:35:47 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.