[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Morten Kloster <morklo_at_gmail.com>
Date: Sun, 29 May 2011 20:02:27 +0200

On Sun, May 29, 2011 at 6:17 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Sun, May 29, 2011 at 6:00 PM, Bert Huijben <bert_at_qqmail.nl> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Morten Kloster [mailto:morklo_at_gmail.com]
>>> Sent: zondag 29 mei 2011 17:35
>>> To: Julian Foad
>>> Cc: Mark Phippard; dev_at_subversion.apache.org
>>> Subject: Re: [PATCH] Speed-up of libsvn_diff using token counts
>>>
>>> On Fri, May 27, 2011 at 7:57 PM, Julian Foad <julian.foad_at_wandisco.com>
>>> wrote:
>>> > Morten Kloster wrote:
>>> >> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <julian.foad_at_wandisco.com>
>>> wrote:
>>> >> > Morten Kloster wrote:
>>> >> >> I haven't changed the index/count types yet. What's the right type
>>> >> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>>> >> >> on 64-bit machines?
>>> >> >
>>> >> > "int"?
>>> >>
>>> >> Is int guaranteed to correspond to (or be larger than) single-process
>>> >> addressable space?
>>> >
>>> > No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
>>> > pointers.
>>> >
>>> > But do you really need to guarantee that svn's text-diff will cope with
>>> > more than 2 billion lines on such a system?  Personally I don't think
>>> > so.  If you think that is important, you can use "long int".
>>
>> long int isn't guaranteed to be 64 bit either. (E.g. on Windows 64 long int
>> is just 32 bit)
>>
>> You need something like intptr_t if you want an integer type of pointer
>> size.
>>
>> But threating more than 4 billion lines of text as just every other textfile
>> doesn't seem necessary to me.
>>
>>        Bert
>>
>>
>
> And I forgot the attachment, as well... Ok, trying again, using intptr_t this
> time.
>
>
> Morten
>

Bert informed me that intptr_t might not be work on all systems, so I've
made a new typedef svn_diff__token_index_t and set it to long int for now.

[[[
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_node_count): New method.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses token_index (int), not node pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
  (svn_diff__token_index_t): New type for token indices/counts
  (svn_diff__position_t): Replaced node pointer member
   by an integer token_index.
  Updated and added new method declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.
  (svn_diff__get_token_counts): New method.
]]]

Received on 2011-05-29 20:02:58 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.