On Sun, May 29, 2011 at 6:17 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Sun, May 29, 2011 at 6:00 PM, Bert Huijben <bert_at_qqmail.nl> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Morten Kloster [mailto:morklo_at_gmail.com]
>>> Sent: zondag 29 mei 2011 17:35
>>> To: Julian Foad
>>> Cc: Mark Phippard; dev_at_subversion.apache.org
>>> Subject: Re: [PATCH] Speed-up of libsvn_diff using token counts
>>>
>>> On Fri, May 27, 2011 at 7:57 PM, Julian Foad <julian.foad_at_wandisco.com>
>>> wrote:
>>> > Morten Kloster wrote:
>>> >> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <julian.foad_at_wandisco.com>
>>> wrote:
>>> >> > Morten Kloster wrote:
>>> >> >> I haven't changed the index/count types yet. What's the right type
>>> >> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>>> >> >> on 64-bit machines?
>>> >> >
>>> >> > "int"?
>>> >>
>>> >> Is int guaranteed to correspond to (or be larger than) single-process
>>> >> addressable space?
>>> >
>>> > No. Some 64-bit platforms use 32-bit ints by default and 64-bit
>>> > pointers.
>>> >
>>> > But do you really need to guarantee that svn's text-diff will cope with
>>> > more than 2 billion lines on such a system? Personally I don't think
>>> > so. If you think that is important, you can use "long int".
>>
>> long int isn't guaranteed to be 64 bit either. (E.g. on Windows 64 long int
>> is just 32 bit)
>>
>> You need something like intptr_t if you want an integer type of pointer
>> size.
>>
>> But threating more than 4 billion lines of text as just every other textfile
>> doesn't seem necessary to me.
>>
>> Bert
>>
>>
>
> And I forgot the attachment, as well... Ok, trying again, using intptr_t this
> time.
>
>
> Morten
>
Bert informed me that intptr_t might not be work on all systems, so I've
made a new typedef svn_diff__token_index_t and set it to long int for now.
[[[
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.
* subversion/libsvn_diff/token.c
(svn_diff__node_t): Added 'index' member.
(svn_diff__tree_t): Added 'node_count' member.
(svn_diff__get_node_count): New method.
(tree_insert_token): Assigns indices to nodes.
(svn_diff__get_tokens): Uses token_index (int), not node pointer,
to identify node in position struct.
* subversion/libsvn_diff/lcs.c
(svn_diff__snake): Takes token counts as additional argument.
Skips tokens that are unique to one or the other file.
(svn_diff__lcs): Takes token counts and number of different tokens
as additional arguments. Calculates number of tokens unique to
each source and uses this to compute effective length difference.
* subversion/libsvn_diff/diff.h
(svn_diff__token_index_t): New type for token indices/counts
(svn_diff__position_t): Replaced node pointer member
by an integer token_index.
Updated and added new method declarations.
* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
(svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
number of times each token appears in each source.
(svn_diff__resolve_conflict): Takes number of different tokens
as additional argument. Counts the number of times
each token appears in each (partial) source.
(svn_diff__get_token_counts): New method.
]]]
Received on 2011-05-29 20:02:58 CEST