On Fri, May 27, 2011 at 7:57 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> Morten Kloster wrote:
>> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> > Morten Kloster wrote:
>> >> I haven't changed the index/count types yet. What's the right type
>> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>> >> on 64-bit machines?
>> > "int"?
>> Is int guaranteed to correspond to (or be larger than) single-process
>> addressable space?
> No. Some 64-bit platforms use 32-bit ints by default and 64-bit
> But do you really need to guarantee that svn's text-diff will cope with
> more than 2 billion lines on such a system? Personally I don't think
> so. If you think that is important, you can use "long int".
> - Julian
Here's an updated version which uses int. I have also moved the
token counting to a separate function (as it was repeated in
many places), resolved conflicts from the revision 1128857, and
renamed node to token_index in svn_diff__position_t.
I've checked that this new version passes all libsvn_diff tests
and gives identical results to current HEAD for all diffs
between any of the first 100 revisions of merge.c in libsvn_client.
To my knowledge, this version should be ready for commit. It will
need minor conflict resolution to fit with the fp argument patch.
Updated log message:
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.
(svn_diff__node_t): Added 'index' member.
(svn_diff__tree_t): Added 'node_count' member.
(svn_diff__get_node_count): New method.
(tree_insert_token): Assigns indices to nodes.
(svn_diff__get_tokens): Uses token_index (int), not node pointer,
to identify node in position struct.
(svn_diff__snake): Takes token counts as additional argument.
Skips tokens that are unique to one or the other file.
(svn_diff__lcs): Takes token counts and number of different tokens
as additional arguments. Calculates number of tokens unique to
each source and uses this to compute effective length difference.
(svn_diff__position_t): Replaced node pointer member
by an integer token_index.
Updated and added new method declarations.
(svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
number of times each token appears in each source.
(svn_diff__resolve_conflict): Takes number of different tokens
as additional argument. Counts the number of times
each token appears in each (partial) source.
(svn_diff__get_token_counts): New method.
Received on 2011-05-29 17:35:47 CEST