diff-optimizations-tokens branch: I think I'm going to abandon it

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 1 Dec 2010 00:25:27 +0100

Hi devs,

As mentioned in [1], I've created two branches to try out two
different approaches for the diff optimizations of prefix/suffix
scanning.

The first one, diff-optimizations-bytes, has a working implementation
of the optimization. It still has some open todo items, but it
basically works.

The second one, diff-optimizations-tokens, takes a more high-level
approach by working in the "token handling layer". It takes the
extracted lines as a whole, and compares them, to scan for identical
prefix and suffix. I preferred this "new" approach, because it seemed
more elegant, and because it works for both diff_file and diff_memory
(property diffs). However, although the token-based prefix scanning
works adequately, I'm now stuck on the suffix scanning.
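
To illustrate the token-level idea, here is a minimal sketch in C. It is
not code from the branch: the function name and the "array of
already-extracted lines" representation are assumptions made for the
example, whereas the branch works through the diff datasource callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Subversion code: count how many leading
   lines (tokens) two already-split inputs have in common.  Each token
   is compared as a whole, which is what makes this a token-level scan
   rather than a byte-level one. */
size_t
count_identical_prefix_lines(const char *const *a, size_t a_len,
                             const char *const *b, size_t b_len)
{
  size_t i = 0;

  assert(a != NULL && b != NULL);

  /* Stop at the first differing line, or when either input runs out. */
  while (i < a_len && i < b_len && strcmp(a[i], b[i]) == 0)
    i++;

  return i;
}
```

The forward direction is easy because lines are naturally extracted
front to back; the suffix case needs the same loop running from the
last line towards the first.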

I am now considering abandoning the token approach, for the following reasons:

1) There is still a lot of work left. Scanning for identical suffix is
quite difficult, because we now have to extract tokens (lines) in
reverse. I've put in place a stub for that function
(datasource_get_previous_token), but that still needs to be
implemented. And that's the hardest part, IMHO.

Not only that, but I just realized that I'll also have to implement a
reverse variant of util.c#svn_diff__normalize_buffer (which contains
the encouraging comment "It only took me forever to get this routine
right,..." (added by ehu in r866123)), and maybe also of token_compare
(not sure).

2) I'm beginning to see that token-based suffix scanning will not be
as fast as byte-based suffix scanning, simply because in the
byte-based case we can completely ignore line structure. We never
have to compare characters with \n or \r, we just keep reading bytes
and comparing them. Token-based suffix scanning has to do that extra
work for every line it extracts, so it carries additional overhead.
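
The difference between the two points above can be sketched in C as
follows. This is an illustration only, under stated assumptions: both
function names are hypothetical, both inputs are assumed to be fully in
memory, and none of this is taken from either branch. The first function
is the byte-based scan, which never looks at line endings. The second
shows the kind of extra per-byte check that extracting a line in reverse
would need, here assuming \n, \r and \r\n as possible terminators.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count identical trailing bytes of two buffers.
   Line structure is ignored entirely; bytes are simply compared from
   the end backwards. */
size_t
count_identical_suffix_bytes(const char *a, size_t a_len,
                             const char *b, size_t b_len)
{
  size_t n = 0;

  assert(a != NULL && b != NULL);

  while (n < a_len && n < b_len
         && a[a_len - 1 - n] == b[b_len - 1 - n])
    n++;

  return n;
}

/* Hypothetical helper: given END, the offset just past a line
   (including its terminator, if any), return the offset where that
   line starts.  Every byte has to be tested against \n and \r, which
   is the overhead the byte-based scan avoids. */
size_t
previous_line_start(const char *buf, size_t end)
{
  size_t p = end;

  assert(buf != NULL);

  /* Step over the terminator of the line itself: \n, \r\n or \r. */
  if (p > 0 && buf[p - 1] == '\n')
    {
      p--;
      if (p > 0 && buf[p - 1] == '\r')
        p--;
    }
  else if (p > 0 && buf[p - 1] == '\r')
    p--;

  /* Walk back until the terminator of the preceding line. */
  while (p > 0 && buf[p - 1] != '\n' && buf[p - 1] != '\r')
    p--;

  return p;
}
```

Note that the byte count returned by the first function can end in the
middle of a line; aligning it to a line boundary before running the
actual diff is left out of this sketch.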

So, unless someone can convince me otherwise, I'm probably going to
stop with the token approach. Because of 2), I don't think it's worth
it to spend the effort needed for 1), especially because the
byte-based approach already works.

Any thoughts?

Cheers,

-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-11/0416.shtml

Received on 2010-12-01 00:26:28 CET
