Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 1 Dec 2010 13:34:48 +0100

On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
>> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
>> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> >> I am now considering to abandon the tokens-approach, for the following reasons:
>> > ...
>> >> So, unless someone can convince me otherwise, I'm probably going to
>> >> stop with the token approach. Because of 2), I don't think it's worth
>> >> it to spend the effort needed for 1), especially because the
>> >> byte-based approach already works.
>> >>
>> >
>> > In other words, you're saying that the token-based approach: (b) won't be
>> > as fast as the bytes-based approach can be, and (a) requires much effort
>> > to be spent on implementing the reverse reading of tokens? (i.e.,
>> > a backwards gets())
>>
>> Yes.
>>
>> The reverse reading is quite hard (in the diff_file.c case) because of
>> the chunked reading of files. A line may be split by a "chunk
>> boundary" (it may even be split in the middle of an eol sequence
>> (between \r and \n), and it all still needs to be
>> canonicalized/normalized correctly (that's why we'll also need a
>> reverse normalization function). The current forward get_next_token
>> does this very well, and the reverse should be analogous, but I just
>> find it hard to reason about, and to get it implemented correctly. It
>> will take me a lot of time, and with a high probability of introducing
>> subtle bugs for special edge cases.
>>
>
> OK, so a reverse get_next_token() could be tricky to implement. But,
> worse, won't having it mean that implementors of svn_diff_fns_t can't
> have streamy access to their source? Since they would be required to
> provide sometimes a token from the start and sometimes a token from the
> end...
>
> Well, it's not streamy in our usual sense of the word, but it's
> "double-streamy" (no one requests the middle byte until either all
> bytes before or all bytes after it were transmitted already)

Oooh, I hadn't considered other implementors besides the internal
diff_memory.c and diff_file.c. You're right, that would impose
additional constraints on other implementors. I don't know if being
non-streamy (or less streamy anyway) would be problem ...

In my last commit on the -tokens branch, I added a flag "open_at_end"
to the datasource_open function (function of svn_diff_fns_t), so the
datasource could be opened at the end. Also, other supporting
functions were needed: token_pushback_prefix, token_pushback_suffix
(to "push back" tokens that were read too far when scanning for
prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
the high-level idea was:

1) Open datasources at the end.

2) Scan for identical suffix (getting tokens in reverse).

3) At first non-match: pushback suffix token, and note where suffix starts.

4) Open datasources at the beginning.

5) Scan for identical prefix (getting tokens normally, but without hash).

6) At first non-match: pushback prefix token.

7) Run the "old algorithm": getting tokens forward, but with hash. The
get_next_token function should stop (return NULL for token) when it
arrives at the suffix.

Sidestep: I just now realized that I probably don't need to have the
"reverse normalization algorithm" for implementing get_previous_token.
The call to the normalization function in get_next_token is (I think)
only needed to be able to calculate the hash. But since
get_previous_token doesn't need to calculate hashes, I may be able to
get by without normalization there. I'd only need to normalize inside
token_compare, and I *think* I can just to that "forwardly", instead
of backwards. Just thinking out loud here ...

So: that makes the token-approach again a little bit more possible.
But do we want it? It requires a lot more from implementors of
svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
optimization to all implementors of svn_diff_fns_t ...

>> >> Any thoughts?
>> >>
>> >
>> > -tokens/BRANCH-README mentions one of the advantages of the tokens
>> > approach being that the comparison is done only after whitespace and
>> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
>> > effect). IIRC, the -bytes approach doesn't currently take advantage of
>> > these -x flags?
>> >
>> > What is the practical effect of the fact the -bytes approach doesn't
>> > take advantage of these flags? If a file (with a moderately long
>> > history) has had all its newlines changed in rN, then I assume that your
>> > -bytes optimizations will speed up all the diffs that 'blame' performs
>> > on that file, except for the single diff between rN and its predecessor?
>>
>> Yes, I thought that would be an important advantage of the tokens
>> approach, but as you suggest: the practical effect will most likely be
>> limited. Indeed, only this single diff between rN and its predecessor
>> (which may be 1 in 1000 diffs) will not benifit from
>> prefix/suffix-optimization. Even if the rate of such changes is like
>> 1/100th (let's say you change leading tabs to spaces, and vice versa,
>> every 100 revisions), it will hardly be noticeable.
>>
>> The perfectionist in me still thinks: hey, you can also implement this
>> normalization with the byte-based approach (in a streamy way). But
>> that will probably be quite hard, because I'd have to take into
>
> We can add that additional optimization after the basic -bytes support
> is working? No need to implement all possible optimizations in one
> shot (let alone those of them that might have little relative benefit).

Ok, that minor optimization can definitely wait.

So the choice still boils down to:
1) A specific implementation inside diff_file.c (byte-based).

2) A generic implementation for all implementors of svn_diff_fns_t,
which still requires some work (get_previous_token), and imposes more
constraints on svn_diff_fns_t implementors ...

-- 
Johan

Received on 2010-12-01 13:35:45 CET

This message: [ Message body ]
Next message: Julian Foad: "Re: svn commit: r1040663 - in /subversion/trunk/subversion/libsvn_wc: wc-queries.sql wc_db.c"
Previous message: Julian Foad: "Re: svn commit: r1040832 - Port a fix for a FSFS packing race"
In reply to: Daniel Shahaf: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"
Next in thread: Daniel Shahaf: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"
Reply: Daniel Shahaf: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]