Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Wed, 1 Dec 2010 12:44:33 +0200

Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> I am now considering to abandon the tokens-approach, for the following reasons:
> > ...
> >> So, unless someone can convince me otherwise, I'm probably going to
> >> stop with the token approach. Because of 2), I don't think it's worth
> >> it to spend the effort needed for 1), especially because the
> >> byte-based approach already works.
> >>
> >
> > In other words, you're saying that the token-based approach: (b) won't be
> > as fast as the bytes-based approach can be, and (a) requires much effort
> > to be spent on implementing the reverse reading of tokens? (i.e.,
> > a backwards gets())
>
> Yes.
>
> The reverse reading is quite hard (in the diff_file.c case) because of
> the chunked reading of files. A line may be split by a "chunk
> boundary" (it may even be split in the middle of an eol sequence
> (between \r and \n), and it all still needs to be
> canonicalized/normalized correctly (that's why we'll also need a
> reverse normalization function). The current forward get_next_token
> does this very well, and the reverse should be analogous, but I just
> find it hard to reason about, and to get it implemented correctly. It
> will take me a lot of time, and with a high probability of introducing
> subtle bugs for special edge cases.
>

OK, so a reverse get_next_token() could be tricky to implement. But,
worse, won't having it mean that implementors of svn_diff_fns_t can't
have streamy access to their source? Since they would be required to
provide sometimes a token from the start and sometimes a token from the
end...

Well, it's not streamy in our usual sense of the word, but it's
"double-streamy" (no one requests the middle byte until either all
bytes before or all bytes after it were transmitted already)

> >> Any thoughts?
> >>
> >
> > -tokens/BRANCH-README mentions one of the advantages of the tokens
> > approach being that the comparison is done only after whitespace and
> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> > effect). IIRC, the -bytes approach doesn't currently take advantage of
> > these -x flags?
> >
> > What is the practical effect of the fact the -bytes approach doesn't
> > take advantage of these flags? If a file (with a moderately long
> > history) has had all its newlines changed in rN, then I assume that your
> > -bytes optimizations will speed up all the diffs that 'blame' performs
> > on that file, except for the single diff between rN and its predecessor?
>
> Yes, I thought that would be an important advantage of the tokens
> approach, but as you suggest: the practical effect will most likely be
> limited. Indeed, only this single diff between rN and its predecessor
> (which may be 1 in 1000 diffs) will not benifit from
> prefix/suffix-optimization. Even if the rate of such changes is like
> 1/100th (let's say you change leading tabs to spaces, and vice versa,
> every 100 revisions), it will hardly be noticeable.
>
> The perfectionist in me still thinks: hey, you can also implement this
> normalization with the byte-based approach (in a streamy way). But
> that will probably be quite hard, because I'd have to take into

We can add that additional optimization after the basic -bytes support
is working? No need to implement all possible optimizations in one
shot (let alone those of them that might have little relative benefit).

> account that I might have to read more bytes from one file than from
> the other file (right now I can always simply read a single byte from
> both files). And other than that, it will at least be "algorithm
> duplication": it will be a (streamy) re-implementation of the
> normalization algorithm that's already used in the token layer. So all
> in all it's probably not worth it ...
>
> Cheers,
> --
> Johan
Received on 2010-12-01 11:46:52 CET

This message: [ Message body ]
Next message: Stefan Sperling: "Re: svn commit: r1040831 - in /subversion/trunk/subversion: include/svn_checksum.h libsvn_subr/checksum.c"
Previous message: Stefan Sperling: "Re: svn commit: r1040831 - in /subversion/trunk/subversion: include/svn_checksum.h libsvn_subr/checksum.c"
In reply to: Johan Corveleyn: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"
Next in thread: Johan Corveleyn: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"
Reply: Johan Corveleyn: "Re: diff-optimizations-tokens branch: I think I'm going to abandon it"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]