[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 2 Dec 2010 07:43:17 +0200

[ finally getting back to this mail; having slept on it, etc. ]

Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> >> I am now considering to abandon the tokens-approach, for the following reasons:
> >> > ...
> >> >> So, unless someone can convince me otherwise, I'm probably going to
> >> >> stop with the token approach. Because of 2), I don't think it's worth
> >> >> it to spend the effort needed for 1), especially because the
> >> >> byte-based approach already works.
> >> >>
> >> >
> >> > In other words, you're saying that the token-based approach: (b) won't be
> >> > as fast as the bytes-based approach can be, and (a) requires much effort
> >> > to be spent on implementing the reverse reading of tokens?  (i.e.,
> >> > a backwards gets())
> >>
> >> Yes.
> >>
> >> The reverse reading is quite hard (in the diff_file.c case) because of
> >> the chunked reading of files. A line may be split by a "chunk
> >> boundary" (it may even be split in the middle of an eol sequence
> >> (between \r and \n), and it all still needs to be
> >> canonicalized/normalized correctly (that's why we'll also need a
> >> reverse normalization function). The current forward get_next_token
> >> does this very well, and the reverse should be analogous, but I just
> >> find it hard to reason about, and to get it implemented correctly. It
> >> will take me a lot of time, and with a high probability of introducing
> >> subtle bugs for special edge cases.
> >>
> >
> > OK, so a reverse get_next_token() could be tricky to implement.  But,
> > worse, won't having it mean that implementors of svn_diff_fns_t can't
> > have streamy access to their source?  Since they would be required to
> > provide sometimes a token from the start and sometimes a token from the
> > end...
> >
> > Well, it's not streamy in our usual sense of the word, but it's
> > "double-streamy" (no one requests the middle byte until either all
> > bytes before or all bytes after it were transmitted already)
> Oooh, I hadn't considered other implementors besides the internal
> diff_memory.c and diff_file.c.

Just to clarify: diff_file.c and diff_memory.c are the only implementors
of svn_diff_fns_t in the core code, correct?

> You're right, that would impose
> additional constraints on other implementors. I don't know if being
> non-streamy (or less streamy anyway) would be problem ...

We should have asked this before, but:

Do we know who are the other implementors / typical use-cases of

> In my last commit on the -tokens branch, I added a flag "open_at_end"
> to the datasource_open function (function of svn_diff_fns_t), so the
> datasource could be opened at the end. Also, other supporting
> functions were needed: token_pushback_prefix, token_pushback_suffix
> (to "push back" tokens that were read too far when scanning for
> prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
> the high-level idea was:
> 1) Open datasources at the end.
> 2) Scan for identical suffix (getting tokens in reverse).
> 3) At first non-match: pushback suffix token, and note where suffix starts.
> 4) Open datasources at the beginning.
> 5) Scan for identical prefix (getting tokens normally, but without hash).
> 6) At first non-match: pushback prefix token.
> 7) Run the "old algorithm": getting tokens forward, but with hash. The
> get_next_token function should stop (return NULL for token) when it
> arrives at the suffix.
> Sidestep: I just now realized that I probably don't need to have the
> "reverse normalization algorithm" for implementing get_previous_token.
> The call to the normalization function in get_next_token is (I think)
> only needed to be able to calculate the hash. But since
> get_previous_token doesn't need to calculate hashes, I may be able to
> get by without normalization there. I'd only need to normalize inside
> token_compare, and I *think* I can just to that "forwardly", instead
> of backwards. Just thinking out loud here ...

Is "normalization function" the thing that collapses newlines and
whitespaces if -x-w or -x--ignore-eol-style is in effect? If so,
another purpose of calling that might be to make suffixes that are
not bytewise-identical, but only modulo-whitespace-identical, also
be lumped into the "identical suffix".

(Haven't dived into the code yet; the above is based on my understanding
of your high-level descriptions)

> So: that makes the token-approach again a little bit more possible.
> But do we want it? It requires a lot more from implementors of
> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
> optimization to all implementors of svn_diff_fns_t ...

Another option is to add the "read backwards" callbacks to the API, but
designate them as optional. Then you only do the suffix half of the
optimization of both/all sources support the backwards callbacks.

(Is it a good idea? Don't know. But it's one option.)

> >> >> Any thoughts?
> >> >>
> >> >
> >> > -tokens/BRANCH-README mentions one of the advantages of the tokens
> >> > approach being that the comparison is done only after whitespace and
> >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> >> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
> >> > these -x flags?
> >> >
> >> > What is the practical effect of the fact the -bytes approach doesn't
> >> > take advantage of these flags?  If a file (with a moderately long
> >> > history) has had all its newlines changed in rN, then I assume that your
> >> > -bytes optimizations will speed up all the diffs that 'blame' performs
> >> > on that file, except for the single diff between rN and its predecessor?
> >>
> >> Yes, I thought that would be an important advantage of the tokens
> >> approach, but as you suggest: the practical effect will most likely be
> >> limited. Indeed, only this single diff between rN and its predecessor
> >> (which may be 1 in 1000 diffs) will not benifit from
> >> prefix/suffix-optimization. Even if the rate of such changes is like
> >> 1/100th (let's say you change leading tabs to spaces, and vice versa,
> >> every 100 revisions), it will hardly be noticeable.
> >>
> >> The perfectionist in me still thinks: hey, you can also implement this
> >> normalization with the byte-based approach (in a streamy way). But
> >> that will probably be quite hard, because I'd have to take into
> >
> > We can add that additional optimization after the basic -bytes support
> > is working?  No need to implement all possible optimizations in one
> > shot (let alone those of them that might have little relative benefit).
> Ok, that minor optimization can definitely wait.
> So the choice still boils down to:
> 1) A specific implementation inside diff_file.c (byte-based).

Not a bad option, I think. The improvement isn't as useful in
diff_memory.c, since that API isn't likely to be used for large
amounts of data anyway (e.g., trunk uses it only for propery diffs).

> 2) A generic implementation for all implementors of svn_diff_fns_t,
> which still requires some work (get_previous_token), and imposes more
> constraints on svn_diff_fns_t implementors ...

Might also be a not bad option... :-)

> --
> Johan
Received on 2010-12-02 06:45:21 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.