
Re: diff-optimizations-tokens branch: I think I'm going to abandon it

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Thu, 2 Dec 2010 14:59:23 +0100

On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> [ finally getting back to this mail; having slept on it, etc. ]
>
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
>> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
>> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
>> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
>> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> >> >> I am now considering abandoning the tokens approach, for the following reasons:
>> >> > ...
>> >> >> So, unless someone can convince me otherwise, I'm probably going to
>> >> >> stop with the token approach. Because of 2), I don't think it's worth
>> >> >> it to spend the effort needed for 1), especially because the
>> >> >> byte-based approach already works.
>> >> >>
>> >> >
>> >> > In other words, you're saying that the token-based approach: (b) won't be
>> >> > as fast as the bytes-based approach can be, and (a) requires much effort
>> >> > to be spent on implementing the reverse reading of tokens?  (i.e.,
>> >> > a backwards gets())
>> >>
>> >> Yes.
>> >>
>> >> The reverse reading is quite hard (in the diff_file.c case) because of
>> >> the chunked reading of files. A line may be split by a "chunk
>> >> boundary" (it may even be split in the middle of an eol sequence,
>> >> between \r and \n), and it all still needs to be
>> >> canonicalized/normalized correctly (that's why we'll also need a
>> >> reverse normalization function). The current forward get_next_token
>> >> does this very well, and the reverse should be analogous, but I just
>> >> find it hard to reason about, and to get it implemented correctly. It
>> >> will take me a lot of time, and with a high probability of introducing
>> >> subtle bugs for special edge cases.
>> >>
>> >
>> > OK, so a reverse get_next_token() could be tricky to implement.  But,
>> > worse, won't having it mean that implementors of svn_diff_fns_t can't
>> > have streamy access to their source?  Since they would be required to
>> > provide sometimes a token from the start and sometimes a token from the
>> > end...
>> >
>> > Well, it's not streamy in our usual sense of the word, but it's
>> > "double-streamy" (no one requests the middle byte until either all
>> > bytes before or all bytes after it were transmitted already)
>>
>> Oooh, I hadn't considered other implementors besides the internal
>> diff_memory.c and diff_file.c.
>
> Just to clarify: diff_file.c and diff_memory.c are the only implementors
> of svn_diff_fns_t in the core code, correct?

Yes.

>> You're right, that would impose
>> additional constraints on other implementors. I don't know if being
>> non-streamy (or less streamy anyway) would be a problem ...
>>
>
> We should have asked this before, but:
>
> Do we know who are the other implementors / typical use-cases of
> svn_diff_fns_t?

Yeah, was wondering about this too. Are there in fact other
implementors? Maybe plugins for IDEs or something? How could we find
out? How "public" is this API in fact?

For blame, I know all revisions are first converted to full-texts,
which are written out to temp files. Then diff_file.c works on those
two files.

>> In my last commit on the -tokens branch, I added a flag "open_at_end"
>> to the datasource_open function (function of svn_diff_fns_t), so the
>> datasource could be opened at the end. Also, other supporting
>> functions were needed: token_pushback_prefix, token_pushback_suffix
>> (to "push back" tokens that were read too far when scanning for
>> prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
>> the high-level idea was:
>>
>> 1) Open datasources at the end.
>>
>> 2) Scan for identical suffix (getting tokens in reverse).
>>
>> 3) At first non-match: pushback suffix token, and note where suffix starts.
>>
>> 4) Open datasources at the beginning.
>>
>> 5) Scan for identical prefix (getting tokens normally, but without hash).
>>
>> 6) At first non-match: pushback prefix token.
>>
>> 7) Run the "old algorithm": getting tokens forward, but with hash. The
>> get_next_token function should stop (return NULL for token) when it
>> arrives at the suffix.
>>
>>
>> Sidestep: I just now realized that I probably don't need to have the
>> "reverse normalization algorithm" for implementing get_previous_token.
>> The call to the normalization function in get_next_token is (I think)
>> only needed to be able to calculate the hash. But since
>> get_previous_token doesn't need to calculate hashes, I may be able to
>> get by without normalization there. I'd only need to normalize inside
>> token_compare, and I *think* I can just do that "forwardly", instead
>> of backwards. Just thinking out loud here ...
>>
>
> Is "normalization function" the thing that collapses newlines and
> whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
> another purpose of calling that might be to make suffixes that are
> not bytewise-identical, but only modulo-whitespace-identical, also
> be lumped into the "identical suffix".
>
> (Haven't dived into the code yet; the above is based on my understanding
> of your high-level descriptions)

Yes, it's svn_diff__normalize_buffer in libsvn_diff/util.c. I don't
fully understand your suggestion above. The normalize function
actually transforms a given buffer by normalizing all characters
according to the ignore options (so if -x-w is given, all whitespace
is simply skipped, or with -x-b all whitespace is transformed into a
single space, etc.). Are you suggesting that I don't have to transform
anything, but merely have to compare them "modulo-ignore-stuff", just
to know whether they can be skipped, and otherwise not touch them?

In diff_file.c the transformation actually uses the same source buffer
also as target buffer (I *think* that's for optimization reasons: no
need to memcopy too much; but I'm not sure, haven't studied that in
too much detail). That actually caused me some headaches with the
-tokens approach, because get_next_token actually changes the buffer.
That's why I had to write a slightly more sophisticated
token_pushback function (I couldn't simply roll back the pointers,
because the next time get_next_token was called, it would not
calculate the raw_length correctly, and might stop reading too
early; so I wrote it to remember the pushed-back token in a linked
list, so its information could be used by the next get_next_token).

In diff_memory.c, the transformation is done to another buffer (the
size of which is first calculated to be big enough to hold the largest
token of the list).

>> So: that makes the token approach a little more feasible again.
>> But do we want it? It requires a lot more from implementors of
>> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
>> optimization to all implementors of svn_diff_fns_t ...
>>
>
> Another option is to add the "read backwards" callbacks to the API, but
> designate them as optional.  Then you only do the suffix half of the
> optimization if both/all sources support the backwards callbacks.
>
> (Is it a good idea?  Don't know.  But it's one option.)

Hm, maybe.

There is another problem though: prefix scanning uses the existing
callback "datasource_get_next_token". But it doesn't need the hash
that's calculated in there (I think the biggest speedup is coming from
not having to calculate the hash when we're comparing prefix/suffix
lines). So I've simply changed the implementation (both in diff_file.c
and diff_memory.c) to handle NULL as the argument for **hash by not
calculating the hash. Existing implementations will probably not
handle that, and will probably crash (or otherwise will calculate the
hash and drop it on the floor, providing no performance benefit).

Maybe it's better to have an explicit boolean flag "calculate_hash",
instead of passing NULL, but either way it will not be backwards
compatible with old implementors.

>> >> >> Any thoughts?
>> >> >>
>> >> >
>> >> > -tokens/BRANCH-README mentions one of the advantages of the tokens
>> >> > approach being that the comparison is done only after whitespace and
>> >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
>> >> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
>> >> > these -x flags?
>> >> >
>> >> > What is the practical effect of the fact the -bytes approach doesn't
>> >> > take advantage of these flags?  If a file (with a moderately long
>> >> > history) has had all its newlines changed in rN, then I assume that your
>> >> > -bytes optimizations will speed up all the diffs that 'blame' performs
>> >> > on that file, except for the single diff between rN and its predecessor?
>> >>
>> >> Yes, I thought that would be an important advantage of the tokens
>> >> approach, but as you suggest: the practical effect will most likely be
>> >> limited. Indeed, only this single diff between rN and its predecessor
>> >> (which may be 1 in 1000 diffs) will not benefit from
>> >> prefix/suffix-optimization. Even if the rate of such changes is like
>> >> 1/100th (let's say you change leading tabs to spaces, and vice versa,
>> >> every 100 revisions), it will hardly be noticeable.
>> >>
>> >> The perfectionist in me still thinks: hey, you can also implement this
>> >> normalization with the byte-based approach (in a streamy way). But
>> >> that will probably be quite hard, because I'd have to take into
>> >
>> > We can add that additional optimization after the basic -bytes support
>> > is working?  No need to implement all possible optimizations in one
>> > shot (let alone those of them that might have little relative benefit).
>>
>> Ok, that minor optimization can definitely wait.
>>
>>
>> So the choice still boils down to:
>> 1) A specific implementation inside diff_file.c (byte-based).
>>
>
> Not a bad option, I think.  The improvement isn't as useful in
> diff_memory.c, since that API isn't likely to be used for large
> amounts of data anyway (e.g., trunk uses it only for property diffs).

Indeed. I don't know if property diffs will ever be large enough (and
in large enough quantities for a single operation) to have a
measurable difference. Maybe with large svn:mergeinfo properties,
being diffed 100's or 1000's of times during a single merge operation?
I have no idea.

>> 2) A generic implementation for all implementors of svn_diff_fns_t,
>> which still requires some work (get_previous_token), and imposes more
>> constraints on svn_diff_fns_t implementors ...
>>
>
> Might also be a not bad option... :-)

Cheers,

-- 
Johan
Received on 2010-12-02 15:00:19 CET

This is an archived mail posted to the Subversion Dev mailing list.
