Diff optimizations progress (or not)

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sat, 20 Nov 2010 22:20:17 +0100

[ adding dev@ to discussion between danielsh and me about
(non-)progress of the diff-optimizations work. ]

On Wed, Nov 17, 2010 at 3:21 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> [ if you want to, you can add dev@ to the CC on replies. ]
>
> Johan Corveleyn wrote on Wed, Nov 17, 2010 at 00:57:53 +0100:
>> I have made some progress locally, but not enough to commit it yet:
>> - The implementation I've sent in my latest patch-mail to the list
>> works well, but needs some final wrap up (to make it work with diff3
>> and diff4). That is: if we decide this implementation is the way to
>> go. (I also made a slight variation by implementing the critical
>> functions (incrementing pointers) with a macro, which made it another
>> 10-15% faster).
>>
>
> As I said on IRC: don't wait for everything to be gift-wrapped before
> you commit. That is not our expectation from you, and it isn't a good
> approach to collaborative development of moderately large changes.
>
> If you have an implementation that works for diff2 and doesn't break
> diff3+diff4, commit it, and extend it from 2-sources to N-sources in
> subsequent commits. If you have something that works and you think can
> be made faster, why not commit it now and then make another commit that
> implements the optimization?

Actually, this approach (the one that eliminates identical
prefix/suffix byte-per-byte, when opening the datasources) was already
almost finished with the latest patch I sent to the list. It was
already extended to handle N datasources. The only thing left to do
was to make diff3+diff4 handle the results in a meaningful way (I
punted on this for now, because I don't understand those algorithms
very well).

But ultimately, I was really doubtful about the low-level approach
that I took, and would really prefer the token-approach if that would
work out equally well (I think it will be better, more readable,
faster, more flexible, ...).

>> - However, as I already hinted in that mail thread: I'm not satisfied
>> with that implementation, and I'm attempting another way in the
>> "token-parsing" layer (working with entire tokens/lines, instead of
>> byte-per-byte). It's this "new way" that I'd like to get working, and
>> commit to the branch.
>>
>
> Committing an approach to a branch doesn't commit you to that approach.
>
> If you have working code (as in: doesn't break anything), which is
> a step on the way to implementing what --- at that point in time ---
> you'd like the branch to implement, and the code is commit-worthy (i.e.,
> reviewable, documented, and consists of a single logical change) ---
> then what reason do you have not to commit it?
>
> If you think the code could be improved later --- or even that a totally
> different approach would be better --- then do investigate those. But
> --- in order to keep others in the loop (which calls for *small* --- and
> thus frequent --- updates), or to checkpoint your progress, or just
> because the investigation is going to take some time --- you'll first
> commit the piece of code that *is* working and *is* an improvement, and
> later see how it can be improved further, than not commit it at all and
> later bomb dev@ with a patch plus a too-big-to-be-reviewable "Here are
> N other approaches that I ruled out".
>
> In short, unlike on trunk, on aptly-named "experimental" branches (from
> "to experiment") you don't have to be certain a patch is 1.8.0-worthy in
> order to commit it.
>
> Iterating in public on dev@ enables the whole process (including the
> decision to scrub, if necessary) to benefit from everyone's inputs, and
> to get those inputs as early as possible. It allows people to chime in
> if they see you take an approach they know won't help, and it allows you
> to turn to the list for advice then you get stuck.
>
>
> [1] You may create as many branches as you need. (You don't need to
> ask permission.) If organizing your code on two branches makes your
> life easier, go for it!

So I have the byte-approach nearly finalized, but I don't like it as
much as the token-approach, which still needs a lot of work :-(. And
they are mutually exclusive (doesn't make sense to implement both), so
I wouldn't like to commit both approaches to the same branch (I would
have to undo the first approach before committing the second one). So
maybe the way to go is to have two branches for the two
implementations ...

I could then commit the byte-approach (which is nearly finished) in
small steps (making it easier to review) in one branch, and the new
token-approach to the other (that is, as soon as I have something
commitable, see below). We can then discuss which of the two
approaches would be preferred...

>> - With the "token-way", I almost have the prefix part working (ran
>> into some problem when "rolling back" the token at which the first
>> difference is encountered (making svn crash); I know how to solve it,
>> but I just need to do it).
>>
>
> You could commit the token-way without the prefix part --- with just
> suffix-scan enabled --- and implement prefix scanning in a subsequent
> commit.
>
>> - The suffix part still needs to be fully implemented (including going
>> backwards through the "tokens").
>>
>
> Does it make sense to commit the patch with neither prefix-scanning nor
> suffix-scanning? i.e., is there any substance left in it without those
> two?

No, there is no suffix-scanning yet, only prefix-scanning which
sometimes crashes :-). I think the first meaningful thing I can commit
is a working implementation of the prefix-scanning (at least something
that compiles and doesn't crash). I have the feeling I'm almost there
...

Cheers,

-- 
Johan

Received on 2010-11-20 22:21:19 CET

This message: [ Message body ]
Next message: Larry McVoy: "svn update select(2)"
Previous message: Daniel Shahaf: "Re: svn commit: r1036978 - in /subversion/trunk/subversion/tests/cmdline: svnsync_tests.py svnsync_tests_data/descend_into_replace.dump"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]