Re: diff wish

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 15 Jun 2011 14:11:10 +0200

On Wed, Jun 15, 2011 at 1:24 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Wed, 2011-06-15 at 13:16 +0200, Johan Corveleyn wrote:
>> On Wed, Jun 15, 2011 at 12:53 PM, Stefan Sperling <stsp_at_elego.de> wrote:
>> > On Wed, Jun 15, 2011 at 12:34:31PM +0200, Branko Čibej wrote:
>> >> I'd say not to worry about --minimal and --nice and whatnot. Just make
>> >> diff output the sanest, nicest diff it can find. I think it's a bad idea
>> >> to give diff user-visible options that change the output in ways that
>> >> are hard to explain (shuffling lines around, as opposed to, e.g., using
>> >> a completely different diff format).
>> >
>> > +1
>>
>> Certainly we need to pick the best possible default, which satisfies
>> most users most of the time.
>>
>> But I'm not convinced that we should simply drop support for "minimal
>> diffs" when we arrive at the point that we have a "nice" format. A
>> "nice" diff will always be based on heuristics, taking guesses at what
>> should be considered a deletion, an addition, or a common line. It's a
>> matter of interpretation. So there will always be a chance that it
>> guesses wrong, and totally mis-synchronizes. It may be rare, but IMHO
>> it's impossible to completely avoid this.
>>
>> The minimal diff can produce ugly diffs, but there is one certainty:
>> it's always a minimal one.
>
> But so what?  It's only "minimal" according to the current definition of
> "minimal" which is something like "number of lines removed + number of
> lines added".  A "better" solution might have a "better" definition of
> "minimal", maybe involving something like "total number of unique groups
> of characters".

The current implementation is "minimal" according to the *most
natural* definition of "minimal" for diff output, which is still line
based. Ok, you can have another definition of "minimal", but it will
always have its vulnerabilities in certain cases. That's mainly my
point.

It's very tricky to come up with a good specification (or mathematical
definition) of what you consider a "better" diff. And then someone
comes along with an example where your "better" diff produces a very
ugly result (although it will still be "minimal" according to your
"nice" specification). Also, keep in mind that people use Subversion
for lots of things that can be vastly different from source code.

That's why I think it would be very unwise to ever simply drop support
for "standard minimal diffing", something that carries with it at
least 30 years of CS research, and has seen a *huge* number of
applications and a lot of use.

As for what I mean by mis-synchronization ...

On Wed, Jun 15, 2011 at 1:30 PM, Markus Schaber
<m.schaber_at_3s-software.com> wrote:
> If "mis-synchronizes" means that it produces a broken output when applied on the input, then this should be avoided for every price. A "nice" diff must still be a valid diff producing the correct output. But AFAICS this was never questioned.
>
> If you have a different definition of "mis-synchronizes", please explain.

No, I don't mean a broken diff. The diff should at all times be
*correct*. That was indeed never questioned.

I mean something like the example Neels gave with his initial approach
for avoiding the mis-matching empty line problem. With the naive
solution, he gave an example of where it's not nice:

On Tue, Jun 14, 2011 at 5:21 PM, Neels J Hofmeyr <neels_at_elego.de> wrote:
> The adverse effects of that is that any single line change shows any
> following empty lines as also changed:
>
> [[[
> context
> -foo
> -
> -
> +foobar
> +
> +
> Context
> ]]]
>
> and that empty-line-changes also show their preceding non-empty line as
> changed even if it hasn't:
>
> [[[
> context
> -foo
> -
> +foo
> +
> +
> Context
> ]]]

Similarly, with patience diff, one can easily come up with examples
where it produces a result that certainly seems sub-optimal to me.
This example was given in a comment to a blog entry about patience
diff [1]:

[[[
file a

aaaaaa
aaaaaa
bbbbbb
bbbbbb
cccccc
cccccc
abc

file b

abc
aaaaaa
aaaaaa
bbbbbb
bbbbbb
cccccc
cccccc

would the matching result be

-aaaaaa
-aaaaaa
-bbbbbb
-bbbbbb
-cccccc
-cccccc
abc
+aaaaaa
+aaaaaa
+bbbbbb
+bbbbbb
+cccccc
+cccccc
]]]

Ok, this is still "minimal" according to the definition of "patience
diff" (get the maximum number of common lines when only looking at the
lines which are unique on each side). But I think you would agree it's
not a nice diff.
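
To make the mechanics concrete, here is a rough Python sketch of the
patience idea: anchor on lines that occur exactly once on each side,
keep the longest chain of anchors that is in the same order in both
files, and recurse between the anchors. This is a simplified
illustration, not the code of any real tool. The names unique_anchors
and patience_diff are made up for this sketch, and the fallback for
regions without a unique common line (all deletions, then all
additions) is an assumption; real implementations typically fall back
to another diff algorithm there.

```python
from bisect import bisect_left
from collections import Counter


def unique_anchors(a, b):
    """Longest in-order chain of lines that are unique in both a and b."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate pairs (index in a, index in b), already sorted by index in a.
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_b]

    # Longest increasing subsequence on the b index (patience sorting).
    tail_js = []    # smallest last b index of a chain of length k + 1
    tail_idx = []   # position in pairs of that last element
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_js, j)
        if k == len(tail_js):
            tail_js.append(j)
            tail_idx.append(idx)
        else:
            tail_js[k] = j
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(pairs[idx])
        idx = prev[idx]
    return chain[::-1]


def patience_diff(a, b):
    """Simplified patience diff; returns lines prefixed with ' ', '-' or '+'."""
    anchors = unique_anchors(a, b)
    if not anchors:
        # Assumed fallback for this sketch: no unique common line to anchor on.
        return ["-" + x for x in a] + ["+" + x for x in b]
    out = []
    ai = bi = 0
    for i, j in anchors:
        out += patience_diff(a[ai:i], b[bi:j])   # region before the anchor
        out.append(" " + a[i])                   # the anchor itself
        ai, bi = i + 1, j + 1
    out += patience_diff(a[ai:], b[bi:])         # region after the last anchor
    return out


file_a = ["aaaaaa", "aaaaaa", "bbbbbb", "bbbbbb", "cccccc", "cccccc", "abc"]
file_b = ["abc", "aaaaaa", "aaaaaa", "bbbbbb", "bbbbbb", "cccccc", "cccccc"]
print("\n".join(patience_diff(file_a, file_b)))
```

With these two inputs, "abc" is the only line that is unique on both
sides, so it becomes the single anchor and everything else ends up as
six deletions before it and six additions after it.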

The "standard minimal diff" algorithm would produce the more sensible
diff in this case:
[[[
+abc
aaaaaa
aaaaaa
bbbbbb
bbbbbb
cccccc
cccccc
-abc
]]]
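
For comparison, a small Python sketch of line-based minimal diffing via
a longest-common-subsequence table, which minimizes "lines removed +
lines added". It is only an illustration under that definition, not
Subversion's actual implementation; the function name lcs_diff is made
up here, and the quadratic table is chosen for readability rather than
speed.

```python
def lcs_diff(a, b):
    """Line diff with the fewest added + removed lines (LCS-based)."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the top-left, emitting one diff line per step.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(" " + a[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            out.append("-" + a[i])   # dropping a[i] loses no common lines
            i += 1
        else:
            out.append("+" + b[j])   # inserting b[j] keeps more common lines
            j += 1
    out.extend("-" + x for x in a[i:])
    out.extend("+" + x for x in b[j:])
    return out


file_a = ["aaaaaa", "aaaaaa", "bbbbbb", "bbbbbb", "cccccc", "cccccc", "abc"]
file_b = ["abc", "aaaaaa", "aaaaaa", "bbbbbb", "bbbbbb", "cccccc", "cccccc"]
print("\n".join(lcs_diff(file_a, file_b)))
```

On the two files above this keeps the six repeated lines as common and
reports only two changed lines, one added "abc" at the top and one
removed "abc" at the bottom.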

-- 
Johan
[1] http://alfedenzo.livejournal.com/170301.html?thread=195901#t195901
Received on 2011-06-15 14:12:01 CEST

This is an archived mail posted to the Subversion Dev mailing list.
