
Re: diff wish

From: Morten Kloster <morklo_at_gmail.com>
Date: Wed, 15 Jun 2011 17:46:27 +0200

On Wed, Jun 15, 2011 at 1:08 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, Jun 14, 2011 at 5:33 PM, Stefan Sperling <stsp_at_elego.de> wrote:
>> On Tue, Jun 14, 2011 at 05:21:27PM +0200, Neels J Hofmeyr wrote:
>>> Hi Johan,
>>>
>>> it's been a while and I still haven't sent you my diff wish, which we
>>> briefly touched on at the Subversion hackathon.
>
> Hi Neels, thanks for pursuing this further.
>
>>> Here is a fabricated example of why I don't like diff to match empty lines:
>>
>>> A couple of lines get replaced by completely different ones. By matching the
>>> blank line in the middle, it becomes far less readable, IMHO. In my fantasy
>>> dream world, this diff would print:
>>>
>>> [[[
>>> Index: x
>>> ===================================================================
>>> --- x (revision 1)
>>> +++ x (working copy)
>>> @@ -4,11 +4,13 @@
>>>
>>>  void aaa()
>>>  {
>>> -  if (x)
>>> -    do(things);
>>> -
>>> -  if (y)
>>> -    do(stuff);
>>> +  while (x || y)
>>> +  {
>>> +    check(something);
>>> +    notify(stuff);
>>> +
>>> +    try(somethingelse);
>>> +  }
>>>
>>>    bb(b);
>>>  }
>>> ]]]
>
> Yeah, that's certainly a nicer diff for human consumption :-). But
> strictly speaking it's a larger diff (more lines marked as +/-), so
> that makes it less optimal for the current algorithm.
>
> The "minimality" criterion of diff (with the LCS) makes it easy to
> reason about, and makes for a nice and clear mathematical definition
> of the requested diff result. But I agree that it doesn't necessarily
> lead to "good-quality" diffs for human readers.
>
> So: good-quality != minimal, but it's more of a "soft" criterion,
> depends on the language, context, ... what lines are important to the
> user, ...
>

Only for a given definition of "minimal" :-). In computer science, it
makes at least as much sense to let minimal refer to the amount of
information needed to encode the diff. With that definition, matching
a common line saves less information than matching an uncommon one
(with unique lines being worth the most), and very common lines are
only worth matching if the surrounding lines also match. A diff that is
minimal in that sense would also be of high quality from a human
perspective. The downside is that finding such a diff is much harder
(and the precise definition of "minimal" depends on the encoding used,
while the optimal encoding in turn depends on the statistics of the
data we want diffed... so, yeah, it gets messy).
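
To make the idea concrete, here is a rough sketch (not anything that exists
in libsvn_diff). It assumes a deliberately crude encoding: matching a line
saves -log2 of its relative frequency in the two files, and every separate
run of matched lines costs a fixed number of bits. The names
info_diff_matches and run_cost, and the value 3.0, are made up for
illustration only.

```python
import math
from collections import Counter

def line_weights(a, b):
    """Bits saved by matching a line: -log2 of its relative frequency in both files."""
    counts = Counter(a) + Counter(b)
    total = sum(counts.values())
    return {line: -math.log2(n / total) for line, n in counts.items()}

def info_diff_matches(a, b, run_cost=3.0):
    """Return (i, j) index pairs maximizing saved bits minus run_cost per matched run.

    run_cost is an assumed, arbitrary price for starting a new run of matches;
    with run_cost=0.0 every match has positive weight, so on the example below
    the result is an ordinary longest common subsequence.
    """
    w = line_weights(a, b)
    n, m = len(a), len(b)
    # best[i][j][s]: best score for a[i:] and b[j:]; s is 1 if the previous
    # step was a match (so a match here extends the current run for free).
    best = [[[0.0, 0.0] for _ in range(m + 1)] for _ in range(n + 1)]
    move = [[[None, None] for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            for s in (0, 1):
                options = [(best[i + 1][j][0], "skip_a"),
                           (best[i][j + 1][0], "skip_b")]
                if a[i] == b[j]:
                    gain = w[a[i]] - (0.0 if s else run_cost)
                    options.append((gain + best[i + 1][j + 1][1], "match"))
                best[i][j][s], move[i][j][s] = max(options)
    # Walk forward through the recorded moves to collect the matched pairs.
    pairs, i, j, s = [], 0, 0, 0
    while i < n and j < m:
        step = move[i][j][s]
        if step == "match":
            pairs.append((i, j))
            i, j, s = i + 1, j + 1, 1
        elif step == "skip_a":
            i, s = i + 1, 0
        else:
            j, s = j + 1, 0
    return pairs

# The function body from the example quoted earlier in this thread.
old = ["void aaa()", "{", "  if (x)", "    do(things);", "",
       "  if (y)", "    do(stuff);", "", "  bb(b);", "}"]
new = ["void aaa()", "{", "  while (x || y)", "  {", "    check(something);",
       "    notify(stuff);", "", "    try(somethingelse);", "  }", "",
       "  bb(b);", "}"]

plain = info_diff_matches(old, new, run_cost=0.0)
weighted = info_diff_matches(old, new, run_cost=3.0)
```

With run_cost=0.0 the middle blank line (index 4 in old, 6 in new) is
matched, as in a plain LCS diff. With the assumed run_cost of 3.0 an
isolated blank line saves fewer bits than a new run costs, so it is left
unmatched, while the blank line directly before "bb(b);" still matches
because it extends a run. A real encoding would need far better statistics
than this.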

> Introducing heuristics in one form or another is probably the only way
> to improve this. I'm not an expert in this area myself (I'm actually
> more of a theoretical mathematician, so I'm naturally skeptical of
> anything without a formal proof :-)). But I have also read some good
> things about patience diff, like Stefan suggested ...
>
>> Do you know about patience diff?
>> http://bramcohen.livejournal.com/73318.html
>> I think we should try teaching this algorithm to svn diff at some point.
>> It's a lot more generic than just checking for empty lines and should
>> yield the results you want.
>
[]
>
> Intuitively, I'd say: let's look into patience diff (or a variation
> thereof), because it's already being used in several (D)VCS'es, so it
> has already had a lot of exposure. But that's not really a strong
> argument :-). Maybe another approach is easier to implement in
> libsvn_diff, and yields equally good or even better results ... I
> don't know.
>

Actually, patience diff doesn't solve this issue at all - once it has found
an optimal match for the unique lines, it then performs regular minimal
matches on the remaining sections, so it is about as likely as our
current diff to generate spurious matches of blank lines when there are
large sections of non-matching code (or it can find spurious matches of
unique lines, which can mess things up even worse).
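
A simplified sketch of that behaviour, for illustration only: it anchors on
lines that occur exactly once on both sides, keeps the longest increasing
run of anchors, recurses between them, and falls back to a plain LCS where
no anchors are left. It omits the prefix/suffix trimming of the original
description, and the function names are invented here; this is not code
from any actual VCS.

```python
from bisect import bisect_left
from collections import Counter

def plain_lcs(a, b):
    """Index pairs of a longest common subsequence (simple O(n*m) table)."""
    n, m = len(a), len(b)
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def longest_increasing(pairs):
    """Longest subsequence of (i, j) pairs, sorted by i, with increasing j."""
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for k, (_, j) in enumerate(pairs):
        pos = bisect_left(tails, j)          # pile this pair lands on
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(k)
        else:
            tails[pos] = j
            tail_idx[pos] = k
        prev[k] = tail_idx[pos - 1] if pos > 0 else None
    out = []
    k = tail_idx[-1] if tail_idx else None
    while k is not None:
        out.append(pairs[k])
        k = prev[k]
    return out[::-1]

def patience_matches(a, b, a0=0, b0=0):
    """Anchor on lines unique to both sides, recurse, else fall back to LCS."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b)}
    anchors = longest_increasing(
        [(i, pos_b[line]) for i, line in enumerate(a)
         if ca[line] == 1 and cb[line] == 1])
    if not anchors:
        return [(a0 + i, b0 + j) for i, j in plain_lcs(a, b)]
    result, pi, pj = [], 0, 0
    for i, j in anchors + [(len(a), len(b))]:   # sentinel closes the last gap
        result += patience_matches(a[pi:i], b[pj:j], a0 + pi, b0 + pj)
        if i < len(a):
            result.append((a0 + i, b0 + j))
        pi, pj = i + 1, j + 1
    return result

# The function body from the example quoted earlier in this thread.
old = ["void aaa()", "{", "  if (x)", "    do(things);", "",
       "  if (y)", "    do(stuff);", "", "  bb(b);", "}"]
new = ["void aaa()", "{", "  while (x || y)", "  {", "    check(something);",
       "    notify(stuff);", "", "    try(somethingelse);", "  }", "",
       "  bb(b);", "}"]

matches = patience_matches(old, new)
```

On this input the unique lines "void aaa()", "{", "  bb(b);" and "}" become
anchors. The section between "{" and "  bb(b);" has no line that is unique
on both sides, so the fallback LCS runs there and pairs up the blank lines,
including the middle one (index 4 in old, 6 in new).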

> []

---
Morten
Received on 2011-06-15 17:47:00 CEST

This is an archived mail posted to the Subversion Dev mailing list.
