Re: Future direction for the diff editor

From: Neels J Hofmeyr <neels_at_elego.de>
Date: Wed, 28 Apr 2010 01:01:46 +0200

Daniel Näslund wrote:
> Hi!
>
> First, I've been accepted as a GSoC student for the summer of 2010. I'm
> really excited and look forward to a summer of coding.
>
> I'm supposed to implement the git unidiff format for 'svn diff' and
> 'svn patch' and I'll start with the diff side. The git unidiff format
> can represent tree changes but unfortunately the diff code in its
> current state makes it hard to detect those tree changes.
>
> What to do?
> ------------
> 1) Just allow wc-wc diffs to create diffs with the git format. Use the
> available wc functions to retrieve info on tree changes.
> 2) Allow diffs involving the repos too by creating special ra functions
> for retrieving the missing information. Something like
> svn_ra_get_copyfrom_info().
> 3) Start revamping the diff code to not use an editor but instead return
> text-modified and props-modified nodes as detected by the server.
> [1]. In the mail, Greg makes a case for not using an editor in the
> diff code since nothing is modified. As I've understood it, an editor
> is used for almost all repos communication. I see the complexity
> involved in using an editor and I understand that sharing the same
> code for merge and diff has drawbacks. But I'm not seeing how we will
> decrease complexity by not using an editor. We'll still have to
> detect all those tree changes and we'll have to create additional
> code for doing it. If we just had to check for
> text-modified and props-modified, things would be different.

I think it makes sense to reuse the same API. The following section is
slightly off-topic to this thread. I'm leaping into the lands of editor-v2 theory.

<brain-dump subject="editor-v2">
For example, let's compare merge and diff.
  svn merge -rA:B ^/branch
and
  svn diff -rA:B ^/branch
Both want to get the exact same information from the repos. But merge wants
to apply that to the working copy, and diff wants to print it out. (As of
this thread, diff is also interested in changes on the tree level.)
Furthermore, if, say, a node of the working copy is BASEd on rA, svn update
also wants to get the same information from the repos as diff wants; simplified:
  At revision A.
  svn update -rB

But the API must be suitable for reuse. I don't remember in detail, but a
long time ago I took a closer look at the diff and merge code, and it had a
fair amount of code that had grown up around the shortcomings of the diff
editor. Not good.

My romantic view is that with editor v2, we can have/mold an API that is
easy to adapt to the different tasks of update, merge, diff, switch... With
explicit, "atomic" replaces and moves in the API, we can shed the grown code
for the benefit of a firm API definition. The worst specimens of madness in
diff and merge that I've stumbled over would be gone with ev2. As we ripple
through the code, replacing the old editor with a more concise new one,
things become a lot easier, and code becomes shorter. Wishful thinking, of
course.

But I think it is possible and desirable to think of editor v2 as a hammer
and see nails in everything that involves getting the tree/text difference
between two subtrees/revisions. I see generalized drivers; one that can
generate editor calls by comparing a WC state to a repos state, one that can
compare two repos states, one that can compare two WC states, ... And then
the different subcommands "simply" implement their callbacks of choice and
take care to ask the right driver on the right revisions. E.g., the diff
callbacks, once implemented, could provide arbitrary diffs simply by asking
a different generalised driver type to generate callback calls (I'm thinking
specifically of wc-actual-against-any-url diffs).
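
To make the driver/receiver split concrete, here is a minimal Python sketch.
Everything in it is hypothetical: Tree, Receiver, drive(), SummarizeReceiver
and ApplyReceiver are made-up names for illustration, not Subversion or
editor-v2 API. Trees are plain path-to-text dicts, and moves and replaces
are left out entirely.

```python
from typing import Dict, List, Protocol

# Hypothetical stand-in for a tree: path -> full text. Not a Subversion type.
Tree = Dict[str, str]


class Receiver(Protocol):
    """The callbacks a subcommand would implement (illustrative only)."""

    def add(self, path: str, text: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def modify(self, path: str, text: str) -> None: ...


def drive(source: Tree, target: Tree, receiver: Receiver) -> None:
    """Fire the events needed to get from source to target.

    A wc-vs-repos, repos-vs-repos or wc-vs-wc driver would differ only in
    how it obtains the two trees; the receiver never needs to know.
    """
    for path in sorted(source.keys() - target.keys()):
        receiver.delete(path)
    for path in sorted(target.keys() - source.keys()):
        receiver.add(path, target[path])
    for path in sorted(source.keys() & target.keys()):
        if source[path] != target[path]:
            receiver.modify(path, target[path])


class SummarizeReceiver:
    """A diff-like receiver: only prints (records) what changed."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def add(self, path: str, text: str) -> None:
        self.lines.append(f"A {path}")

    def delete(self, path: str) -> None:
        self.lines.append(f"D {path}")

    def modify(self, path: str, text: str) -> None:
        self.lines.append(f"M {path}")


class ApplyReceiver:
    """An update/merge-like receiver: applies the same events to a tree."""

    def __init__(self, tree: Tree) -> None:
        self.tree = dict(tree)

    def add(self, path: str, text: str) -> None:
        self.tree[path] = text

    def delete(self, path: str) -> None:
        del self.tree[path]

    def modify(self, path: str, text: str) -> None:
        self.tree[path] = text
```

Both receivers are fed by the same drive() call; only what they do with the
events differs, which is the reuse argued for above.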

There's also a problem with my views. Editor v2, as intended in its design,
gives the callback receivers only full texts, never text delta data. The
idea is that all delta-ing to get to B is hidden behind the API. In case of
an update, the driver fires all events necessary to get from tree A to tree
B, and provides full texts of B. The driver can choose any way it wants to
get to the full texts of B. Fair enough.
But for text diffs, that means that I receive all events that are necessary
to get from tree A to tree B (good) -- and then receive full texts of B from
the driver, after which I have to fetch full texts of A from <blackbox> and
work out the text-diff from those (bad). Are you following? Doesn't sound so
romantic anymore. The editor does not provide the difference, but the result
of applying the difference? I still haven't entirely wrapped my head around
the generalised case of that.
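
A rough Python sketch of what that means for a diff receiver, under the
assumption that the driver hands over only the full text of B.
TextDiffReceiver and fetch_old_text are hypothetical names; fetch_old_text
plays the role of the <blackbox> that supplies the text of A. The text diff
is worked out locally with Python's difflib, not by anything in Subversion.

```python
import difflib
from typing import Callable, List


class TextDiffReceiver:
    """Hypothetical receiver that only ever gets the full text of B."""

    def __init__(self, fetch_old_text: Callable[[str], str]) -> None:
        # Stand-in for the <blackbox> that can produce the full text of A.
        self._fetch_old_text = fetch_old_text
        self.output: List[str] = []

    def modify(self, path: str, new_text: str) -> None:
        # The driver told us the result (B), not the difference, so we
        # have to fetch A ourselves and recompute the text diff.
        old_text = self._fetch_old_text(path)
        self.output.extend(
            difflib.unified_diff(
                old_text.splitlines(keepends=True),
                new_text.splitlines(keepends=True),
                fromfile=f"{path}\t(A)",
                tofile=f"{path}\t(B)",
            )
        )
```

For example, TextDiffReceiver({"foo.c": "one\ntwo\n"}.__getitem__) followed
by modify("foo.c", "ONE\ntwo\n") leaves a unified diff for foo.c in output.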

An advantage is that the API can choose to get at the full texts of B any
way it likes, e.g. from the pristine store via a pristine checksum match, or
taking a shortcut via some other revision. The disadvantage, illustrated in
an example: If I want diff to tell me the difference of my working-copy BASE
and the repos' HEAD (== what update would apply to the WC), and say I have
thousands of huge text files, each of which has a change of only a single
line. The driver would construct each huge text file completely with the
first line adapted, then the diff callback implementations would read each
original huge text file from BASE and compare the two. But, all the time,
the repos knew that only the first line was changed. There was no
need to pass these huge amounts of data through the API functions (locally).
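
To put rough numbers on that, a small Python sketch with made-up data: one
"huge" file of 100,000 short lines in which only the first line differs.
The two helper functions are hypothetical and count bytes naively; they say
nothing about how Subversion actually encodes deltas.

```python
def fulltext_bytes(old: str, new: str) -> int:
    """Bytes a diff receiver touches with a full-text-only API: it is
    handed all of the new text and must read all of the old text too."""
    return len(old.encode()) + len(new.encode())


def changed_line_bytes(old: str, new: str) -> int:
    """Bytes involved if only the changed lines (old and new side) were
    passed along. Naive line-by-line pairing: assumes lines were modified
    in place, with no insertions or deletions."""
    pairs = zip(old.splitlines(keepends=True), new.splitlines(keepends=True))
    return sum(len(o.encode()) + len(n.encode()) for o, n in pairs if o != n)


# Made-up example data: only the first line is changed.
old = "".join(f"line {i}\n" for i in range(100_000))
new = "LINE 0\n" + old[len("line 0\n"):]

full = fulltext_bytes(old, new)
delta = changed_line_bytes(old, new)
print(f"full texts: {full} bytes, changed lines: {delta} bytes")
```

Multiply by thousands of files and the gap between the two figures is the
cost described above.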

So, in a nutshell, editor v2, as it is outlined today, isn't always that
suitable for communicating text-diffs -- if the driver already knows the
text-diff, it has no way of telling the receiver about it. The driver must
provide the full text result and the receiver must work out the text-diff
from that.

Maybe that's how it was intended to be: one editor type for getting to a
given revision in full (delta_editor_t for update, switch) and one for
getting the differences between two revisions/paths (diff_editor for diff,
merge). In my head, they are still both very much related. I have unfinished
business with this topic... and a bunch of homework left to do before I can
start making any real sense to Greg.
</brain-dump>

> 4) Wait by the roadside for editor-v2 to be finished. It is supposed to
> automatically detect tree changes.

And that's the problem: you have further plans for diff, which, like other
things before, want more info than the diff-editor can offer (notably diff
and merge don't use the delta_editor at all but implement their own
diff_editor). All previous attempts (e.g. detecting replaces) deflected at
an early stage, grew some nasty compromise code and went on without going
into depth. Understandable, given the size of considerations, but Bad.

I think if you want to take on (an) editor v2 for diff and merge, that's the
"Best" way to start. But it'll be a lot of work, including theoretical. If
you want to get anything done soonish, I think it would indeed be best to
start playing around with a wc-wc diff, maybe structuring the code in
anticipation of a move towards editor vN-markM. Avoiding implementation of
a uniform API across subcommands for sending tree-/text-differences is, again,
probably Bad, but understandable.

(Note that currently in wc-wc, only diff between @base and the actual
working copy is implemented.)

> Has anyone given any more thoughts to how the diff code could be
> improved?

I'm buried up to the neck in them... :/
Haven't had the time [1] to actually start implementing them.

[1] read: guts

~Neels


Received on 2010-04-28 01:02:21 CEST
