# Re: Changes, Differences, and States

From: Branko Čibej <brane_at_apache.org>
Date: Sat, 17 Oct 2015 12:07:22 +0200

On 14.10.2015 18:50, Julian Foad wrote:
> Here are some thoughts about concepts around changes, differences, states,
> and 'no-op changes' which I will also refer to as 'touching'. It may help
> our discussions regarding no-op changes if we spend a little time thinking
> about these concepts. The summary points are:
>
> * there's more to this than first meets the eye;
> * no-op changes are meaningful in the context of describing a single
> state transition (roughly: a commit), and are not really meaningful in
> comparing two arbitrary non-adjacent states;
>
>
> SEQUENCE OF STATES (REVISIONS)
> WHAT IS A STATE? WHAT IS A DELTA?
> CHANGES VS. DIFFERENCES
> WHAT KINDS OF NO-OP-CHANGE ARE THERE?
> REBASE ON COMMIT
> ARBITRARY DIFF VS. SINGLE-REVISION CHANGE
>
>
>
> SEQUENCE OF STATES (REVISIONS)
>
> Subversion is largely based on the idea of storing state snapshots. There
> is an underlying idea that we can derive changes by comparing state
> snapshots, and we can recreate state snapshots by applying changes.
>
> diff(state_1, state_2) -> delta
>
> apply(state_1, delta) -> state_2
>
> # "diff()" = find the difference between two states
> # "delta" = some representation of a difference
>
> There is a basic concept that we should be able to use diff() and apply()
> among any states (revisions) in a sequence:
>
> state_0
> delta_a
> state_1
> delta_b
> state_2
> delta_c
> state_3
> delta_d
> ...
>
> diff(state_1, state_3) -> delta_1_3
>
> apply(state_1, delta_1_3) -> state_3
>
> It's not quite as simple as that, of course.

One consideration that makes it complicated in theory is that "delta" is
not exactly one transformation but any one of a family of possible
transformations that have the same beginning and end state. In practice,
in Subversion, we tend to create the same transformation for any given
pair of concrete states.

The other consideration, as you say below, is that we're also dealing
with ... call it "object permanence" in the sense that each state is
actually a set of snapshots of concrete objects and (hierarchic)
relations between them that have an existence beyond a single state; so
the delta transformation must also describe object creation and death
(and resurrection) and changes in object relations.

> WHAT IS A STATE? WHAT IS A DELTA?
>
> On the client side, here are some ways we can get a state snapshot out of
> Subversion:
>
> "svn export" plus "svn proplist --recursive"
> "svn checkout"
> "svnrdump dump -r REV" (a non-incremental dump of one revision)
>
> Each of these methods will give us the files, directories and properties,
> which I would call the basic content of the versioned tree.
>
> Let's now imagine that we export two trees and run some arbitrary
> third-party directory-tree-comparison tool to get a representation of the
> difference between them. (Either forget about the properties, for a moment,
> or imagine the tool also supports them.) The output from this comparison
> tool will show us deletes and adds and differences to files, and let's
> suppose it explicitly mentions directories too. Of course it won't know
> about moves and renames.
>
> But Subversion does not only track basic tree content. Something is still
> missing: the line of history -- the information that says whether the file
> (or directory) at path P in state_2 is a continuation of the one at path P
> in state_1, or is a continuation of some other line of history, say at path
> Q in state_1, or is unrelated (a replacement). And that also explains why
> the (imaginary) tree diff above won't know about moves and renames.
>
> What would we need, in a "state" snapshot, in order to know which lines of
> history are connected? If we're looking at two successive revisions, then
> the history information that connects them could be represented by a
> "Modify/Delete/Add/Replace" action-code, along with a "copied-from" source
> reference. An incremental dump of the second revision would give us this
> information for each node.
>
> And for two snapshots that are not successive revisions, say state_1 and
> state_3? The same "Modify/Delete/Add/Replace" action-code, along with
> "copied-from", would work, if those metadata referred to the relationship
> between state_1 and state_3. But note that these metadata are therefore not
> static attributes of state_3. They are not state data, but rather are a
> description of the relationship between the two given states, as reflected
> in the term "action-code".

Yes. That's why a "delta" is a function, not some data; your functional
representation above would better be expressed like this:

diff(state_a, state_b) -> delta_a_b
delta_a_b(state_a) -> state_b
~delta_a_b(state_b) -> state_a

in other words, "diff()" produces transformation functions, not input to
a generic "apply()" function; and the transformation functions have
inverse pairs.
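
To make that concrete, here is a rough Python sketch, purely illustrative.
It assumes a toy "state" that is just a map from path to content, with no
properties or history; the names State, Delta and diff are made up for the
example and are not Subversion APIs.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, str]            # toy model: path -> file content
Delta = Callable[[State], State]  # a delta is a function on states


def diff(state_a: State, state_b: State) -> Tuple[Delta, Delta]:
    """Return (delta_a_b, its inverse) as functions, not as data."""
    # Entries that must go away, and entries that must appear.
    removed = {p: c for p, c in state_a.items() if state_b.get(p) != c}
    added = {p: c for p, c in state_b.items() if state_a.get(p) != c}

    def make(drop: State, put: State) -> Delta:
        def delta(state: State) -> State:
            result = {p: c for p, c in state.items() if p not in drop}
            result.update(put)
            return result
        return delta

    # The inverse simply swaps the roles of 'removed' and 'added'.
    return make(removed, added), make(added, removed)


state_1 = {"trunk/a": "1", "trunk/b": "2"}
state_3 = {"trunk/a": "1", "trunk/b": "3", "trunk/c": "4"}
delta_1_3, inverse_1_3 = diff(state_1, state_3)
```

This is only one member of the family of possible transformations between
the two states; it says nothing about object identity or lines of history.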

> What if we want to attach metadata to each state, independently, that will
> later allow us to properly compare two states and know the line-of-history
> relationships among their nodes? To do that, we would need to attach the FS
> node-revision identifier (node_id . copy_id . txn_id), or something very
> much like it, to each node. By comparing the node_id and copy_id of each
> node in state_1 with those in state_2 (or state_3, ...) we can know whether
> each node's relationship is a succession, a replacement, a copy, and so on.
> (Maybe we don't need to look at the txn_id part for this, but we'll come
> back to that later.)
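
A minimal sketch of that comparison, in Python. The classification rule
below is a deliberate simplification that I am assuming for illustration
(same node_id and copy_id means succession, same node_id with a different
copy_id means a copy, a different node_id means a replacement); it is not
the actual FS logic, and NodeRevId and relationship are hypothetical names.

```python
from typing import NamedTuple, Optional


class NodeRevId(NamedTuple):
    """Toy stand-in for a node-revision identifier."""
    node_id: str
    copy_id: str
    txn_id: str


def relationship(old: Optional[NodeRevId], new: Optional[NodeRevId]) -> str:
    """Classify how the node at one path in two states is related.

    Assumed, simplified rules; txn_id is intentionally ignored here.
    """
    if old is None and new is None:
        return "absent"
    if old is None:
        return "added"
    if new is None:
        return "deleted"
    if old.node_id != new.node_id:
        return "replaced"      # unrelated lines of history
    if old.copy_id != new.copy_id:
        return "copy"          # same line of history, different branch
    return "succession"        # same node, same branch


before = NodeRevId("n5", "c0", "t10")
after = NodeRevId("n5", "c0", "t12")
kind = relationship(before, after)
```
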
>
> In concept, then, we could define also a kind of delta (the result of
> comparing any two states) that incorporates line-of-history information,
> whether by means of "M/D/A/R" action-codes or by incorporating the actual
> FS node-revision identifiers or any other way of communicating the
> connections. The exact form is not important right now, just the concept
> that it's possible.
>
> Do we have APIs that produce such deltas right now? For successive
> revisions, certainly: "svnadmin dump --incremental" is one example. For an
> arbitrary pair of revisions, I am not sure we have any such API:
> svn_ra_do_diff3() for example does not report copy-from linkage. We lack,
> too, a user interface that shows a formally complete delta including the
> complete history information, although 'svn diff --git' now reports some
> bits of it.

But we do have an API that can transform any revision into any other
revision: it's called 'svn update'. And by association, the delta editor
is the API that produces the arbitrary delta. You don't have to think in
terms of a single API or in terms of any particular output our commands
produce.

> CHANGES VS. DIFFERENCES
>
> Given any two stored states (revisions, let's say), we can look at the
> difference between them. If asked to describe the difference, we have to
> choose how to describe it: what kind of representation or 'delta' we want
> to produce. We have lots of choices -- look how many output-control options
> 'svn diff' has:
>
> --no-diff-deleted
> --ignore-properties
> --properties-only
> --notice-ancestry
> --summarize
> --xml
> --git
>
> The best choice depends on what we want to do with this delta.

I think it's misleading to call the output of 'svn diff' a "delta" ...
given how you defined "delta" earlier. Not a single variant of "svn
diff" can actually produce output that can be used to transform one
state (revision) to another.

> Do we want
> to be able to recreate the second state from the first state and the delta,
> or to recreate either state from the other state and the delta, or do we
> just want some kind of summary of the differences? Do we want a
> minimum-sized delta, or one that includes enough context to be able to
> apply it to a different but roughly similar initial state?
>
> Despite all the choices, comparing two states and outputting a
> representation of the difference between them is pretty straightforward.
> The key unstated assumption here, in defining what it means to compare two
> states, is that our input consists of just these two states and not the
> entire history of the repository and their relationship to it. In reality,
> in some cases the code that is comparing two states does have access to the
> repository, and we may want to use that to define 'difference' a little
> differently -- such as by comparing against any 'copy-from' sources.
>
> Now let's consider the process of *creating* a new state -- that is, making
> a commit.
>
> We open up a commit editor, which drives changes to a "transaction" that is
> based on some revision (which was the head revision at the time we started
> the commit). We describe a set of changes to the state represented in that
> transaction. What sort of changes do we describe? In the svn_delta_editor_t
> interface, we use these methods in a depth-first recursive descent into the
> path space, opening every directory that we descend into:
>
> delete_entry
> open_directory/file
> change_dir/file_prop
> apply_textdelta
> close_directory/file
>
> Calling change_*_prop or apply_textdelta is an indication that we're
> changing the properties or text of a node, but we might just "change" the
> property or text to the same value that it already had. Also note that, at
> some API levels, it is possible to finish one svn_delta_editor_t editing
> session and then start another one on the same transaction, which opens up
> the possibility of making all sorts of changes to the transaction such as
> creating new files and directories, and subsequently removing them to leave
> the final state (or parts of the final state) in the transaction exactly
> the same as the initial state in the transaction's base.
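
As an illustration of how such edits can cancel out, here is a toy Python
model of a transaction. It is a sketch under heavy assumptions: the tree
is a flat path-to-text map, and ToyTxn with its methods is invented for
this example; it only borrows two method names from the editor interface
and does not reproduce svn_delta_editor_t.

```python
from typing import Dict, Set


class ToyTxn:
    """Toy transaction: remembers what was touched, not just the result."""

    def __init__(self, base: Dict[str, str]) -> None:
        self.base = dict(base)       # state of the base revision
        self.tree = dict(base)       # state being edited
        self.touched: Set[str] = set()

    def apply_text(self, path: str, text: str) -> None:
        # Setting text counts as a touch even if the value is unchanged.
        self.tree[path] = text
        self.touched.add(path)

    def delete_entry(self, path: str) -> None:
        self.tree.pop(path, None)
        self.touched.add(path)

    def really_different(self) -> Set[str]:
        """Paths whose final state differs from the base state."""
        paths = set(self.base) | set(self.tree)
        return {p for p in paths if self.base.get(p) != self.tree.get(p)}

    def no_op_changes(self) -> Set[str]:
        """Paths that were edited but ended up where they started."""
        return self.touched - self.really_different()


txn = ToyTxn({"a": "x"})
txn.apply_text("a", "x")        # "change" to the same value
txn.apply_text("tmp", "t")      # create a file ...
txn.delete_entry("tmp")         # ... and remove it again
txn.apply_text("b", "new")      # a real change
```

Here "a" and "tmp" are no-op changes, while "b" is a real difference.
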
>
> Are no-op-changes made in this way meaningful? If the edit driver chose the
> sequence of edits deliberately to control the creation of no-op-changes,
> then yes they are meaningful. In practice, typically a Subversion client
> (such as 'svn') drives the editor based on changes it has found in a
> working copy. The Subversion WC does not provide a way of scheduling no-op
> changes. Most versions of Subversion client software avoid making any
> no-op-change edits when committing from a WC, but around the 1.5 to 1.7
> time frame I understand 'svn' would often commit no-op property changes.
> Meaningfully? I'm pretty sure not, but rather just as a side effect of the
> way the code was written, not deliberately doing so, nor particularly
> trying to avoid doing so.
>
>
> WHAT KINDS OF NO-OP-CHANGE ARE THERE?
>
> What evidence do we see of no-op changes on the client side (other than
> through 'svnrdump dump')?
>
> the "log -v" changes list -- 'M' for modified, but not for copies
> the "log -v --xml" changes list -- also shows separate flags for text &
> props
> the "last changed" revision/author/date in a WC -- per node
>
> What kinds (or granularities) of no-op change do we think there are? How
>
> text touched (for a file node)
> props touched (that is, the properties-list is 'touched')
> node-touched (with props maybe touched, text maybe touched?)
>
> These kinds of no-op-change do seem to be representable in the current FSFS
> backend. Maybe other kinds are too. These three kinds seem to be sufficient
> to account for all the evidence of no-op changes on the client side that
> we've noticed so far.
>
> If we think there are these three different kinds of no-op change, then
> what combinations of these do we distinguish, let's say for a file node?
>
> node: nothing (props: nothing, text: nothing);
> node: touched (props: touched, text: touched);
> node: touched (props: touched, text: nothing);
> node: touched (props: nothing, text: touched);
> node: changed (props: nothing, text: changed);
> node: changed (props: touched, text: changed);
> node: changed (props: changed, text: changed);
> node: changed (props: changed, text: touched);
> node: changed (props: changed, text: nothing);
>
> Or should this additional state also be possible?
>
> node: touched (props: nothing, text: nothing);
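
One way to read the nine combinations listed first is that the node-level
status is just the "strongest" of the props and text statuses. The Python
sketch below enumerates them under that assumption; Status and node_status
are names made up for the example.

```python
from enum import IntEnum
from itertools import product


class Status(IntEnum):
    NOTHING = 0
    TOUCHED = 1
    CHANGED = 2


def node_status(props: Status, text: Status) -> Status:
    """Assumed rule: the node status is the stronger of the two parts."""
    return max(props, text)


# All 3 x 3 combinations of (props, text) for a file node.
combos = {(p, t): node_status(p, t) for p, t in product(Status, repeat=2)}

for (props, text), node in combos.items():
    print(f"node: {node.name.lower()} "
          f"(props: {props.name.lower()}, text: {text.name.lower()});")
```

Under this rule the additional state (node touched, props nothing, text
nothing) cannot be derived from the parts, so supporting it would need a
node-level flag that is recorded independently.
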
>
> In principle, the kinds of no-op change that *could* be stored in the
> back-end are the entire set of ultimately futile sub-sequences of driving
> the commit API. It is easy to imagine that the BDB back-end may store
> (intentionally or unintentionally) more details of the sequence of commit
> API interactions than FSFS does. An even more loggy back-end might store
> the entire sequence of commit API interactions. Remember that Subversion
> back-ends are not controlled by us; there are third-party ones as well. In
> general, then, recording no-op-changes is like a subset of logging the full
> commit edit drive.
>
> The important issue then is not whether the back-end does notice and record
> such sequences of API interaction, but whether it intentionally and
> consistently records and replays them through the APIs that we have
> designed and defined.
>
> Then we come to the issue of defining and testing the behaviour. It seems
> to me we haven't defined the behaviour we're wanting, in any detail at all,
> and don't have any tests (ok perhaps one since a week ago). If we're
> completely clear in our minds what combination of behaviour we're
> expecting, and think it's well tested in the real world, maybe we'll be ok
> just changing the code to ... something ... and crossing our fingers. Is
> that what we want to do?
>
>
> REBASE ON COMMIT
>
> When we make a commit, the changes we describe to the commit editor are not
> the only changes that will go into the new revision. In order to commit the
> transaction, we first need to rebase it -- that is, to merge it with the
> changes made by any other transactions that have been committed in the
> meantime. In doing so, the server will decide how to combine each base:txn
> change with each base:head change. The standard three-way merge concept
> says when no-difference is merged with a difference, the difference wins,
> but that says nothing about touches. How shall we define the outcome of
> merging the following pairs?
>
> base:head    base:txn     change to commit
> =========    =========    ================
> nothing      nothing      nothing
> touch        nothing      nothing
> difference   nothing      nothing
> nothing      touch        touch
> touch        touch        touch? nothing?
> difference   touch        touch? conflict?
> nothing      difference   difference
> touch        difference   difference
> difference   difference   conflict
>
> Why -- what principle guides us to choose a particular outcome here?
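
The table can be written down as a lookup, which at least makes the
undecided cells explicit. This Python sketch copies the rows as given and
leaves the two questioned outcomes as None rather than picking one;
MERGE_TABLE and change_to_commit are hypothetical names, not anything in
the server code.

```python
from typing import Dict, Optional, Tuple

# (base:head, base:txn) -> change to commit; None marks an open question.
MERGE_TABLE: Dict[Tuple[str, str], Optional[str]] = {
    ("nothing", "nothing"): "nothing",
    ("touch", "nothing"): "nothing",
    ("difference", "nothing"): "nothing",
    ("nothing", "touch"): "touch",
    ("touch", "touch"): None,          # touch? nothing?
    ("difference", "touch"): None,     # touch? conflict?
    ("nothing", "difference"): "difference",
    ("touch", "difference"): "difference",
    ("difference", "difference"): "conflict",
}


def change_to_commit(base_head: str, base_txn: str) -> Optional[str]:
    """Look up the outcome of rebasing one txn change onto head."""
    return MERGE_TABLE[(base_head, base_txn)]


undecided = sorted(k for k, v in MERGE_TABLE.items() if v is None)
```

Whatever principle we settle on has to fill in exactly those two cells.
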
>
> Thinking more abstractly, what we have here is that a 'change' that at
> first sight was composed by the client specifying a single set of changes
> to a base state, is then in fact merged with another set of changes
> (multiple revisions), and only the result of that merge is committed. The
> concept of a commit representing a single set of changes as specified by
> the client is thus not as true as it first seems.
>
>
> ARBITRARY DIFF VS. SINGLE-REVISION CHANGE
>
> Getting closer to the real problem we have...
>
> As best I understand it, the idea of recording a no-op-change is meaningful
> and relatively straightforward to define at the level of a single state
> transition. We think of a commit as such a transition, and it is, but as
> mentioned above it's not in general the exact same transition that the
> client described.
>
> Attempting to derive a notion of 'no-op-change' that applies to a
> difference taken between an arbitrary pair of points in the version
> history, on the other hand, is not at all straightforward, and we do not
> have a concept of its meaning in relation to merging and so on.
>
> Now, the "svn log -v" output clearly applies to a single commit, a single
> state transition, and thus we find the indication of no-op changes there to
> be somewhat satisfactory. The code that generates this output, on the other
> hand, uses APIs that compare arbitrary points in history, such as
>
> svn_fs_contents_changed(root1:path1, root2:path2)
> svn_fs_props_changed
>
> Comparing arbitrary points in history is an operation that, throughout
> pretty much all of the version control system, is used really only when we
> want and need to know about real changes. Hence the definition of a new
> pair of APIs,
>
> svn_fs_contents_different
> svn_fs_props_different
>
> to specifically provide that meaning.
>
> What purpose remains for the original _changed() APIs, then? At first it
> wasn't clear there was any real purpose, but if we want "svn log" etc. to
> continue as before, then we need something like them. Except for this
> purpose we don't need APIs that compare two arbitrary states; we need APIs
> that compare two successive states, because this 'touched' concept only
> makes sense in this context.
>
>
> - Julian
>