Changes, Differences, and States

From: Julian Foad <julianfoad_at_gmail.com>
Date: Wed, 14 Oct 2015 17:50:07 +0100

Here are some thoughts about concepts around changes, differences, states,
and 'no-op changes' which I will also refer to as 'touching'. It may help
our discussions regarding no-op changes if we spend a little time thinking
about this. This email does not have a firm conclusion. A couple of TL;DR
summary points are:

* there's more to this than first meets the eye;
* no-op changes are meaningful in the context of describing a single
state transition (roughly: a commit), and are not really meaningful in
comparing two arbitrary non-adjacent states.

SEQUENCE OF STATES (REVISIONS)
WHAT IS A STATE? WHAT IS A DELTA?
CHANGES VS. DIFFERENCES
WHAT KINDS OF NO-OP-CHANGE ARE THERE?
REBASE ON COMMIT
ARBITRARY DIFF VS. SINGLE-REVISION CHANGE

SEQUENCE OF STATES (REVISIONS)

Subversion is largely based on the idea of storing state snapshots. There
is an underlying idea that we can derive changes by comparing state
snapshots, and we can recreate state snapshots by applying changes.

diff(state_1, state_2) -> delta

apply(state_1, delta) -> state_2

# "diff()" = find the difference between two states
# "delta" = some representation of a difference

There is a basic concept that we should be able to use diff() and apply()
among any states (revisions) in a sequence:

    state_0
               delta_a
    state_1
               delta_b
    state_2
               delta_c
    state_3
               delta_d
    ...

diff(state_1, state_3) -> delta_1_3

apply(state_1, delta_1_3) -> state_3

It's not quite as simple as that, of course.
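
As a toy illustration of the diff()/apply() idea, a state can be modelled as a mapping from path to content. This is only a sketch under that assumption: the dict model, the function bodies and the ("add"/"delete"/"modify") tuples are invented for illustration and are not Subversion's delta representation.

```python
# Toy model: a state is a dict mapping path -> content.
# Illustrative sketch only; not Subversion's delta format.

def diff(state_a, state_b):
    """Return a delta: a dict mapping path -> (action, new_content)."""
    delta = {}
    for path in state_a.keys() | state_b.keys():
        if path not in state_b:
            delta[path] = ("delete", None)
        elif path not in state_a:
            delta[path] = ("add", state_b[path])
        elif state_a[path] != state_b[path]:
            delta[path] = ("modify", state_b[path])
    return delta


def apply(state, delta):
    """Return a new state obtained by applying delta to state."""
    result = dict(state)
    for path, (action, content) in delta.items():
        if action == "delete":
            del result[path]
        else:  # "add" and "modify" both set the new content
            result[path] = content
    return result


state_1 = {"trunk/a.txt": "one", "trunk/b.txt": "two"}
state_2 = {"trunk/a.txt": "one!", "trunk/b.txt": "two"}
state_3 = {"trunk/a.txt": "one!", "trunk/c.txt": "three"}

# A delta between non-adjacent states still recreates the target state.
delta_1_3 = diff(state_1, state_3)
assert apply(state_1, delta_1_3) == state_3
```

Note that a delta of this kind has nowhere to record a no-op change or a line of history: two identical states always produce an empty delta.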

WHAT IS A STATE? WHAT IS A DELTA?

On the client side, here are some ways we can get a state snapshot out of
Subversion:

    "svn export" plus "svn proplist --recursive"
    "svn checkout"
    "svnrdump dump -r REV" (a non-incremental dump of one revision)

Each of these methods will give us the files, directories and properties,
which I would call the basic content of the versioned tree.

Let's now imagine that we export two trees and run some arbitrary
third-party directory-tree-comparison tool to get a representation of the
difference between them. (Either forget about the properties, for a moment,
or imagine the tool also supports them.) The output from this comparison
tool will show us deletes and adds and differences to files, and let's
suppose it explicitly mentions directories too. Of course it won't know
about moves or renames.

But Subversion does not only track basic tree content. Something is still
missing: the line of history -- the information that says whether the file
(or directory) at path P in state_2 is a continuation of the one at path P
in state_1, or is a continuation of some other line of history, say at path
Q in state_1, or is unrelated (a replacement). And that also explains why
the (imaginary) tree diff above won't know about moves and renames.

What would we need, in a "state" snapshot, in order to know which lines of
history are connected? If we're looking at two successive revisions, then
the history information that connects them could be represented by a
"Modify/Delete/Add/Replace" action-code, along with a "copied-from" source
reference. An incremental dump of the second revision would give us this
information for each node.

And for two snapshots that are not successive revisions, say state_1 and
state_3? The same "Modify/Delete/Add/Replace" action-code, along with
"copied-from", would work, if those metadata referred to the relationship
between state_1 and state_3. But note that these metadata are therefore not
static attributes of state_3. They are not state data, but rather are a
description of the relationship between the two given states, as reflected
in the term "action-code".

What if we want to attach metadata to each state, independently, that will
later allow us to properly compare two states and know the line-of-history
relationships among their nodes? To do that, we would need to attach the FS
node-revision identifier (node_id . copy_id . txn_id), or something very
much like it, to each node. By comparing the node_id and copy_id of each
node in state_1 with those in state_2 (or state_3, ...) we can know whether
each node's relationship is a succession, a replacement, a copy, and so on.
(Maybe we don't need to look at the txn_id part for this, but we'll come
back to that later.)
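
A minimal sketch of such a comparison follows. The NodeRevId tuple, the relationship() function and its rules are assumptions made for illustration; they are a deliberate simplification and not the FS implementation, whose rules for relatedness are more subtle.

```python
from collections import namedtuple

# Hypothetical stand-in for an FS node-revision identifier.
NodeRevId = namedtuple("NodeRevId", ["node_id", "copy_id", "txn_id"])


def relationship(old, new):
    """Classify how 'new' relates to 'old' (simplified assumption).

    The txn_id part is deliberately ignored here.
    """
    if old.node_id != new.node_id:
        return "unrelated"    # e.g. a replacement at the same path
    if old.copy_id != new.copy_id:
        return "copy"         # same line of history, reached via a copy
    return "succession"       # same node on the same branch


in_state_1 = NodeRevId(node_id="5", copy_id="0", txn_id="10")
in_state_2 = NodeRevId(node_id="5", copy_id="0", txn_id="11")
print(relationship(in_state_1, in_state_2))
```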

In concept, then, we could also define a kind of delta (the result of
comparing any two states) that incorporates line-of-history information,
whether by means of "M/D/A/R" action-codes or by incorporating the actual
FS node-revision identifiers or any other way of communicating the
connections. The exact form is not important right now, just the concept
that it's possible.

Do we have APIs that produce such deltas right now? For successive
revisions, certainly: "svnadmin dump --incremental" is one example. For an
arbitrary pair of revisions, I am not sure we have any such API:
svn_ra_do_diff3() for example does not report copy-from linkage. We lack,
too, a user interface that shows a formally complete delta including the
complete history information, although 'svn diff --git' now reports some
bits of it.

CHANGES VS. DIFFERENCES

Given any two stored states (revisions, let's say), we can look at the
difference between them. If asked to describe the difference, we have to
choose how to describe it: what kind of representation or 'delta' we want
to produce. We have lots of choices -- look how many output-control options
'svn diff' takes already:

  --no-diff-added
  --no-diff-deleted
  --ignore-properties
  --properties-only
  --show-copies-as-adds
  --notice-ancestry
  --summarize
  --xml
  --git

The best choice depends on what we want to do with this delta. Do we want
to be able to recreate the second state from the first state and the delta,
or to recreate either state from the other state and the delta, or do we
just want some kind of summary of the differences? Do we want a
minimum-sized delta, or one that includes enough context to be able to
apply it to a different but roughly similar initial state?

Despite all the choices, comparing two states and outputting a
representation of the difference between them is pretty straightforward.
The key unstated assumption here, in defining what it means to compare two
states, is that our input consists of just these two states and not the
entire history of the repository and their relationship to it. In reality,
in some cases the code that is comparing two states does have access to the
repository, and we may want to use that to define 'difference' a little
differently -- such as by comparing against any 'copy-from' sources.

Now let's consider the process of *creating* a new state -- that is, making
a commit.

We open up a commit editor, which drives changes to a "transaction" that is
based on some revision (which was the head revision at the time we started
the commit). We describe a set of changes to the state represented in that
transaction. What sort of changes do we describe? In the svn_delta_editor_t
interface, we use these methods in a depth-first recursive descent into the
path space, opening every directory that we descend into:

    delete_entry
    add_directory/file
    open_directory/file
    change_dir/file_prop
    apply_textdelta
    close_directory/file

Calling change_*_prop or apply_textdelta is an indication that we're
changing the properties or text of a node, but we might just "change" the
property or text to the same value that it already had. Also note that, at
some API levels, it is possible to finish one svn_delta_editor_t editing
session and then start another one on the same transaction, which opens up
the possibility of making all sorts of changes to the transaction such as
creating new files and directories, and subsequently removing them to leave
the final state (or parts of the final state) in the transaction exactly
the same as the initial state in the transaction's base.

Are no-op-changes made in this way meaningful? If the edit driver chose the
sequence of edits deliberately to control the creation of no-op-changes,
then yes they are meaningful. In practice, typically a Subversion client
(such as 'svn') drives the editor based on changes it has found in a
working copy. The Subversion WC does not provide a way of scheduling no-op
changes. Most versions of Subversion client software avoid making any
no-op-change edits when committing from a WC, but around the 1.5 to 1.7
time frame I understand 'svn' would often commit no-op property changes.
Meaningfully? I'm pretty sure not, but rather just as a side effect of the
way the code was written, not deliberately doing so, nor particularly
trying to avoid doing so.

WHAT KINDS OF NO-OP-CHANGE ARE THERE?

What evidence do we see of no-op changes on the client side (other than
through 'svnrdump dump')?

    the "log -v" changes list -- 'M' for modified, but not for copies
    the "log -v --xml" changes list -- also shows separate flags for
      text & props
    the "last changed" revision/author/date in a WC -- per node

What kinds (or granularities) of no-op change do we think there are? How
about:

    text touched (for a file node)
    props touched (that is, the properties-list is 'touched')
    node-touched (with props maybe touched, text maybe touched?)

These kinds of no-op-change do seem to be representable in the current FSFS
backend. Maybe other kinds are too. These three kinds seem to be sufficient
to account for all the evidence of no-op changes on the client side that
we've noticed so far.

If we think there are these three different kinds of no-op change, then
what combinations of these do we distinguish, let's say for a file node?

    node: nothing (props: nothing, text: nothing);
    node: touched (props: touched, text: touched);
    node: touched (props: touched, text: nothing);
    node: touched (props: nothing, text: touched);
    node: changed (props: nothing, text: changed);
    node: changed (props: touched, text: changed);
    node: changed (props: changed, text: changed);
    node: changed (props: changed, text: touched);
    node: changed (props: changed, text: nothing);

Or should this additional state also be possible?

    node: touched (props: nothing, text: nothing);
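
One way to see where the nine combinations come from, and why the additional state is a separate question, is to assume that the node-level status is simply the stronger of the props status and the text status. That rule is an assumption made for this sketch, not something Subversion defines; the names LEVELS and node_status() are hypothetical.

```python
from itertools import product

# Ordered from weakest to strongest.
LEVELS = ["nothing", "touched", "changed"]


def node_status(props, text):
    """Assumed rule: the node status is the stronger of props and text."""
    return max(props, text, key=LEVELS.index)


# Every (props, text) pair, with the node status derived from it.
combos = [(node_status(p, t), p, t) for p, t in product(LEVELS, repeat=2)]

for node, props, text in combos:
    print(f"node: {node} (props: {props}, text: {text});")
```

Under this assumed rule the 3 x 3 pairs give exactly the nine combinations above. The additional state, a touched node whose props and text are both untouched, cannot be derived from the two parts; it would need the node-level 'touched' flag to be stored independently.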

In principle, the kinds of no-op change that *could* be stored in the
back-end are the entire set of ultimately futile sub-sequences of driving
the commit API. It is easy to imagine that the BDB back-end may store
(intentionally or unintentionally) more details of the sequence of commit
API interactions than FSFS does. An even more loggy back-end might store
the entire sequence of commit API interactions. Remember that Subversion
back-ends are not controlled by us; there are third-party ones as well. In
general, then, recording no-op-changes is like a subset of logging the full
commit edit drive.

The important issue then is not whether the back-end does notice and record
such sequences of API interaction, but whether it intentionally and
consistently records and replays them through the APIs that we have
designed and defined.

Then we come to the issue of defining and testing the behaviour. It seems
to me we haven't defined the behaviour we're wanting, in any detail at all,
and don't have any tests (ok perhaps one since a week ago). If we're
completely clear in our minds what combination of behaviour we're
expecting, and think it's well tested in the real world, maybe we'll be ok
just changing the code to ... something ... and crossing our fingers. Is
that what we want to do?

REBASE ON COMMIT

When we make a commit, the changes we describe to the commit editor are not
the only changes that will go into the new revision. In order to commit the
transaction, we first need to rebase it -- that is, to merge it with the
changes made by any other transactions that have been committed in the
meantime. In doing so, the server will decide how to combine each base:txn
change with each base:head change. The standard three-way merge concept
says when no-difference is merged with a difference, the difference wins,
but that says nothing about touches. How shall we define the outcome of
merging the following pairs?

    base:head     base:txn      change to commit
    ==========    ==========    ================
    nothing       nothing       nothing
    touch         nothing       nothing
    difference    nothing       nothing
    nothing       touch         touch
    touch         touch         touch? nothing?
    difference    touch         touch? conflict?
    nothing       difference    difference
    touch         difference    difference
    difference    difference    conflict

Why -- what principle guides us to choose a particular outcome here?
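
The table can be written down as data, which makes the open cases easy to pick out. This is only a restatement of the table, not server behaviour; the names REBASE_OUTCOMES and undecided_cases() are hypothetical, and a set with more than one member stands for a row marked with question marks.

```python
# Outcomes keyed by (base:head, base:txn).
# A set with more than one member marks a case the table leaves open.
REBASE_OUTCOMES = {
    ("nothing", "nothing"): {"nothing"},
    ("touch", "nothing"): {"nothing"},
    ("difference", "nothing"): {"nothing"},
    ("nothing", "touch"): {"touch"},
    ("touch", "touch"): {"touch", "nothing"},
    ("difference", "touch"): {"touch", "conflict"},
    ("nothing", "difference"): {"difference"},
    ("touch", "difference"): {"difference"},
    ("difference", "difference"): {"conflict"},
}


def undecided_cases(table):
    """Return the (base:head, base:txn) pairs with no single outcome."""
    return sorted(pair for pair, outcomes in table.items()
                  if len(outcomes) > 1)


print(undecided_cases(REBASE_OUTCOMES))
```

Both undecided rows have a touch on the base:txn side, which is where the three-way merge rule gives no guidance.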

Thinking more abstractly, what we have here is that a 'change' that at
first sight was composed by the client specifying a single set of changes
to a base state, is then in fact merged with another set of changes
(multiple revisions), and only the result of that merge is committed. The
concept of a commit representing a single set of changes as specified by
the client is thus not as true as it first seems.

ARBITRARY DIFF VS. SINGLE-REVISION CHANGE

Getting closer to the real problem we have...

As best I understand it, the idea of recording a no-op-change is meaningful
and relatively straightforward to define at the level of a single state
transition. We think of a commit as such a transition, and it is, but as
mentioned above it's not in general the exact same transition that the
client described.

Attempting to derive a notion of 'no-op-change' that applies to a
difference taken between an arbitrary pair of points in the version
history, on the other hand, is not at all straightforward, and we do not
have a concept of its meaning in relation to merging and so on.

Now, the "svn log -v" output clearly applies to a single commit, a single
state transition, and thus we find the indication of no-op changes there to
be somewhat satisfactory. The code that generates this output, on the other
hand, uses APIs that compare arbitrary points in history, such as

svn_fs_contents_changed(root1:path1, root2:path2)
svn_fs_props_changed

Comparing arbitrary points in history is an operation that, throughout
pretty much all of the version control system, is used really only when we
want and need to know about real changes. Hence the definition of a new
pair of APIs,

svn_fs_contents_different
svn_fs_props_different

to specifically provide that meaning.
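
The distinction between the two pairs can be pictured with a toy model. Everything here is an assumption for illustration: the NodeRev class, the text_rep field and both function bodies are hypothetical, and the real svn_fs_* functions take roots and paths and are implemented quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRev:
    text: str
    text_rep: int  # hypothetical id of the stored text representation


def contents_different(a, b):
    """Toy 'different': true only if the text itself differs."""
    return a.text != b.text


def contents_changed(a, b):
    """Toy 'changed': true if the text was rewritten, even identically."""
    return a.text_rep != b.text_rep


r1 = NodeRev(text="hello", text_rep=1)
r2 = NodeRev(text="hello", text_rep=2)   # touched: same text, rewritten
r3 = NodeRev(text="hello!", text_rep=3)  # a real change

print(contents_changed(r1, r2), contents_different(r1, r2))
print(contents_changed(r2, r3), contents_different(r2, r3))
```

In this model a touch shows up as 'changed' but not as 'different', which is the only case where the two answers disagree.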

What purpose remains for the original _changed() APIs, then? At first it
wasn't clear there was any real purpose, but if we want "svn log" etc. to
continue as before, then we need something like them. For this purpose,
however, we don't need APIs that compare two arbitrary states; we need APIs
that compare two successive states, because the 'touched' concept only
makes sense in that context.

- Julian
Received on 2015-10-14 18:50:38 CEST
