[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Moves in FSFS

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Sun, 29 Sep 2013 19:34:18 +0200

On Sat, Sep 28, 2013 at 12:36 PM, Branko Čibej <brane_at_wandisco.com> wrote:

> On 28.09.2013 12:06, Bill Tutt wrote:
>
> Move tracking can clearly make ones head spin. :)
>
> Brane: Could Julian and Stephan be asking about how to map between
> arbitrary revisions thinking of tree merge scenarios?
>
>
> There are two, somewhat related issues. The first involves understanding
> what actually happened to the tree, e.g., detecting that a certain change
> in the tree was a move, not something else. We realized more than 10 years
> ago that in our current top-down DAG model, getting this info just by
> analyzing the tree would be very expensive, and in some cases infeasible;
> so it doesn't really come as a surprise that looking just at the (node-id,
> copy-id) of a node revision is not enough.
>
> That's why the changes list was introduced (as the "changes" table in BDB,
> and an equivalent structure in FSFS); and I think we've come to the
> conclusion that it will have to be used to detect moves as well.
>
>
> That's what always gave me heart burn with the current Noderev behavior
> and envisioning real moves in Subversion.
>
>
> Yup.
>
>
> In thinking deep thoughts about this in the deep dark dark distant past
> I came up with several interesting thoughts: (sadly none of these are
> completely thought out answers, but they did seem like thoughts worth
> pondering wrt Subversion if you haven't before)
> * Have you considered having the noderev own the node's name instead of
> how the parent directory currently owns it? This way it would be natural to
> make the destination of the move a mutated noderev.
>
> Noderevs already know their full paths ("cpath") in FSFS and FSX.
Lazy copying always starts with a representation being shared not
with noderevs being shared as the top element.

>
> Yes. In fact I've made a stab at defining a different, IMO better model
> for storing versioned tree metadata; see:
>
>
> https://wiki.apache.org/subversion/Berlin2013?action=AttachFile&do=view&target=metadata-tng.pdf
>
> In this model, file names are represented as properties of the node, not
> its parent. It also does away with the necessity of having a separate
> changes list. But implementing it will be part of a larger effort which
> includes inventing a new API for the FS layer (in hindisght, we should
> never have exposed the top-down DAG as a public API).
>
>
> i.e. mv /trunk/A -> /trunk/B COULD do this in the DAG:
> * make /trunk/A mutable by ONLY updating the TxnID of the Noderev.
> * Edit the name field of the noderev from A ->B
> More complicated example:
> mv /trunk/A/B/C -> /trunk/A/D COULD do this to the DAG:
> * make /trunk/A/B/C mutable by ONLY updating the TxnID of the
> Noderev.
> * remove it as an entry from /trunk/A/B and add it as an entry to
> /trunk/A
> However... what do you do for this case: (C could have a zero copy_id here
> I think)
> mv /trunk/A/B/C -> /blah could do this to the DAG:
> ????
> or this one: (presuming C has a non-zero copy_id before move)
> mv /branch/A/B/C -> /blah2
> ????
> or (same as the last one, but C was the destination of the "soft-copy")
> ???
> The enumeration of the rest of the cases is left as an exercise for the
> reader. :)
>
>
> That PDF mentions many of these. :)
>

The thing is that we can do whatever we want inside a FS. The only
properties of IDs usably exposed through the FS API are:

* all IDs for path_at_rev must have a the same textual representation
* all IDs for path_at_rev must compare as "equal"
* if path exists continuously throughout a revision range (i.e. not being
  added, deleted or replaced), any pairs of IDs for that path@ any rev
  in that range must compare as "related"
* if path is added @rev *without* history, any ID of path_at_rev-X must
  compare as "unrelated" to any ID for path_at_rev.

To make things slightly trickier, the required information should be
part of the ID object itself since the compare function does not take
a pool parameter.

A more interesting question is how to report moves in our "loggy" API.
Do they create log intervals like copies? If yes, "--stop-on-copy" would
retrieve multiple intervals with different paths. If no, the path may change
within any of these intervals. Client code must be very careful to match
paths correctly in both cases. We are subtly changing the implied API
semantics here.

> * The implementation of rename tracking should ensure you enable all
> scenarios you desire for dealing with merging renamed items from branch A
> to branch B. (the tree shape part of the merge of course)
> That is, if you care, some systems don't/won't/did and then decided not
> to/whatever. Merging renames leads to lousy complex edge case tree
> conflicts that need usable (in the easy to use usability sense) resolution
> UI to solve. .
>
>
> Exactly. And this is the second issue of the two I mentioned: in order to
> make merge aware of moves, as opposed to copies, you have to be able to
> detect that two distinct path_at_revision in fact refer to the same branch
> of a node -- which is exactly the (node-id, copy-id) pair, and that /is/
> sufficient for the purpose. So again, there's no need to invent another
> branch-tracking identifier.
>

Sigh. The (node,copy) pair *alone* cannot be sufficient to map
move sources and targets. You always need some sort of context
information - such as provided by a recursive DAG walk.

Example:

* given /A/f with ID 1.2.3
* cp /A /B
* now /A/f and /B/f share the same DAG node with ID 1.2.3
* mv /B/f /c
* /c may now have DAG node ID 1.2.3 or 1.3.3 (if we expand the
  lazy copy upon move)

Question: What node got moved?
Answer: DAG node 1.2.3 (correct).

Question: What path got moved? (I may want to merge changes)
Answer: Maybe /A/f, maybe /B/f.

Immediate reason: tree_at_rev has more nodes than the whole DAG
and *any* node in tree_at_rev is a potential move source. Unless we
either do aways with lazy copies (hello CVS) or implicitly add new
DAG IDs in old revisions, there cannot be enough FS-level IDs for
a surjective mapping.

In-depth reason: Once you accept that the move semantics as
suggested by Julian's definition is what you want to support, *any*
ID scheme must either be isomorphic to the node-line ID approach
or a true superset of it. In the latter case you must provide auxiliary
rules (can be implicit) that effectively limit the superset to the original.
Otherwise, your implementation does not match the definition.

> (This is almost the opposite of detecting moves -- it's closer to
> detecting sameness, for some definition of "same".)
>
>
> * If moves are lazy, and you need to track moves differently from
> copies then consider: What if you expanded a noderev with a new move_id
> part? :)
>
>
> Here's the thing -- moves are never lazy. For all intents and purposes,
> they behave exactly as content or prop modifications. In fact, you can
> think of a node's name as one of its properties, instead of a property of
> its parent; although of course this is not what the current model does.
>
>
> Some systems solve this problem by recording in the merge history the
> (noderev, copyid) equivalents the merge history was recorded from. (Not
> just when merge was invoked, but also initial merge history for the initial
> copy.) That could also be a way of determining where /trunk/A/B/C/q.c is in
> /branch after a number of arbitrary number of directory hierarchy changes
> and file name alterations somewhere in /branch. These systems don't easily
> support merge history elision concepts though. (As well as requiring
> specialized merge history storage at the FS level for performance reasons.)
> Of course, I would think merge history could benefit from specialized
> storage anyway. :)
>
>
> Indeed, this is another thing we've been talking about, and another goal
> for the next generation versioned FS. We're well aware that the way we
> currently store mergeinfo is ... less than ideal.
>
>
> If you followed this approach the exact way the Noderev node_id/copy_id
> changes might be less relevant.
>
> Sorry for making your head hurt some more with these ancient nutty
> musings,
>
>
> Actually, I'd prefer to see more of your musings on this list ...
>

More input is certainly welcome.

-- Stefan^2.
Received on 2013-09-29 19:34:59 CEST

This is an archived mail posted to the Subversion Dev mailing list.