Thanks for the clarifying mail, Bill. It helped a lot.
Below, Mike and I (we're writing this together) will respond briefly
to a couple of the things you said, and then describe the proposal
we've worked out. We *think* it's the same thing you've been
proposing all along, we just want to describe it in our own words to
> Relatedness: Needs to be very fast. O(1)
> Immediate Predecessor: Needs to be very fast, but could be slightly
> slower than the Relatedness check if necessary. O(1)
> Ancestry check: Needs to be efficient, but by necessity is going to be
> more expensive then the immediate predecessor check. O(n)
+1 on all of these.
Note that the ancestry check is really only used during merge(), that
is, during commits. If entry "somedir/blah" has NodeID Y in the txn
being committed, and NodeID X in head tree being merged into that txn,
then we need to make sure X is an ancestor of Y, so that when the
commit replaces X with Y, we know we're not losing any history.
(You probably already knew this, I'm just describing it for my own
sanity and that of others on this list.)
Therefore, we think we can get away with storing the ancestors in a
distributed fashion, as a chain: each node knows its immediate
predecessor (or "precessors", in the future), and you walk back until
you have your answer. In real life, we won't usually be walking back
too far, and, as we'll describe below, there's a technique by which we
can bound the search even further.
As you pointed out, we *could* store a node's full ancestry set for
each revision of that node, in a separate table or something. If
performance demands that, we will, but we're guessing it's not needed.
(And, as we just discussed on the phone, some of tree.c:merge() can be
made more efficient, to minimize the impact of an O(N) ancestry
> However, the current is_related() check is implemented by performing an
> Ancestry check. So, the current performance of is_related() is currently
> O(n) as opposed to what it hopefully should be: O(1).
Yup, this becomes O(1) in the new proposal.
Okay, now we're going to describe the plan we've come up with, which
will sound suspiciously like the plan *you* came up with. Funny
- We're describing this in terms of the BDB implementation, for the
most part. Translate as necessary.
- This proposal supports one copy (a.k.a. branch) operation. You
can call it anything you want: "copy", "branch", "split",
"pumpkin", whatever. We're calling it "copy" :-). It is the SCM
Here we go...
First, a node's key consists of three parts
This is a bit implementation-specific, but you get the idea. The
"txnID" is just a unique identifier, but we happened to inherit it
from the fs txn, and we needed a name for that portion, so... :-)
Also, the copyID could live in the node's value instead of in its key;
it doesn't really matter which way we go on that.
There are no more mutability flags -- mutability is determined by
examining whether the node key's txnID matches the txn in question.
Therefore, there is no stabilization walk at commit time.
When we commit a change to a node, the nodeID and copyID stay the
same, only the txnID changes. The new txnID is not necessarily
greater than the old one (sometimes txns get committed out of order!),
but anyway it's different from the old txnID, and the same new txnID
is used for all other changes in that commit. A txnID is just a
unique string; in fact, we're going to use base64 probably.
After a commit, the txn record in the transactions table does not go
away; instead, it is updated so it now maps the txnID to the new
revision. This allows us to determine the revision a node was
committed in, in constant time, from the node's key.
Each new version of a node stores the node's predecessor (and does not
store copyform history). When node "5.fzb" is committed as a
successor to "5.qnr", the new node's value stores a reference to
What about copies?
As in the current fs, copies are shallow. The top of the copied tree
gets a new node, but the nodes below it are shared with the copy
source. The new node key keeps the same nodeID, but gets a new txnID,
and gets the next unique copyID (replacing the current copyID, if
In the example below, we copy `A' to `Y'. Node keys for `A', `Y', and
`bar' are given in parentheses:
BEFORE THE COPY AFTER THE COPY
/ | / | \
/ | / | \
/ | / | \
A(3.m) B A(3.m) B Y(3.p.jfb)
/ \ | / \ | / \
foo bar qux foo bar qux foo bar
(5.abc) (5.abc) (5.abc)
Let's flesh out the example with some commits under A and Y. To save
space, the colons represent time flow, not directory hierarchy --
imagine they're the Z axis coming out of the screen or something :-).
/ | \
/ | \
/ | \
A(3.m) B Y(3.p.jfb)
/ \ | / \
foo bar qux foo bar
Let's see how easy it is to answer various questions in this system:
Obviously, is_related() is simple -- just check that the nodeID
portion is the same. You may not know if the relationship is cousins
vs ancestor/descendant, but you know whether or not they're related.
Asking is_predecessor(A,B) is also easy. Just fetch the predecessor
pointer from B and see if it's A.
Finding out what revisions a node changed in is proportional to the
number of changes the node has undergone: start at the node, walk back
through its predecessors, and for each txnID, look up the revision
number via the transactions table (as described earlier).
During this walk, you can always tell when you encounter a node that
results from a copy, because the copyID portion will either change or
disappear entirely. When this happens, you know one of two things is
true: either the previous node in the walk was the top of a copied
tree, or *this* node (the one with the different copyID) was one of
the unchanged nodes down inside a copied tree.
One might think "Oh, we'll distinguish between these cases by walking
up the parents of the node, and seeing if we eventually encounter the
old copyID in one of the parents. That would tell us that we're in
the second case. If we never encounter it, that tells us we're in the
Not so fast, Spidey. We don't have parent pointers -- this is a
predecessor walk by node keys; we can't just walk up the parent path
like that. Fortunately, copyIDs are generated from a new `copies'
table, which maps unique copyIDs onto (REV COPY_DST_PATH
COPY_DST_NODEKEY). We look up the the rev/path for the old copyID,
convert it to a node key, and compare it to the node key we're
currently on. VoilÓ! Actually, we're not sure we'll store all of
those in the copies table, it may boil down to just the DST_NODEKEY or
just the other two, we'll see.
Writing those predecessor walk loops well is left as an exercise for
the reader (umm, more likely for the writers, heh), but you can see
that the necessary questions can be answered efficiently.
Note that, like txnIDs, copyIDs are just unique numbers. They may be
increasing monotonically in the `copies' table, but (due to the fact
that txn A may be started before txn B yet be committed afterwards)
it's quite possible that a higher copyID will appear in the revision
history before a lower one.
The one thing we can know is that a lower copyID can never be a
branchwise descendent of a lower copyID, since the lower one must have
been committed before any of its descendants txns were started, of
course. I'm not sure this minimal inference will ever be useful, but
anyway it's all we've got. Anyway, right now, we only ever need ask
if two copyIDs are equal -- we don't order them.
Okay, now what if A already had copied trees underneath it when we
copied it to Y? Suppose `bar' is down in one of those subdirectories?
(This is the question you just asked me on the phone, I'm repeating it
for the benefit of our radio audience.)
Then when we're committing on /Y/.../bar, we watch for copyIDs as we
walk down from root, like usual, and if we're in a copy underneath a
copy, we bubble down a _new_ copyID, distinct from both Y's and B's,
starting from that point. Justification: a branch of a branch is
still a branch, so it gets its own copyID.
At this point, I'm going to hand-wave on describing the
copy-under-copy behavior any further. I think the above is enough to
see that there are no insurmountable problems here, and that the
filesystem will now have node keys that [lazily] reflect actual
branching patterns. It's time to send this email :-). Unless you (or
anyone else) sees something alarming in this proposal, we're ready to
start on this.
Thanks for all your brainwork, it's been a real help!
-Karl (& Mike, who had to leave before this mail was finished)
To unsubscribe, e-mail: email@example.com
For additional commands, e-mail: firstname.lastname@example.org
Received on Fri May 10 00:34:32 2002