[ pulling several responses into a single thread. Karl, Greg, then Branko ]
On Wed, Jan 03, 2001 at 07:32:44AM -0600, Karl Fogel wrote:
> Greg Stein <firstname.lastname@example.org> writes:
> > Question: should /B/foo and /B/bar know that they came from /A/foo? To put
> > it another way, when somebody pulls up a change log for /B/foo:
> The first answer is, "Yes, they do".
> But the real answer is: individual nodes know nothing, the repository
> knows everything.
AFAICT, the repository doesn't know everything. In my example, I made a copy
of /A/foo to /B/foo, then another copy to /B/bar. Where was /B/bar copied
from? Was it /A/foo or /B/foo?
Without additional information being recorded, I see no way to reconstruct
> When you generate a changelog, all the information
> you need to establish copy/rename/delete history is *available*, the
> only question is whether the log generation process used all of that
The general intent of "it's in there somewhere" is fine, but I think we're
missing some data.
> So this question:
> > 1) they will see the changes from 73.1 thru 73.4
> > 2) will they *also* see that /B/foo was copied from /A/foo?
> Well, the files themselves don't "know" anything, but, yes, if your
> changelog generation process uses data from other sources (such as the
> parent directories), then it will reflect this history.
The changelog would see that /B/foo was introduced at some point in time,
and that it must have been a copy (the 73.4 could not possibly be a new
file). However, it cannot know where to find the original 73.4 (/A/foo)
without an exhaustive search of the tree. As stated above, it also cannot
know which of several it may have been copied from (if multiple instances of
73.4 are linked into the tree).
Even worse, if 73.1 is added to a directory, then it looks like a new file.
There is no way that the changelog generation can determine that it was made
from a copy.
> > I think another way to phrase the question is:
> > Do we record a copy from one location to another in the FS? If so, then
> > how is it recorded? Is there a machine-readable marker? (the checkin
> > comment is not machine-readable)
> Yes. We record it by virtue of the fact that a node is linked to from
> place X in one revision, and from both X and Y in some higher
> revision. In that case, it must have been copied to Y.
I believe this is insufficient; see above.
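
To make the ambiguity concrete, here is a minimal sketch in Python. It
assumes a revision can be modelled as a plain mapping from path to node ID;
that model and the names infer_copy_sources, rev1, rev2, rev3 are
illustrative assumptions, not the real FS interface. It applies the "linked
from X, then from X and Y" rule and shows that the rule yields more than one
candidate source for /B/bar.

```python
# Hypothetical model: a revision is a dict mapping path -> node ID.
# This is an illustration only, not the actual filesystem API.

def infer_copy_sources(old_rev, new_rev):
    """For each path added in new_rev, list the paths in old_rev that
    already linked to the same node ID (the candidate copy sources)."""
    candidates = {}
    for path, node in new_rev.items():
        if path in old_rev:
            continue  # path existed before, so it was not added here
        candidates[path] = sorted(
            p for p, n in old_rev.items() if n == node
        )
    return candidates


# /A/foo exists, is copied to /B/foo, then something is copied to /B/bar.
rev1 = {"/A/foo": "73.4"}
rev2 = {"/A/foo": "73.4", "/B/foo": "73.4"}
rev3 = {"/A/foo": "73.4", "/B/foo": "73.4", "/B/bar": "73.4"}

print(infer_copy_sources(rev1, rev2))  # one candidate for /B/foo
print(infer_copy_sources(rev2, rev3))  # two candidates for /B/bar
```

With only node links to go on, both /A/foo and /B/foo are equally valid
sources for /B/bar, which is the missing data I'm talking about.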
> One might respond, "But it could be an expensive exhaustive search to
> find that out". To which I would answer
> 1. It's the usual `update' problem -- find the places where one
> tree in the repository differs from another, and how exactly.
You can find out that something was added to /B (/B/foo and /B/bar), but you
cannot determine if they were copied from elsewhere, and where they were
> 2. If it is expensive, it can be cached using non-historical
Not only is it expensive, it is impossible :-)
> Does that help?
Yup. It got my brain working a bit more, with new insights.
I believe that we need to rethink how the copies are performed *if* we want
to retain knowledge that they were copied from somewhere else.
Note that if we don't want to record that they were copied, then the history
log for /A/foo, /B/foo, and /B/bar will all look exactly the same (since we
can't tell which is the original, which are copies, and when in that history
we note the copy occurred).
Greg Hudson wrote:
> Greg Stein wrote:
> > Note that a copy is still O(1) because we do the branching only on
> > the top node of the copy, not a whole tree.
> I don't see the motivation for creating the properties if we only do
> it at the top level of a copy; the "log" output in a file under the
> copied directory still wouldn't be able to mention that the file was
> copied at a particular point.
Argh. You're entirely correct.
If we wanted to report "this file was copied from /X/Y/Z/blarg", then we'd
need to scan upwards to find that the current tree was copied from /X/Y. And
per my earlier arguments, we'd have no way of knowing that a particular file
was copied -- we'd always do the scan (usually for no gain) or we'd always
punt the upwards scan and not be able to report copies.
I do agree that copies should remain O(1).
Possibly, during the scan down to /F/G/H/blarg, the FS would see that /F/G
was copied from /X/Y and could return that info when somebody asked the FS
for the copy-history of blarg.
What gets fun is when /F/G is copied from /X/Y, then you copy /F/G/H to
/M/N. What does the history log for /M/N say? Does it just refer to /F/G/H,
or does it go all the way back to /X/Y/Z?
[ I believe it would just go back one step; the FS caller would perform the
additional linking to prior ancestors ]
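
A rough sketch of that idea in Python, under loud assumptions: copy records
are kept only on the root of each copy, they are modelled here as a dict
from copy-root path to source path, and revisions are ignored entirely. The
names copy_origin, copy_chain, and copy_roots are hypothetical, not
anything in the FS today. copy_origin goes back one step, as the FS would;
copy_chain is what a caller might do to link back to prior ancestors.

```python
# Hypothetical: copy records live only on the root of each copy,
# modelled as {copy_root_path: source_path}. Revisions are ignored.

def copy_origin(path, copy_roots):
    """Scan down the path; if some ancestor (or the path itself) is a
    copy root, return the corresponding source path, else None.
    The deepest copy root on the path wins."""
    parts = path.strip("/").split("/")
    origin = None
    for i in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:i])
        if prefix in copy_roots:
            # Re-attach the remaining components to the copy source.
            origin = "/".join([copy_roots[prefix]] + parts[i:])
    return origin


def copy_chain(path, copy_roots):
    """Caller-side linking: follow copy_origin one step at a time."""
    chain = []
    origin = copy_origin(path, copy_roots)
    while origin is not None:
        chain.append(origin)
        origin = copy_origin(origin, copy_roots)
    return chain


copy_roots = {"/F/G": "/X/Y", "/M/N": "/F/G/H"}

print(copy_origin("/F/G/Z/blarg", copy_roots))  # one step back
print(copy_chain("/M/N/blarg", copy_roots))     # all the way back
```

Note the copy itself stays O(1) here: only one record is written per copy,
and the cost moves to the path scan at lookup time.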
Branko Cibej wrote:
> Greg Stein wrote:
> > To answer my own question, and state my desired outcome:
> > *) when a copy is made, we create a new branch (e.g. create 22.214.171.124)
> I think we most emphatically do /not/ want to create a new branch for a
> copy; the semantics are wrong. If we start creating branches for every
> copy, the version tree (as implicit in the structure of the node ID)
> will no longer correspond to real branches in the repository.
> A copy must either stay on the same branch, or create a new node (with a
> backlink to its origin as a property).
Totally fine. We don't have the "baselink" notion today, so I was working
with what we had :-)
> Personally I think creating a new node is best, although we don't
> (right now) have any way to represent the ancestry link (it's an
> immutable non-historic property, after all).
Agreed -- I would also prefer a "baselink" over a branch. That also cleans
up the "add properties" issues (but again: I worked with what I had :-).
> Well ... except if we don't care about what the node ID structure says,
> and are only interested in branches on the repository root?
Honestly, I'm a bit confused about what a "branch" truly means for us. I was
under the impression that a "branch" was always done with "copy it over
there and begin working." But that doesn't bother me right now (read: don't
lose the focus of this email :-); the issue I'm concerned with is recording
> Hmm. In that case, I'd still prefer that the copy itself touches only
> the directory, and we branch off the file only when changes are actually
> made to it. That would be consistent with the way we (will) implement
> tags and "real" branches.
Well, I think we'd be touching a directory or a file -- depends on what is
copied. If I just copy "blarg", then we'd create a new blarg node with a
baselink to the original. Yes, the directory gets updated, but the directory
doesn't record baselinks for the nodes contained within it (I'd say the node
itself retained that information).
To keep the copy operation O(1), the baselink would only occur on the root
of a directory-copy. As you state: changes to children would create branches
at that point.
That does introduce a bit of a dichotomy: the root of a copy uses a
baselink; children use a branch. I think we can determine if a child was
copied by noting the parent-copy (presence of a baselink) during the
traversal down to the child. However, this poses problems in determining
whether we've made a copy-on-write of a child: how can you tell if the
copy-on-write has occurred already? If it is based on baselink presence,
then we run into a problem with children in the copy-source who already had
Possible answer: the baselink records the revision in which the copy was
made. The parent has revision N in its baselink. When we traverse to a
child and find we need to do a copy-on-write, we examine the child's baselink
revision (if present; call it M). If M < N, then the child was from a copy
made before the parent was copied, and we need to create another copy. If
M > N, then we've already made a copy and can simply change the child.
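
As a sketch of that comparison in Python (the Node class, the baselink_rev
field, and needs_copy_on_write are hypothetical names for illustration).
Two cases are not covered by the rule above and are assumptions here: a
child with no baselink is treated as still shared with the copy source, and
M == N is treated as already copied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    # Revision recorded in this node's baselink, or None if it has none.
    baselink_rev: Optional[int] = None


def needs_copy_on_write(parent_copy_rev: int, child: Node) -> bool:
    """Decide whether the child must be copied before it is changed.

    parent_copy_rev is N, the revision in the parent's baselink.
    """
    m = child.baselink_rev
    if m is None:
        # Assumption: no baselink means the child is still shared
        # with the copy source, so it must be copied first.
        return True
    # M < N: the child's baselink predates the parent copy, so copy again.
    # M > N: the child was already copied after the parent copy.
    # Assumption: M == N is treated as already copied.
    return m < parent_copy_rev


print(needs_copy_on_write(10, Node("blarg", baselink_rev=5)))
print(needs_copy_on_write(10, Node("blarg", baselink_rev=12)))
print(needs_copy_on_write(10, Node("blarg")))
```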
Greg Stein, http://www.lyra.org/
Received on Sat Oct 21 14:36:19 2006