Re: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2002-03-14 20:34:27 CET

Okay, Bill, I *think* I understand what you're saying, but let me try
to summarize it just to make sure:

   We can track LoD history (formerly known as CopyFrom history) by
   associating a particular copyfrom fact with a NodeID. Since in
   "NodeID.TransactionID", the NodeID stays constant for the life
   of the node, the association is constant time/space -- it only has
   to be recorded once, when the new NodeID is first generated, and
   thereafter it can be looked up via that NodeID.

   This makes it easy to tell that a particular node deep within a
   copied tree is a copy (just look for LoD history here, then walk up
   parents doing same). O(n), as you say, but no big deal, because n
   is just directory depth. Not only is this useful for "svn log" and
   other reporting mechanisms, but we can also use it to keep NodeIDs
   from varying counterintuitively, thusly: when committing a change
   to X.Q, we would normally create X.(R) (where R is some
   TransactionID greater than Q), but if there's LoD history on X.Q
   and the original also has NodeID X, then we want to reserve the X
   line for the original, so we create Y.R instead.

I'm wondering if I have this right, though, since you don't mention a
new table or anything... (?).

There was a lot I wasn't sure I understood in the mail, though.
Before I start asking questions, perhaps you can say whether the above
summary is basically right or basically wrong. :-)

Small request: If you send a clarifying email, might help to start out
with an up-front statement of goals, that is, exactly what problems
the new proposal is trying to solve.

-K

"Bill Tutt" <billtut@microsoft.com> writes:
> Currently NodeRevisions have an optional piece of information about them
> if they were the direct result of a user using "svn copy". To the user
> this creates a "branch" or to reduce terminology confusion a Line of
> Development (LoD).
>
> Additionally, to reduce the complexity that is the current NodeRevision
> PK proposal ('NodeID.TransactionID'), we'll use the traditional
> NodeID.ChangeNumber notation.
>
> i.e.: 1.1 -> 1.2 -> 1.3
>
> Say we have this simple example of repository paths that include only
> files:
> /a/b
> /a/e
>
> The path '/a/b' points to NodeRevision '1.3'.
> NodeRevision '1.3' has a version ancestry of: 1.1 -> 1.2 -> 1.3
>
> The user now does: "svn copy /a /c"
>
> This creates the "/c" NodeRevision '2.1' that was copied from "/a" at
> repository version N.
>
> The path '/c' points to NodeRevision '2.1' has a Representation 'R' that
> contains a DirectoryEntry 'b' that points to NodeRevision '1.3'.
>
> The user now adds a file '/c/d' and submits this change to the
> repository.
>
> The path '/c' now points to NodeRevision '2.2' that has a Representation
> 'R2' that contains a DirectoryEntry 'd' that points to NodeRevision
> '3.1'.
>
> Representation 'R2' still contains a DirectoryEntry 'a' that points to
> NodeRevision '1.3'.
>
> The user now runs: "svn log /c/b"
>
> While the output of "svn log" is much better than it was, it still
> doesn't identify /c/b as being a forked LoD of /a/b.
>
> So, how would be go about finding this out?
> At the moment, we'd need to execute the following algorithm:
> For each directory in the path to the file (order doesn't really matter
> here):
> If the directory has the "COPY" information on it:
> LoD = "COPY" information
> Else
> For each predecessor of this directory (from youngest to oldest):
> If PredecessorNodeID != DirectoryNodeID:
> # We've gone too far back in our ancestry information
> # We should have found something on our previous
> iteration.
> Assert(false)
> Break
> End If
> Load predecessor data from BDB
> If the directory has the "COPY" information on it:
> LoD = "COPY" information
> Break
> End If
> Next
> End If
> Next
>
> The LoD of the file is the path for the deepest recorded branched LoD.
>
> Let's see an example of how this could come to be before getting back to
> discussing how expensive the above algorithm is:
>
> Initial repository state:
> /linux/2.0: Contains everything related to the 2.0
> kernel sources.
> /linux/2.0/drivers/ata: Contains the IDE drivers for 2.0.
> /linux/2.0/drivers/billsata: Contains a branched LoD for
> /linux/2.0/drivers/ata
>
> User issues: "svn copy /linux/2.0 /linux/2.1"
>
> The files in /linux/2.1/drivers/billsata would have a LoD of:
> /linux/2.1/drivers/billsata
>
> Anyway, the branch discovery algorithm above is fairly expensive. We
> don't know ahead of time how expensive it will be. There's not a good
> way to control the expense here. We don't know how many revisions any of
> the component directories have undergone, so we don't know an optimal
> ordering to search the directory histories.
>
> I thought of two ways to make this cheaper in time:
> * When creating a new NodeRevision that has COPY information in it's
> skel just duplicate the COPY information.
> This has a pretty large space cost, so I don't like this one so much.
> * Move the COPY information into libsvn_fs set properties. The property
> would only get set on a brand new NodeRevision as the result of "svn
> copy" and would always be on NodeRevision on that LoD. This is just as
> space efficient as the current system is since property storage also
> uses delta representations. (IIRC) However, we now have an O(n)
> algorithm for a n directory deep file.
>
> Just in case you thought I was done, I'm not. :)
>
> If we pick the second option above, this has a nice benefit of making it
> much easier in the new NodeRevision PK world (NodeID.TransactionID) to
> determine when we need to use a new NodeID for a new LoD. This is nice
> to have because it provides the nice sanity preserving property of
> having differing NodeIDs for differing LoDs.
>
> If we don't pick the second option above the following situation is
> likely:
>
> Path '/a/b' points to NodeRevisionID '1.3'
> Path '/c/b' points to NodeRevisionID '1.3'
>
> The user then edits Path '/c/b' and commits the change.
> Since we don't have a cheap way to calculate LoD-ness the following
> would happen:
>
> Path '/c/b' now points to NodeRevisionID '1.4'
>
> The user then edits Path '/a/b' and commits the change.
> We'd now notice that the next ChangeNumber was already used and
> determine that we need a new NodeID. We'd then get this data:
>
> Path '/a/b' points to NodeRevisionID '2.1' with an ancestry of: '1.1 ->
> 1.2 -> 1.3 -> 2.0'
> Path '/c/b' points to NodeRevisionID '1.4' with an ancestry of: '1.1 ->
> 1.2 -> 1.3 -> 1.4'
>
> I don't know about you, but that drives me batty. :)
>
> Preserving the LoDness of a NodeID will also make answering the
> following question easier:
>
> "Given file /a/b/c what are all of the other LoDs of this file?"
>
> The answer to this question presents an exhaustive list of LoD merge
> sources btw.
>
> Phew... I have a feeling I'll need to clarify parts of this long email.
> Feel free to complain if I didn't communicate well enough.
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Mar 14 20:24:47 2002

This message: [ Message body ]
Next message: Greg Stein: "Re: svn commit: rev 1511 - trunk/subversion/include trunk/subversion/libsvn_ra_local trunk/subversion/libsvn_client trunk/subversion/libsvn_ra_dav"
Previous message: Zack Weinberg: "Re: subversion in the news"
In reply to: Bill Tutt: "More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"
Next in thread: Bill Tutt: "RE: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]