More fs fun: COPY sources and detecting LinesOfDevelopment (branches)

From: Bill Tutt <billtut_at_microsoft.com>
Date: 2002-03-14 09:24:57 CET

Currently NodeRevisions have an optional piece of information about them
if they were the direct result of a user using "svn copy". To the user
this creates a "branch" or to reduce terminology confusion a Line of
Development (LoD).

Additionally, to reduce the complexity that is the current NodeRevision
PK proposal ('NodeID.TransactionID'), we'll use the traditional
NodeID.ChangeNumber notation.

i.e.: 1.1 -> 1.2 -> 1.3

Say we have this simple example of repository paths that include only
files:
/a/b
/a/e

The path '/a/b' points to NodeRevision '1.3'.
NodeRevision '1.3' has a version ancestry of: 1.1 -> 1.2 -> 1.3

The user now does: "svn copy /a /c"

This creates the "/c" NodeRevision '2.1' that was copied from "/a" at
repository version N.

The path '/c' points to NodeRevision '2.1' has a Representation 'R' that
contains a DirectoryEntry 'b' that points to NodeRevision '1.3'.

The user now adds a file '/c/d' and submits this change to the
repository.

The path '/c' now points to NodeRevision '2.2' that has a Representation
'R2' that contains a DirectoryEntry 'd' that points to NodeRevision
'3.1'.

Representation 'R2' still contains a DirectoryEntry 'a' that points to
NodeRevision '1.3'.

The user now runs: "svn log /c/b"

While the output of "svn log" is much better than it was, it still
doesn't identify /c/b as being a forked LoD of /a/b.

So, how would be go about finding this out?
At the moment, we'd need to execute the following algorithm:
For each directory in the path to the file (order doesn't really matter
here):
    If the directory has the "COPY" information on it:
       LoD = "COPY" information
    Else
       For each predecessor of this directory (from youngest to oldest):
          If PredecessorNodeID != DirectoryNodeID:
             # We've gone too far back in our ancestry information
                 # We should have found something on our previous
iteration.
               Assert(false)
             Break
          End If
            Load predecessor data from BDB
          If the directory has the "COPY" information on it:
             LoD = "COPY" information
             Break
          End If
       Next
    End If
Next

The LoD of the file is the path for the deepest recorded branched LoD.

Let's see an example of how this could come to be before getting back to
discussing how expensive the above algorithm is:

Initial repository state:
/linux/2.0: Contains everything related to the 2.0
kernel sources.
/linux/2.0/drivers/ata: Contains the IDE drivers for 2.0.
/linux/2.0/drivers/billsata: Contains a branched LoD for
/linux/2.0/drivers/ata

User issues: "svn copy /linux/2.0 /linux/2.1"

The files in /linux/2.1/drivers/billsata would have a LoD of:
/linux/2.1/drivers/billsata

Anyway, the branch discovery algorithm above is fairly expensive. We
don't know ahead of time how expensive it will be. There's not a good
way to control the expense here. We don't know how many revisions any of
the component directories have undergone, so we don't know an optimal
ordering to search the directory histories.

I thought of two ways to make this cheaper in time:
* When creating a new NodeRevision that has COPY information in it's
skel just duplicate the COPY information.
This has a pretty large space cost, so I don't like this one so much.
* Move the COPY information into libsvn_fs set properties. The property
would only get set on a brand new NodeRevision as the result of "svn
copy" and would always be on NodeRevision on that LoD. This is just as
space efficient as the current system is since property storage also
uses delta representations. (IIRC) However, we now have an O(n)
algorithm for a n directory deep file.

Just in case you thought I was done, I'm not. :)

If we pick the second option above, this has a nice benefit of making it
much easier in the new NodeRevision PK world (NodeID.TransactionID) to
determine when we need to use a new NodeID for a new LoD. This is nice
to have because it provides the nice sanity preserving property of
having differing NodeIDs for differing LoDs.

If we don't pick the second option above the following situation is
likely:

Path '/a/b' points to NodeRevisionID '1.3'
Path '/c/b' points to NodeRevisionID '1.3'

The user then edits Path '/c/b' and commits the change.
Since we don't have a cheap way to calculate LoD-ness the following
would happen:

Path '/c/b' now points to NodeRevisionID '1.4'

The user then edits Path '/a/b' and commits the change.
We'd now notice that the next ChangeNumber was already used and
determine that we need a new NodeID. We'd then get this data:

Path '/a/b' points to NodeRevisionID '2.1' with an ancestry of: '1.1 ->
1.2 -> 1.3 -> 2.0'
Path '/c/b' points to NodeRevisionID '1.4' with an ancestry of: '1.1 ->
1.2 -> 1.3 -> 1.4'

I don't know about you, but that drives me batty. :)

Preserving the LoDness of a NodeID will also make answering the
following question easier:

"Given file /a/b/c what are all of the other LoDs of this file?"

The answer to this question presents an exhaustive list of LoD merge
sources btw.

Phew... I have a feeling I'll need to clarify parts of this long email.
Feel free to complain if I didn't communicate well enough.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Mar 14 09:25:35 2002

This message: [ Message body ]
Next message: Luc Saillard: "[PATCH] correct some memory problem"
Previous message: Zack Weinberg: "Re: subversion in the news"
Next in thread: Karl Fogel: "Re: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"
Reply: Karl Fogel: "Re: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"
Maybe reply: Bill Tutt: "RE: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]