RE: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)

From: Bill Tutt <billtut_at_microsoft.com>
Date: 2002-03-14 20:48:20 CET

> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
>
> Okay, Bill, I *think* I understand what you're saying, but let me try
> to summarize it just to make sure:
>
> We can track LoD history (formerly known as CopyFrom history) by
> associating a particular copyfrom fact with a NodeID. Since in
> "NodeID.TransactionID", the NodeID stays constant for the life
> of the node, the association is constant time/space -- it only has
> to be recorded once, when the new NodeID is first generated, and
> thereafter it can be looked up via that NodeID.
>
> This makes it easy to tell that a particular node deep within a
> copied tree is a copy (just look for LoD history here, then walk up
> parents doing same). O(n), as you say, but no big deal, because n
> is just directory depth. Not only is this useful for "svn log" and
> other reporting mechanisms, but we can also use it to keep NodeIDs
> from varying counterintuitively, thusly: when committing a change
> to X.Q, we would normally create X.(R) (where R is some
> TransactionID greater than Q), but if there's LoD history on X.Q
> and the original also has NodeID X, then we want to reserve the X
> line for the original, so we create Y.R instead.
>
> I'm wondering if I have this right, though, since you don't mention a
> new table or anything... (?).
>
> There was a lot I wasn't sure I understood in the mail, though.
> Before I start asking questions, perhaps you can say whether the above
> summary is basically right or basically wrong. :-)
>

Yes, that's the general gist of the email. I didn't mention a new table
because I listed alternate ways of achieving the same goal without
adding a new table. (Given that more tables in BDB tend to make life
slightly more annoying.) The only non-table way that's still space
efficient is to move the LoD history storage into the NodeRevision's
property list Representation. This is only space efficient if properties
are always stored using reverse deltas, since LoD never changes for a
given NodeID.

There's no particular reason we couldn't add a new table though. I've
just been defaulting to trying to find solutions that don't require new
tables if at all possible.

> Small request: If you send a clarifying email, might help to start out
> with an up-front statement of goals, that is, exactly what problems
> the new proposal is trying to solve.
>

The major problem is that LoD tracking is basically too expensive given
the current plan of record for the NodeRevision table related changes.
The still remaining secondary problem is that NodeRevision ancestry
information isn't also stored in a normalized fashion in a separate
table, but I don't expect that to be solved for 1.0. It's strictly a
nice to have, not a must have.

I do apologize for how weird some of the example data sentences look,
but it falls out from a formalized view of the data model in which each
of these sentences is constructible from. I also wanted to go into
intimate detail explaining the problem so that I made sure I didn't
oversimplify my reasoning process.

I think it might have also been easier to grok if I had formatted the
data sentences, and the intervening actions better. Silly plain text
email... :)

I'll certainly remember to include the problems I'm trying to solve in
any further long explanations of mine.

Thanks for writing my abstract for me,
Bill

> -K
>
> "Bill Tutt" <billtut@microsoft.com> writes:
> > Currently NodeRevisions have an optional piece of information about
them
> > if they were the direct result of a user using "svn copy". To the
user
> > this creates a "branch" or to reduce terminology confusion a Line of
> > Development (LoD).
> >
> > Additionally, to reduce the complexity that is the current
NodeRevision
> > PK proposal ('NodeID.TransactionID'), we'll use the traditional
> > NodeID.ChangeNumber notation.
> >
> > i.e.: 1.1 -> 1.2 -> 1.3
> >
> > Say we have this simple example of repository paths that include
only
> > files:
> > /a/b
> > /a/e
> >
> > The path '/a/b' points to NodeRevision '1.3'.
> > NodeRevision '1.3' has a version ancestry of: 1.1 -> 1.2 -> 1.3
> >
> > The user now does: "svn copy /a /c"
> >
> > This creates the "/c" NodeRevision '2.1' that was copied from "/a"
at
> > repository version N.
> >
> > The path '/c' points to NodeRevision '2.1' has a Representation 'R'
that
> > contains a DirectoryEntry 'b' that points to NodeRevision '1.3'.
> >
> > The user now adds a file '/c/d' and submits this change to the
> > repository.
> >
> > The path '/c' now points to NodeRevision '2.2' that has a
Representation
> > 'R2' that contains a DirectoryEntry 'd' that points to NodeRevision
> > '3.1'.
> >
> > Representation 'R2' still contains a DirectoryEntry 'a' that points
to
> > NodeRevision '1.3'.
> >
> > The user now runs: "svn log /c/b"
> >
> > While the output of "svn log" is much better than it was, it still
> > doesn't identify /c/b as being a forked LoD of /a/b.
> >
> > So, how would be go about finding this out?
> > At the moment, we'd need to execute the following algorithm:
> > For each directory in the path to the file (order doesn't really
matter
> > here):
> > If the directory has the "COPY" information on it:
> > LoD = "COPY" information
> > Else
> > For each predecessor of this directory (from youngest to
oldest):
> > If PredecessorNodeID != DirectoryNodeID:
> > # We've gone too far back in our ancestry information
> > # We should have found something on our previous
> > iteration.
> > Assert(false)
> > Break
> > End If
> > Load predecessor data from BDB
> > If the directory has the "COPY" information on it:
> > LoD = "COPY" information
> > Break
> > End If
> > Next
> > End If
> > Next
> >
> > The LoD of the file is the path for the deepest recorded branched
LoD.
> >
> > Let's see an example of how this could come to be before getting
back to
> > discussing how expensive the above algorithm is:
> >
> > Initial repository state:
> > /linux/2.0: Contains everything related to the 2.0
> > kernel sources.
> > /linux/2.0/drivers/ata: Contains the IDE drivers for 2.0.
> > /linux/2.0/drivers/billsata: Contains a branched LoD for
> > /linux/2.0/drivers/ata
> >
> > User issues: "svn copy /linux/2.0 /linux/2.1"
> >
> > The files in /linux/2.1/drivers/billsata would have a LoD of:
> > /linux/2.1/drivers/billsata
> >
> > Anyway, the branch discovery algorithm above is fairly expensive. We
> > don't know ahead of time how expensive it will be. There's not a
good
> > way to control the expense here. We don't know how many revisions
any of
> > the component directories have undergone, so we don't know an
optimal
> > ordering to search the directory histories.
> >
> > I thought of two ways to make this cheaper in time:
> > * When creating a new NodeRevision that has COPY information in it's
> > skel just duplicate the COPY information.
> > This has a pretty large space cost, so I don't like this one so
much.
> > * Move the COPY information into libsvn_fs set properties. The
property
> > would only get set on a brand new NodeRevision as the result of "svn
> > copy" and would always be on NodeRevision on that LoD. This is just
as
> > space efficient as the current system is since property storage also
> > uses delta representations. (IIRC) However, we now have an O(n)
> > algorithm for a n directory deep file.
> >
> > Just in case you thought I was done, I'm not. :)
> >
> > If we pick the second option above, this has a nice benefit of
making it
> > much easier in the new NodeRevision PK world (NodeID.TransactionID)
to
> > determine when we need to use a new NodeID for a new LoD. This is
nice
> > to have because it provides the nice sanity preserving property of
> > having differing NodeIDs for differing LoDs.
> >
> > If we don't pick the second option above the following situation is
> > likely:
> >
> > Path '/a/b' points to NodeRevisionID '1.3'
> > Path '/c/b' points to NodeRevisionID '1.3'
> >
> > The user then edits Path '/c/b' and commits the change.
> > Since we don't have a cheap way to calculate LoD-ness the following
> > would happen:
> >
> > Path '/c/b' now points to NodeRevisionID '1.4'
> >
> > The user then edits Path '/a/b' and commits the change.
> > We'd now notice that the next ChangeNumber was already used and
> > determine that we need a new NodeID. We'd then get this data:
> >
> > Path '/a/b' points to NodeRevisionID '2.1' with an ancestry of: '1.1
->
> > 1.2 -> 1.3 -> 2.0'
> > Path '/c/b' points to NodeRevisionID '1.4' with an ancestry of: '1.1
->
> > 1.2 -> 1.3 -> 1.4'
> >
> > I don't know about you, but that drives me batty. :)
> >
> > Preserving the LoDness of a NodeID will also make answering the
> > following question easier:
> >
> > "Given file /a/b/c what are all of the other LoDs of this file?"
> >
> > The answer to this question presents an exhaustive list of LoD merge
> > sources btw.
> >
> > Phew... I have a feeling I'll need to clarify parts of this long
email.
> > Feel free to complain if I didn't communicate well enough.
> >
> > Bill
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> > For additional commands, e-mail: dev-help@subversion.tigris.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Mar 14 21:10:35 2002

This message: [ Message body ]
Next message: Blair Zajac: "basic_tests.py fails"
Previous message: Greg Stein: "Re: [PATCH] correct some memory problem"
Maybe in reply to: Bill Tutt: "More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"
Next in thread: Bill Tutt: "RE: RE: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"
Reply: Bill Tutt: "RE: RE: More fs fun: COPY sources and detecting LinesOfDevelopment (branches)"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]