RE: Maintaining NodeID sanity

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-05-08 02:59:38 CEST

I'm sorry for it being a bit dense; I wanted to get the idea out and
about and didn't have enough time to properly thrash the idea into a
power point slide deck, or some Visio diagrams. This unfortunately would
have helped a bunch given the bizarre DAGgy-ness that abounds here.

I'm afraid I'm not on the issues list atm, but I'll be adding myself to
654 shortly, but I'll paste these comments on your points in Issuezilla
as well.

Efficiently accessing ancestor lists:
Karl's concerned with the change from NodeRevisionID to (NodeID,
ChangeSetID) because it removes the ancestry information from the row.

Bill's reply:
This is true. I think you still want to store the entire predecessor
list inline using some kind of (hopefully server side merge history
recording friendly) string format. Here's an example format that allows
server side merge history recording.
E.g: in a Python like way:
# Ancestor history list:
  [
# Initial revision NodeChangePK info:
    [(NodeID1, ChangeSet1)],
# Incremental change of this Node:
    [(NodeID1, ChangeSet2)],
# A merge from a different file in the repository (again, just the
# NodeChangePK facts here):
    [(NodeID2, ChangeSet2)],
# A star merge from two separate files in the repository:
    [(NodeID2, ChangeSet3), (NodeID3, ChangeSet4)]
  ]

Just because we're moving the data out of the PK doesn't mean we can't
keep the data around in the row. We just don't want to use it in the PK.
:)

Put another way: Blob's are bad in index keys (i.e. Primary key), but
are perfectly ok as columns in your table.

For BDB you could always create a separate NodeChangeAnecstorBlob table
if it's deemed that it makes more perf sense that way.

Karl was wondering about whether it's important or not for ChangeSet's
to be incrementally ordered on a NodeChange's versioning timeline. i.e.
point in time 1 could have a ChangeSetID that's larger than the
NodeChange of the Node at point in time 2.

Bill's reply:
Yes this is possible to have, but it doesn't cause us any problems,
since an ordering still exists. You just have to join to the Repository
Revision Number -> ChangeSetID data. It imposes a little extra overhead,
but in general shouldn't be that nasty. Especially since repository
point in times -> ChangeSetID mappings are completely non-alterable,
ever... (i.e. they're easily cacheable)

Bill

----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.
 
> -----Original Message-----
> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
> Sent: Tuesday, May 07, 2002 5:36 PM
> To: Bill Tutt
> Cc: Subversion Dev list
> Subject: Re: Maintaining NodeID sanity
> 
> I'm printing this out and reading it at home right now, Bill (it's,
> uh, a bit dense, and so am I :-) ).  I'm curious if you've been
> watching my recent annotations to
> 
>   http://subversion.tigris.org/issues/show_bug.cgi?id=654
> 
> They describe some concerns I have about the new NodeID scheme.  Not
> showstoppers yet, afaict, but I'd like to know what you (and Brane,
> and everyone else) think of them...
> 
> Thanks,
> -K
> 
> "Bill Tutt" <billtut@microsoft.com> writes:
> > Branko came up with a decent idea to restore NodeID sanity to the
> > universe in the face of the weird side-effects of O(1) copies.
> >
> > The thinking goes something like this:
> >
> > We want to keep the NodeID the same for branches, but different for
> > copies.
> >
> > We want to only change the NodeID on the destination of the copy.
> >
> > A copy is really almost a branch. Let's think about just branches
for
> > the moment.
> >
> > So, we need a way to distinguish a NodeChange besides just the
normal
> > (NodeID, ChangeSetID) data.
> >
> > Let's add a BranchID to the PK of a NodeChange so that you have
(NodeID,
> > BranchID, ChangeSetID) as the PK of NodeChange.
> >
> > So when you "branch <src> <dest>" the result of the branch is a new
> > NodeChange that looks like this:
> > (<src NodeID>, <some opaque BranchID type>, <branch operation's
> > ChangeSetID>)
> >
> > Now all we need is to know exactly what a BranchID is.
> >
> > Branko's idea was to define a BranchID as:
> > NodeChangePKOf(parentOf(<dest>)).
> >
> > The problem with that is that the PK contains BranchID which is now
> > recursively defined. That's good for nested branches, but bad for a
> > schema.
> >
> > One alternative is to add a unique column to NodeChange:
NodeChangeID.
> > NodeChangeID is a monotonically increasing integer (possibly a
> > fixed-precision decimal considering that NodeChangeIDs will be used
much
> > more quickly than NodeIDs) that is unique for each NodeChange.
> >
> > NodeChangeID is not an additional part of the PK. It is an AK
> > (alternative key).
> >
> > BranchID can now be defined to be:  NodeChangeID of <dest>.
> >
> > When the user creates a branch of path "/foo" to "/bar" the
following
> > can now happen:
> > Create a new NodeChange of ("/foo".NodeID, BranchID ==
> > this.NodeChangeID, ChangeSetID) and associate it with "/bar". This
> > doesn't set any of the copy related properties, and only changes the
> > BranchID.
> >
> > When you are committing new changes to things under /bar (or
revision
> > bumps of /bar), any new nodes will have the BranchID set to "/bar''s
> > BranchID.
> >
> > The problem now becomes what I do when I have one of these
complicated
> > nested branch cases.
> >
> > e.g.:
> >
> > /foo/main/A
> > /foo/alt/A
> >
> > where /foo/alt is a branch of /foo/main.
> >
> > I then: "branch /foo /bar"
> > I then make a modification to "/bar/alt/A".
> >
> > What BranchID does A's NodeChange row now have?
> > If we lazily flatten sub-branches then it could be "/bar"'s
BranchID.
> > If we lazily create sub-branches then it could be "/bar/alt/"'s
> > BranchID.
> >
> > Flattening isn't terribly friendly to our users, so let's think
about
> > what we have to do if we lazily create our sub-branches.
> >
> > When examining /bar/alt we notice that the existing BranchID !=
/bar's
> > BranchID.  This may mean any number of things:
> > * It might be a copy, since we haven't told you how to differentiate
> > between a copy and a branch. If it existed inside of a copy, the
> > BranchID is unimportant since if it needs to change to be /bar's
> > BranchID.
> >
> > * It might be a branch, and if so, we need to create a new
NodeChange
> > for alt, containing a new BranchID. The origin of this branch is
> > recorded as being /foo/alt.
> > Once we have this new BranchID, we can then use it while creating
the
> > new NodeChange for /bar/alt/A.
> >
> > But, how do we detect that /bar/alt is a branch instead of a copy?
(i.e.
> > it might be a copy of a branch)
> >
> > Alt.BranchID points to /foo/alt's data. (The location of the
original
> > branch.) When I look that data up, I'll notice that the copy
information
> > is absent. This tells me I have a branch. If I had a copy, the copy
> > properties would be filled in.
> >
> > The only additional annoyance is that copy/branch destination nodes
> > don't record the path that they were created at. Given the DAGgy
nature
> > of our data store, it would certainly help if they included what
their
> > creation path was.
> >
> > i.e.: /bar was created at path /bar.
> >
> > To sum up:
> >   Add a BranchID to the PK of NodeChange.
> >   Add a NodeChangeID as an AK of NodeChange. (i.e. monotonically
> > increasing integer that uniquely, but not usefully defines a
NodeChange
> > row)
> >   Use the NodeChangeID as the BranchID when creating
copies/branches.
> >   Add a CreatedPath to NodeChange that is only used for copy/branch
> > operations.
> >
> > Alternatively, you could get away without having NodeChangeID since
it
> > requires a separate BDB table to repeat the PK of NodeChange to
point to
> > the correct NodeChange row.
> >
> > If we do the above we get the following benefits:
> > * NodeIDs stay the same for branches, but change for copies.
> > * NodeIDs are attached to the path that they're created in. No more
> > non-deterministic migrating of NodeID usages depending on which side
of
> > a copy got modified first.
> > * We get the easy possibility of handling O(1) branches,
sub-branches,
> > and sub-sub-branches, etc....
> > * Computing the next NodeID to use for a NodeChange alteration is
much
> > quicker.
> > (It's built up as you traverse the path space looking for
NodeChange's
> > to create new revisions of.)
> > * svn log for branched items includes all of the history of the node
> > before the branch. Since the inline history revision data is
preserved.
> > * The addition of created path to copy/branch points reduces poor
svn
> > log's workload when told to figure out what it needs to do, but
can't
> > easily output where a nested branch/copy came from.
> >
> > Sounds like utter goodness to me. :)
> >
> > Bill
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> > For additional commands, e-mail: dev-help@subversion.tigris.org
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Wed May 8 03:00:34 2002

This message: [ Message body ]
Next message: Karl Fogel: "Re: Maintaining NodeID sanity"
Previous message: Glenn A. Thompson: "Re: Maintaining NodeID sanity"
In reply to: Karl Fogel: "Re: Maintaining NodeID sanity"
Next in thread: Karl Fogel: "Re: Maintaining NodeID sanity"
Reply: Karl Fogel: "Re: Maintaining NodeID sanity"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]