RE: Keeping NodeRevisionIDs from being unbounded

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-03-08 19:54:14 CET

> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
>
> Wow, Bill, great thinking! I'm sold, except for some minor comments,
> see below:
>

It's amazing how an anal modeling tool helps you become super focused on
what you're really trying to do. :)

> "Bill Tutt" <rassilon@lyra.org> writes:
> > NodeChange is identified by (NodeID, TransactionID)
> > (NodeID, TransactionID) has a set of predecessors (NodeID,
> > TransactionID)
> > Ok, this handles our notion of ancestry, and a notion of being able
to
> > handle pending private trees. Now we need to correctly order
> > NodeChanges.
> >
> > The transaction identified by TransactionID was commited as
> > RepositoryVersionNumber.
> >
> > Node has a sequence of changes ordered by (NodeID,
> > RepositoryVersionNumber)
> > This implies that finding a Node's predecessor in the sequence of
> > NodeChanges does become slightly more expensive. But it does
accomplish
> > a couple of pretty cool things:
> >
> > * We only have to insert one row associating the TransactionID with
the
> > RepositoryVersionNumber during svn_fs_commit_txn. As opposed to
setting
> > the SOURCE-REV to RepositoryVersionNumber during stabalization.
>
> Actually, I doubt that this stabilization walk is a great expense.
> Currently, we already have to walk all those nodes to mark them as
> immutable; but even more to the point, the expense of doing a commit
> lies not in the final stabilizations, but in building the data into
> the txn in the first place. Getting rid of the O(N) stabilization
> walk won't help much in practice -- N is just the number of changed
> nodes, and we're do other O(N) things whose constant factor is much
> higher anyway.
>
> Given that, I actually think we should keep the final stabilization
> walk and use it to set each NodeChange to
>
> (NodeID, RepositoryVersionNumber)
>
> that is, just replace all the TransactionIDs with the new revision
> number. Sure, this is no more "theoretically" correct than using the
> TransactionID, but it's more comprehensible to anyone who has to debug
> the filesystem. It's natural for the "Change" component of
> "NodeChange" to be, well, the revision in which the change happened.
> [I understand that there is no need in any case to permanently record
> a mapping from one to the other -- just examine the Changed portion of
> the revision's root NodeChanged :-) ].
>
> So in as-yet-uncommitted NodeChanges, maybe the Change portion should
> be the TransactionID, and part of stabilization would be to change
> those to the new revision number?
>
> But: in discussion with Greg Stein, Mike, and Ben just now, we
> discovered that we won't need the stabilization walk anymore, because
> we won't need explicit mutability flags. Instead, we can just compare
> our txn id with the one on the node in question -- if they're the
> same, we should consider this mutable, otherwise, the node is to be
> treated as immutable (either it's committed, or it's mutable only for
> some *other* txn). Heh!
>
> Hmm, given that, I'm not sure that walking to convert TxnID to RevNum
> is worthwhile, but would like your thoughts on this...
>

I would tend to agree. The only possible benefit of renumbering the
Change portion would be to make the Change portion an increasing integer
ID specific to the NodeID. The only useful benefit to that would be so
that commit emails could look like this: (if you didn't want to send
diffs)

Files changed:
NodepathA#2
NodepathB#3
NodepathC#4

where the portion after # is the Change portion.

While cute looking as a UI, I don't think this benefit is worth the perf
cost.

> > * The svn log problem becomes much simpler
>
> Can you describe this in more detail?
I think this is just identical to pointing out we now have ChangeSets.

>
> > * NodeChanges no longer have an unbounded PK which makes
transitioning
> > to SQL later much easier.
>
> Amen -- huge win. :-)
>
> > * I don't know whether you noticed in the above discussion that
these
> > changes would allow you to rename Transaction to Changeset, and that
now
> > finding the list of files modified in a given
RepositoryVersionNumber
> > bump is much easier.
>
> :-) Another huge win.
>
> > So, how badly does this affect the current predecessor and ancestor
> > finding algorithms?
> > * It really doesn't necessarily alter the cost of common ancestor
> > finding.
> > NodeChange's set of predecessors (in terms of just the direct line
based
> > on NodeID) could be determined by a string blob of successive
> > TransactionIDs. (e.g. 1.2.3.4.5.6 would mean that this Node changed
in
> > Transactions 1-6) This takes care of the common immediate
predecessor
> > set has a size of 1 case. (i.e. not from a merge) Merges still could
be
> > recorded as they are now as additional chunks of data on a
NodeChange.
> > (Note: svn currently needs to fix more spots in the svn fs code to
use
> > svn_fs_check_related for common ancestor checking)
>
> One cost that does change is finding the immediate predecessor of a
> given node revision. Currently, it is usually constant time and
> requires no DB access -- you can do it *entirely* by looking at the
> NodeID, and when you can't, you can tell that by inspection instantly.
>
> In the new system (whether we use the TransactionID or RevisionID),
> we'll have to record ancestry explicitly. Frankly, I think this might
> be an advantage -- it opens the door to storing multiple ancestry
> (i.e., merge information) in a natural way.
>

Not necessarily. Just because we're changing our PK doesn't mean we have
to change our ancestry storage in any substantial way. (i.e. create a
separate table) We just add a new part to the NodeChange skel that is a
NodeChangeAncestry string. If we just wanted to preserve our current
Node only specific line of ancestry then we could just use a string
like: "ChangeID1.ChangeID2.ChangeID3.ChangeID4.ChangeID5" with the
implied "NodeID." prefix on all of the ChangeIDs in the string. If we
want to represent merge history in the ancestory string then we need to
get more complex. Something like the following might be necessary:
NodeChangeAncestry: List of Predecessors
Predecessor: Set of (NodeID, ChangeID) tuples.

Having Predecessor include the NodeID handles the merge case. This would
give us the following NodeChangeAncestry examples (in Python list
syntax, but since this is a string, elide the NodeID if it should be
implied from the current row):
A normal change in some file:
    [ [(, ChangeID] ]
Two successive normal changes in some file:
    [ [(, ChangeID1)], [(, ChangeID2)] ]
A file resulting from a branch (aka a copy):
    [ [(NodeID, ChangeID)] ]
A file resulting from 2 checkins, and then a merge from branch A:
    [ [(,ChangeID1)], [(,ChangeID2)], [(NodeIDA, ChangeIDA)] ]
Note: This state implies that ChangeID2 was part of a merge operation.
A file resulting from 2 checkins, and then a more than one way merge
from branch A and branch B:
    [ [(,ChangeID1)], [(,ChangeID2)],
      [(NodeIDA, ChangeIDA), (NodeIDB, ChangeIDB] ]

Having an unbounded blob as something other than the PK is perfectly
fine, so there isn't anything wrong with this idea from a SQL store
perspective.

However, there are cases where it might be nice to have ancestry in a
separate table. How else would you efficiently answer the question: What
are all the branches of my NodeID? (i.e. what branches are derived from
my NodeID, what branches am I derived from, and recursively find the
branch trees found from each of those above questions. This query path
is necessary in order to provide a "What branch do you want to merge
from?" dialog.

However, that's just a wish not a must have. I do know that VSS has one
of these dialogs as a point of reference though.

> > So to sum up, we gain Changeset tracking, reduce the cost of
committing
> > a SVN Transaction, preserve our current cost of ancestor tracking,
and
> > make the SVN data model more amenable to being represented by a SQL
> > Store.
>
> Me like. :-)

Indeed, things are even possibly cooler than that. Since we now have
pending Chagesets (s/Transaction/Changeset), you might get the idea that
dealing with distributed repositories might be started off by simply
expanding the NodeChange PK to (ReposID, NodeID, ChangeSetID) where
ReposID is a GUID. You can quickly invision how you could write a svn
pull which would create a pending changeset, etc... You'd also want to
change the ChangeSet PK to: (ReposID, ChangesetID) etc...

But that's a lot more complicated, and maybe Zack can run with that
train of thought.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri Mar 8 19:54:55 2002

This message: [ Message body ]
Next message: Bill Tutt: "RE: RE: Keeping NodeRevisionIDs from being unbounded"
Previous message: Karl Fogel: "Re: svn commit: rev 1464 - trunk"
Maybe in reply to: Bill Tutt: "Keeping NodeRevisionIDs from being unbounded"
Next in thread: Bill Tutt: "RE: RE: Keeping NodeRevisionIDs from being unbounded"
Reply: Bill Tutt: "RE: RE: Keeping NodeRevisionIDs from being unbounded"
Reply: Karl Fogel: "Re: Keeping NodeRevisionIDs from being unbounded"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]