RE: Re: Maintaining NodeID sanity

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-05-08 07:00:52 CEST

> From: Karl Fogel [mailto:kfogel@newton.ch.collab.net]
>
> "Bill Tutt" <rassilon@lyra.org> writes:
> > Efficiently accessing ancestor lists:
> > Karl's concerned with the change from NodeRevisionID to (NodeID,
> > ChangeSetID) because it removes the ancestry information from the
row.
> >
> > Bill's reply:
> > This is true. I think you still want to store the entire predecessor
> > list inline using some kind of (hopefully server side merge history
> > recording friendly) string format. Here's an example format that
allows
> > server side merge history recording.
> > E.g: in a Python like way:
> > # Ancestor history list:
> > [
> > # Initial revision NodeChangePK info:
> > [(NodeID1, ChangeSet1)],
> > # Incremental change of this Node:
> > [(NodeID1, ChangeSet2)],
> > # A merge from a different file in the repository (again, just the
> > # NodeChangePK facts here):
> > [(NodeID2, ChangeSet2)],
> > # A star merge from two separate files in the repository:
> > [(NodeID2, ChangeSet3), (NodeID3, ChangeSet4)]
> > ]
> >
> > Just because we're moving the data out of the PK doesn't mean we
can't
> > keep the data around in the row. We just don't want to use it in the
PK.
> > :)
>
> Sure... But the common case so far has been successive single-ancestor
> changes to each node (where the single "ancestor" is the predecessor,
> of course), and we've been able to record that in a highly compressed
> way by using place-value notation: When you see that your ID is
> `X.some_integer', you know to count nodes X.1 through
> X.(some_integer-1) as ancestors. This trick doesn't work when the
> node id is `X.unique_txn_id'.
>
> So my problem isn't that the information is gone from the primary key;
> we can still store it in the row, that's fine. The question is,
> *what* are we storing in the row? If the representation of the
> ancestry line grows linearly with the ancestry line itself, I submit
> that we have a problem. Subversion is supposed to scale :-), not just
> in data size, but number of changes to a given node.
>
> > Put another way: Blob's are bad in index keys (i.e. Primary key),
but
> > are perfectly ok as columns in your table.
> >
> > For BDB you could always create a separate NodeChangeAnecstorBlob
table
> > if it's deemed that it makes more perf sense that way.
>
> This is the assertion I need to be comforted about... :-)
>
> Blobs are okay unless you're searching in them often for information.
> If we have to constantly load and/or scan a whole blob to find out
> whether node FOO is an ancestor of BAR or not, that seems like a bit
> of downgrade from the current system.
>
> Am I worrying about nothing? Look in the filesystem at where we do
> ancestry and predecessor checks... I just don't see how it can
> possibly be efficient enough to use this ancestry mechanism :-(.
>
Most of these answers rely on BranchID being implemented since they
restore NodeID to sanity and lets me make the following comments:

Let's define a handy function in (NodeID, BranchID, ChangeSetID) land
because we'll be seeing it a lot:
AreNodeChangesTimelesslyRelated(Source, Target)
{
return (Source.NodeID == Target.NodeID &&
Source.BranchID == Target.BranchID);
}

Ok, let's check our ancestry examination functions out:
svn_fs__id_is_parent: only svn_fs_txn_path_is_id uses this function.
Returns true if A is a parent of B. i.e. if A == B then false.

svn_fs__id_is_parent is only called by:

svn_fs_txn_path_is_id: mod_dav_svn uses this function. I don't know why
mod_dav_svn uses this function. I don't quite understand the WebDAV
context this is occurring in, so I can't analyze this one very well. I'm
guessing WebDAV is trying to figure out if a directory is the correct
version or not, and it only wants to succeed if it cheaply appears to be
auto-mergable. So I'd guess a simple check with
AreNodeChangesTimelesslyRelated should do the trick.

svn_fs__id_is_ancestor: The only caller of this is directory merging.
Returns true if A == B or A is parent of B

Regarding the directory merging call site:
Both of the cases below revolve around whether directory auto-merging is
performed only by a commit or not.

Given the current schema proposal, I'd argue that merge() needs to know
if it's participating in a commit or not, so that the logic can optimize
on that fact and not examine ancestor history any more than necessary.

In fact both of these listed cases could use
AreNodeChangesTimelesslyRelated if merge() knew that a commit was in
progress.

I thought about suggesting assuming that merge() always gets called
during a commit, but that'd be incorrect if we suddenly allowed
long-term pending transactions that could be extracted into working copy
for various reasons. i.e. either inter-repository merges, or other
various odd reasons like being able to save your in-progress work tree
into a pending-ChangeSet.

Case 1: Used to verify that Source is ordered after Target which is
ordered after Ancestor in a (NodeID, BranchID, ChangeSetID) sense after
the schema change.

During a commit, this is guaranteed by our caller, relaxing to
AreNodeChangesTimelesslyRelated is ok here.

Case 2: Used to detect unrelated-ness between Source and Transaction
We know that S != T because we just got done checking that.

During a commit, S can't be a successor of T because T hasn't been
committed yet, so we could optimize the ancestor check out.

svn_fs_id_distance is called by the following functions:

svn_fs_check_related: Uses it to prove that they're related via ancestry
somehow. (including via copy history)

The new schema offers check_related some more attempts for optimization
before it has to call id_distance.

However, check_related isn't called atm!

tree.c:merge(): Directory merging
Source == proposed HEAD NodeChange for directory
Target == current HEAD NodeChange for directory
Ancestor == common parent NodeChange for directory
Case 1: Used to re-id a directory node if necessary due to trying to
auto-resolve directory versioning, and trying to be nice about
generating complete auto-merge history for directories.

I don't think this makes any sense any more given the new NodeChange PK.
A successor is simply a new (NodeID, BranchID, ChangeSetID) tuple that's
just later in time. It's not even like the auto-merging delta details
operation are recorded separately, the original version of the
NodeChange is deleted.
This made more sense with the old NodeRevisionIDs.

Case 2: Target recreated something that was deleted in Source.
This is essentially an out-of-dateness check. If you're going to delete
something in Source that was updated in Target, then you should have a
more up to date directory then the one you tried to commit with. Update
and retry your svn commit.

Case 2 is an annoying race condition, but it's very important to double
check it. Fortunately, this code just needs to know if
AreNodeChangesTimelesslyRelated. (i.e. they differ only in ChangeSetID)

The reason for this is that time order between Target and Source never
applies for this case. Source doesn't exist to compare against, and
Ancestor is always <= Target by definition.

Libsvn_repos/delta.c:
repos_dir_delta():
Only called from mod_dav_svn, svnlook, libsvn_repos/log.c, and
libsvn_repos/reporter.c
Used to détermine the appropriate delta operation to perform/report
based on the returned id distance.

This can use our new found friend AreNodeChangesTimelesslyRelated
because the time order isn't important here.

Find_nearest_entry:
      /* Find the distance between the target entry and this source
         entry. This returns -1 if they're completely unrelated.
         Here we're using ID distance as an approximation for delta
         size. */

No escape here, we have to do the work atm, unless we want to use a
different heuristic.

delta_dirs:
/* Check the distance between the ids.

0 means they are the same id, and this is a noop.

                 -1 means they are unrelated, so try to find an ancestor
                 elsewhere in the directory. Theoretically, using an
                 ancestor as a baseline will reduce the size of the
deltas.

                 Any other positive value means the nodes are related
                 through ancestry, so go ahead and do the replace
                 directly. */
We can use AreNodeChangesTimelesslyRelated here as well. If
AreNodeChangesTimelesslyRelated returns FALSE, then we'll call
find_nearest_entry which will do the really expensive work for us. No
need to call the expensive check twice for one chunk of data.

The important conclusion:
The directory auto-merging process is now ancestor checking free for a
commit. However, delta editor work still can require arbitrary ancestry
hopping, but only some of the time (when find_nearest_entry gets
called). Additionally, anybody who ever calls svn_fs_check_related is
just going to have to wait, but that was true before the schema change.

Phew... talk about a lot of work,
Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed May 8 07:01:52 2002

This message: [ Message body ]
Next message: Bill Tutt: "RE: RE: Re: Maintaining NodeID sanity"
Previous message: Karl Fogel: "regarding issue #582 and revision 1902"
In reply to: Karl Fogel: "Re: Maintaining NodeID sanity"
Next in thread: Bill Tutt: "RE: RE: Re: Maintaining NodeID sanity"
Reply: Bill Tutt: "RE: RE: Re: Maintaining NodeID sanity"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]