Re: getting nodes by ID

From: Greg Stein <gstein_at_lyra.org>
Date: 2001-03-13 19:18:20 CET

On Mon, Mar 12, 2001 at 11:46:22PM -0500, Jim Blandy wrote:
>
> (Is there a better term than DAV-ID? I'm happy to use the real DAV
> terminology, but I don't know what it is.)

There isn't a term for it. We are using a "DAV-ID" as a component of a
Version Resource URL (VR URL). The latter is our goal, and as long as the
URL retrieves a resource, it should retrieve exactly the same resource. I
don't think we want to have a VR that disappears in normal operation. That
capability is present for version control systems that delete (intermediate)
versions; we might use it when we archive old revisions, and remove them
from the repository. But in normal operation, a VR URL will always exist and
always point to the same thing.

The problem that we're seeing is that a VR URL which just uses an (ID, PATH)
pair can be insufficient for ACL processing (because we don't know which
root to use with PATH).

>...
> At the moment, there's a requirement that a DAV-ID alone is sufficient
> to efficiently retrieve a node: a request can just take a DAV-ID and
> hand you back its contents, quickly.
>
> What if we changed that rule so that the DAV-ID alone was sufficient
> to check identity --- "Okay, this file you've asked for is the same as
> one I've already got" --- but you need additional information to
> actually get the file, if the identities don't match. Then we could
> use node revision ID's as DAV-ID's --- if two node revision ID's are
> identical, then you've got a cache hit --- and supply a (REV,PATH)
> pair for retrieval. So you haven't lost condition 1.1 ("if their
> contents are equal(*), their DAV-IDs are equal."), since you're only
> comparing node revision ID's.

The VR URL is the thing used for retrieving the resource. The importance of
identity is twofold: 1) the client knows the VR URL for the resource in the
WC, so it can easily determine whether a new VR URL identifies a new
resource (which needs to be fetched); 2) a cache uses a URL as a key, so
stability in the URL improves caching (where identity means the contents are
the same).

I think what you're saying is that we have two URLs. One that isn't really
used for fetching, but acts as a way to check identity. Another to actually
perform the fetch. The fetching URL would blow off caching because it
includes REV in the URL.

And to recap: if a resource doesn't change between revisions 3 and 4, we
want the client and any caches to keep their contents. Since these are keyed
by the URL, any change in the URL (s/3/4/) is going to cause a refetch or a
cache miss.
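
To make the cache-miss point concrete, here is a minimal sketch in Python.
Everything in it is hypothetical: the URL layouts, the helper names, and the
node revision ID "7.1" are made up for illustration and are not anything
Subversion or mod_dav defines.

```python
# Hypothetical sketch: a cache keyed by URL, compared under two URL schemes.
# The URL layouts and all names here are illustrative assumptions.

def build_rev_url(rev, path):
    # Embeds the global revision number, so the URL changes on every commit.
    return f"/repo/rev/{rev}{path}"

def build_node_url(node_rev_id, path):
    # Embeds an ID that changes only when the node itself changes.
    return f"/repo/node/{node_rev_id}{path}"

class UrlCache:
    def __init__(self):
        self.store = {}
        self.misses = 0

    def get(self, url, fetch):
        # The URL is the only key: a different URL is always a miss.
        if url not in self.store:
            self.misses += 1
            self.store[url] = fetch()
        return self.store[url]

# bar.c is unchanged between revisions 3 and 4 (same node revision "7.1").
rev_cache, node_cache = UrlCache(), UrlCache()
for rev in (3, 4):
    rev_cache.get(build_rev_url(rev, "/bar.c"), lambda: "contents")
    node_cache.get(build_node_url("7.1", "/bar.c"), lambda: "contents")
```

With REV in the URL the unchanged file is fetched twice (rev_cache.misses is
2); with a stable ID in the URL it is fetched once (node_cache.misses is 1).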

> > > Also, it seems to me like the problems we're having in implementing
> > > this aren't really unique to Subversion. Can you show us a system
> > > where DAV-ID's are easy to implement, with all the desireable
> > > properties?
> >
> > A table mapping UUIDs to (REV, PATH) pairs.

To clarify the above: I meant we return UUIDs as a means of uniquely
identifying (REV, PATH) pairs. This scheme is being used by some systems
where /a/b/foo.c can exist as rev 1.33, get deleted, and be brought back
starting at rev 1.1. Thus, a URL which has (say) /a/b/foo.c?rev=1.33 will
not be unique over time. Using a UUID provides for distinguishing between
the two generations of foo.c.
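
A rough sketch of such a table, in Python. The class and method names are
invented for this example, and it assumes the server mints a fresh UUID
whenever a new generation of a file registers a (REV, PATH) pair; how the
server locates the stored contents behind each UUID is left out.

```python
import uuid

class UuidTable:
    """Hypothetical table mapping opaque UUIDs to (REV, PATH) pairs."""

    def __init__(self):
        self._by_uuid = {}

    def register(self, rev, path):
        # Each registration gets a fresh UUID, even if the pair repeats.
        key = uuid.uuid4().hex
        self._by_uuid[key] = (rev, path)
        return key

    def lookup(self, key):
        return self._by_uuid[key]

table = UuidTable()
# First generation of foo.c reaches rev 1.33, then the file is deleted.
first_gen = table.register("1.33", "/a/b/foo.c")
# The file is brought back at rev 1.1 and eventually reaches 1.33 again.
second_gen = table.register("1.33", "/a/b/foo.c")
```

Both UUIDs resolve to the same (REV, PATH) pair, yet they are distinct keys,
which is what lets a client or cache tell the two generations apart.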

Our problem is that a change to foo.c spawns a "logical" change to bar.c
(the global revision number bump). We are trying to avoid the caching and
identity problems around that implicit bump.

> So, each time I commit a new node revision as part of some
> transaction, I place a flag or property on that node revision giving
> the REV and PATH in which it first appears. Easy enough. From then
> on, DAV can use that (REV, PATH) as the DAV-ID, even when accessing
> revisions in which that path no longer exists, or no longer refers to
> that node revision.

I don't get this sentence. We'd use the (REV, PATH) pair to access just that
(old) revision and path. If a later revision deletes that PATH, then we
won't be attempting to fetch the resource any more. And if the node revision
changes, then presumably we have a new (REV, PATH) pair to fetch the new
node revision.

> And unlike a node revision ID, the filesystem can
> check authorization properly given a (REV,PATH) pair. So that meets
> all our requirements, I think.

Yes, this works. In one of my solutions, I mentioned storing the ROOT-ID on
the node. When ID is looked up, we fetch the ROOT-ID, and examine PATH for
ACLs. Storing a (REV, PATH) is about the same: use the REV to fetch ROOT-ID,
then examine PATH. Unfortunately, we want the DAV-ID to be different for two
PATH values so that we check the appropriate ACLs.
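
As a sketch of the (REV, PATH) lookup just described, in Python: map REV to
a ROOT-ID, then check the ACLs along PATH under that root. The data layout
is an assumption made for the example (an ACL is a set of allowed users, and
a path with no ACL entry is treated as open); none of it is the real
filesystem API.

```python
def path_prefixes(path):
    # "/a/b/foo.c" -> ["/", "/a", "/a/b", "/a/b/foo.c"]
    parts = [p for p in path.split("/") if p]
    prefixes = ["/"]
    for i in range(1, len(parts) + 1):
        prefixes.append("/" + "/".join(parts[:i]))
    return prefixes

def may_access(user, rev, path, rev_to_root, root_acls):
    # Step 1: use REV to fetch the ROOT-ID.
    root_id = rev_to_root[rev]
    # Step 2: examine every component of PATH under that root.
    acls = root_acls[root_id]
    # Assumption: a path with no ACL entry is open to everyone.
    return all(user in acls[p] for p in path_prefixes(path) if p in acls)

# Hypothetical data: /secret is restricted in revision 3 only.
rev_to_root = {3: "root-3", 4: "root-4"}
root_acls = {"root-3": {"/secret": {"alice"}}, "root-4": {}}

bob_rev3 = may_access("bob", 3, "/secret/foo.c", rev_to_root, root_acls)
alice_rev3 = may_access("alice", 3, "/secret/foo.c", rev_to_root, root_acls)
bob_rev4 = may_access("bob", 4, "/secret/foo.c", rev_to_root, root_acls)
```

Here bob is refused through revision 3 but allowed through revision 4, even
though both may name the same node, which is why the choice of root matters.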

[ and yes: maybe we decide we *don't* want them different... if you can't
access the node via one path, then should they be allowed to access via a
different path? IOW, are ACLs applied to nodes rather than paths? ]

> The problem here is that if someone sets an ACL on some old revision,
> making it inaccessible, then suddenly all the nodes in new revisions
> whose DAV-ID's happen to refer to the now-restricted revision become
> inaccessible.

Woah. I always assumed that an ACL change was a property change (thus
requiring a revision bump), yet it sounds like you're saying they occur
outside the realm of node revisions.

Well... no matter. Recall, the problem is to identify a (ROOT-ID, PATH) pair
to process the ACLs. The ACL processing is orthogonal to the DAV-ID.

Ah. I see what you're saying. We apply an ACL to *just* revision 3, but want
to leave revision 4 open. If people are accessing rev4 using DAV-IDs that
refer to rev3, then they're SOL.

Yes... that is the problem that I was trying to solve in my "sidetrack": how
do we detect ACL changes so that we can generate a DAV-ID that identifies
the "correct" (ROOT-ID, PATH) pair? However, I always assumed that ACLs
could not be retroactively applied. *If* we were to allow that, then we are
going to have a very tricky problem to solve.

My current position on the problem is generating a DAV-ID that can be
defined pretty much as:

  The DAV-ID for node N should identify a (ROOT-ID, PATH) pair such that
  ROOT-ID is the "earliest" ID of all revision roots which specify N at PATH
  and have the same ACLs along PATH as the "latest" ID of those revision
  roots.

This effectively means the ACLs for access to N are defined by the youngest
of all roots which specify that N at PATH. Once you have N+1, then a
different set of ACLs applies.
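
A literal reading of that definition can be sketched in Python as follows.
The representation is an assumption for illustration only: each revision
root is a (root_id, tree, acls) tuple, ordered oldest to youngest, where
tree maps a path to a node and acls maps a path to a single comparable value
standing in for "the ACLs along PATH".

```python
def dav_id(roots, node, path):
    """Return the (ROOT-ID, PATH) pair for node at path.

    roots: list of (root_id, tree, acls), oldest first (hypothetical layout).
    """
    # All revision roots which specify this node at this path.
    candidates = [
        (root_id, acls.get(path))
        for root_id, tree, acls in roots
        if tree.get(path) == node
    ]
    if not candidates:
        raise LookupError(f"{node!r} never appears at {path!r}")
    # ACLs as seen in the latest of those roots.
    latest_acl = candidates[-1][1]
    # Earliest root whose ACLs match the latest one.
    for root_id, acl in candidates:
        if acl == latest_acl:
            return (root_id, path)

roots = [
    ("root-1", {"/foo.c": "N"}, {"/foo.c": "open"}),
    ("root-2", {"/foo.c": "N"}, {"/foo.c": "restricted"}),
    ("root-3", {"/foo.c": "N"}, {"/foo.c": "restricted"}),
    ("root-4", {"/foo.c": "N+1"}, {"/foo.c": "restricted"}),
]
id_for_n = dav_id(roots, "N", "/foo.c")
id_for_n_plus_1 = dav_id(roots, "N+1", "/foo.c")
```

In this made-up history, N gets ("root-2", "/foo.c") because root-2 is the
earliest root carrying the same ACLs as root-3, the youngest root that still
has N at that path. N+1 gets ("root-4", "/foo.c").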

Hmm... I just realized that we don't want to determine the ACL set to apply
based on DAV-ID. Somebody could just use an old VR URL to avoid any recent
ACL changes (not to mention that I'd still hope the VR URL remains
unchanged for optimal caching on clients and proxies). It appears the
problem is about finding the right set of ACLs given an arbitrary node.
Sigh... harder.

> > If you have a versioning system which stores a new file for each change,
> > then you'd just use the internal pathname to that file.
>
> I think this has the same problems as using node revision ID's. Let
> me work the analogy through and see if it applies:
>
> There are two possibilities: either different revisions of a tree
> share or don't share these per-change files. I'm not assuming
> Subversion's global revision number model here. I just mean, select a
> tree by date or by tag or by branch or whatever: do the same internal
> pathnames appear in different trees?

You would ask the server for the set of VR URLs that match some criteria.
Then you go and fetch them. There isn't a "tree" that is exposed for all
possible queries. So yes, the same VR URLs could appear in all the different
query responses.

> They must, or else you've lost requirement 1.1, which is the whole
> point of the game.
>
> So if they do share them, and we support directory renaming and
> deletion, then given a particular internal pathname, it must have more
> than one "parent" path --- I mean a parent in the version controlled
> tree, not a parent in the internal tree structure. And the more trees
> in which it appears, the more parents it may have.

You're talking about "Versioned Collections" here. If you don't have
versioned collections, then you don't have multiple (logical) trees. You
just have COLL with a bunch of versioned members. There aren't two versions
of COLL, each with their own set of members.

When you throw versioned collections into the mix: yes, there are
significant problems :-)

> So when I present a DAV server with one of these DAV-ID's, how can it
> determine whether I am actually authorized to access it? It doesn't
> know which path of parents to check.

In HTTP (DAV, DeltaV), the ACL model applies to a singular resource. Yes,
internally it may have inherited a bunch, or it may scan a bunch of data for
it, but to the *client* of the server, there is only a single question: "can
I perform METHOD on RESOURCE?" And it isn't really even a question -- you
just try it and get success or failure. There are some query operations to
fetch information to help you determine whether it will work or not, but the
final arbiter is just performing METHOD.

The implication of the above is simply that which/what/how ACLs are determined
and applied is "out of scope". There is a WebDAV ACL spec, but it is still
focused around "apply THIS ACL to THAT resource." In the presence of
versioning, and versioned collections, and whatever else, it doesn't provide
for the full rules.

So the short answer to "how can it determine..." is simply "beats me; that
is up to the server to figure out." Granted, not helpful, but that is the
meat of it.

> > If you had a database of files, with a file per row (nodes!), then
> > you could use the node id. (we're throwing acls into the mix, which
> > bungles up this approach, but I bet that is about the data modelling
> > rather than a statement about the validity of the requirement)
>
> Well, but that's exactly my question: I can't figure out how to
> implement DAV-ID's without throwing away some of the ACL behavior
> which we had thought was reasonable. If you think we should revise
> that part of the ACL idea, so that one can access nodes even if one
> doesn't have "execute" access (or whatever we call it) to its parents,
> then that's a point to discuss. But I think folks will be surprised
> to learn where that requirement comes from.

Oh, I agree: parents should apply. My statement above was about constructing
DAV-ID values (well, VR URLs) from available data. Not a statement about the
validity of ACLs.

>...
> What I meant was, "this works, but it doesn't help us, because we
> don't have that simplification." My whole point is that DAV-ID's
> don't seem to mix (as far as I can tell) with ACL's on parents that
> restrict access to children (as, say, Unix execute permissions on
> directories do), when a child can appear in multiple parents.

Partially agree: with our current system, they don't appear to mix well. But
I think that is caused by two problems:

1) we don't have a firm grasp of what we want the ACL model to be, within
the space of multiple access paths to a given node
2) our data modelling

Once we decide on (1), then we can figure out (2). If (2) doesn't match what
we currently have, then we change the data model, or we change our ACL
rules, or we live with some caveats, or whatever. But we aren't there yet.

[ I'm bringing in data modelling because it can easily be said that part of
our problem is the inability to easily answer questions like "what is the
latest revision where PATH points to N" as part of the ACL application ]

>...
> > It should not be hard to compute OLDEST-REV for any given node. Just record
> > it in the node when it is created. The obvious problem is that we only have
> > TXN when we're creating the node, not REV. This would imply the need for
> > retaining a TXN -> REV mapping; to compute OLDEST-REV, you'd take the nodes'
> > TXN and pass it thru the mapping to get a REV. The mappings could sit in the
> > transactions table (possibly by storing a skel such as ("committed" REV)),
> > or a fourth table could be used.
>
> As I said above, I'd be happy to record this info in the node revision
> when it's created. That's equivalent to your suggestion, I think.
> The problem is that now the user needs to be authorized to traverse
> both of two distinct paths to reach a node --- both the path in the
> revision he's actually working in, and the path in OLDEST-REV. We
> only want the former to matter.

Well, that's a discussion point :-) ... Only the former? Then what use is
the ACL, if they can simply use a different REV or PATH to access the
resource? Do we take the set of all (REV, PATH) that access N and compute
the strictest or the least-strict ACL for N? etc.
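
To show what the two options would mean, here is a small Python sketch. It
assumes, purely for illustration, that an ACL is just the set of users
allowed in, so "strictest" becomes a set intersection and "least-strict" a
set union over every (REV, PATH) that reaches N. Both helpers expect at
least one ACL.

```python
def strictest(acls):
    # Only users allowed via every access path get in.
    return set.intersection(*acls)

def least_strict(acls):
    # Users allowed via any access path get in.
    return set.union(*acls)

# Hypothetical: node N is reachable through two (REV, PATH) pairs.
access_paths = {
    (3, "/private/foo.c"): {"alice"},
    (4, "/public/foo.c"): {"alice", "bob"},
}
acl_sets = list(access_paths.values())
strict_result = strictest(acl_sets)
loose_result = least_strict(acl_sets)
```

Under the strictest rule only alice can read N; under the least-strict rule
bob can too, which is exactly the "use a different REV or PATH" loophole.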

>...
> > Storing the ROOT-ID should allow us to use (ID, PATH) as the DAV-ID, which
> > meets every condition that we've specified.
>
> Yep. Same problem as (OLDEST-REV, PATH), though. (Equivalent, I think.)

Yes.

> I didn't really understand your Answer #2 and Answer #3. But you
> raise a critical point: we don't really know how ACL's are going to
> work --- frankly, I think it might be a challenge to get useful
> semantics, so we may see some severe hair --- and since the exact
> problem here is the interaction between authorization and DAV-ID's, we
> need to understand how ACL's are going to work before we can tackle
> this. Just recognizing that there's an interesting interaction there
> is a good thing.

Agreed :-)

> Pending the addition of ACL's to the system, I don't see any reason
> not to use node revision ID's as DAV ID's.

That is certainly a good intermediate step. And since we'll be treating VR
URLs as opaque entities, we can change it on the server without a problem.
(the client could end up with "stale" VR URLs, but it is possible to refetch
new ones from the server)

I also note that we appear to be stretching past our mantra of "for 1.0,
meet CVS's features, yet pass it for low-hanging fruit". CVS simply has "if
you can authenticate with this server, then you're allowed." People
typically layer on per-module ACL support, but they apply at a very gross
level and quite independently of the revisions. We can *match* that
capability just with Apache's ACLs; no need for ACLs in the FS.

I'm going to capture a few of the key issues in a notes/acl-issues.txt for
later discussion. Until we tackle that, I'll use (ID, PATH) within the VR
URLs.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Received on Sat Oct 21 14:36:25 2006
