Re: getting nodes by ID

From: Greg Stein <gstein_at_lyra.org>
Date: 2001-03-09 13:07:57 CET

On Mon, Mar 05, 2001 at 05:13:49PM -0500, Jim Blandy wrote:
>
> DAV wants to be able to retrieve nodes by DAV-ID. This DAV-ID doesn't
> have to be a Subversion FS node revision ID --- it can be any
> arbitrary string of bytes such that:
> 1) if two nodes' DAV-ID's are equal, their contents are equal, and

Yes. And:

1.1) if their contents are equal(*), their DAV-IDs are equal.

(*) equal in the sense that the nodes have some relationship; if two files
*happen* to have the same content, then there is no expectation of
"equality" here; and "node" needs to be defined better -- it could refer to
"node means rev/path pair" or "dag_node"; two dag_nodes will almost
invariably not be equal.

> 2) it's reasonably fast to retrieve a node given a DAV-ID.

Define "fast". I would hope that you don't have to do linear searches to
retrieve the node. I'd hope for O(1).

IOW, "fast" == "not linear time"

> The idea, I think, is that when you ask for the contents of a
> directory, you'd get back a list of (name, DAV-ID) pairs, and then you
> could retrieve the contents you were interested in by DAV-ID. A
> caching proxy would use DAV-ID's to recognize text it had seen before.

Exactly.

> So we should add the following property:
>
> 3) As often as possible, if a node's contents have not changed, its
> DAV-ID shouldn't change.

Yes.

> The first idea was that we'd simply use filesystem node revision ID's
> as DAV-ID's. That certainly meets the requirements above. But that
> won't work, because the filesystem can't tell whether you are
> authorized to access a file without knowing the path through which you
> reached it --- if you don't have permission to read some parent
> directory, you shouldn't be able to read its contents, no matter what
> the permissions are on them.

Yup to all.

> So we need to add:
>
> 4) Given a DAV-ID, there must be a quick way to determine whether the user
> is authorized to access the node.

Define "quick" :-) ... again, I'd say "not linear time".

> The next suggestion was to use a pair (PATH, NODE-REV-ID). But PATH
> is useless without a revision or transaction whose root we can start
> from. The database isn't indexed to quickly answer questions like "in
> what revisions does PATH refer to NODE-REV_ID?" (I think such an
> index would completely duplicate the directory structure, unless
> someone thinks of something clever.)

All right. We'll posit this as "not feasible" for now...

> Another possibility would be to use a pair (REV, PATH), where REV is
> the oldest revision in which PATH refers to the node we want. This
> satisfies all four properties, but there's no quick way to choose the
> right REV, without trying them all. Since the "oldest" part is only
> necessary to satisfy condition 3), we can relax it a bit to "the
> oldest revision we can find". But this is still a mess. We're doing
> an unbounded number of path traversals in the database.

Yup.

> So we need to
> clarify:
>
> 5) It should be quick to compute a suitable DAV-ID for each node as we
> traverse the filesystem.

Yup.

> Would it be possible to change DAV to remove condition 2)?

Well, it is reasonable assumption that the user will typically be fetching
the latest. IOW, heuristics will typically make this fast.

> That is,
> can we use DAV-ID's to notice cache hits, but supply extra information
> to get the real thing when we miss? That way, we could use node
> revision ID's for DAV-ID's, but also supply a (ROOT, PATH) pair that
> could be looked up easily.

If ROOT appears within the DAV-ID, then you've lost condition 1.1.

> Also, it seems to me like the problems we're having in implementing
> this aren't really unique to Subversion. Can you show us a system
> where DAV-ID's are easy to implement, with all the desireable
> properties?

A table mapping UUIDs to (REV, PATH) pairs. The UUID design has already been
discussed as an approach (I'd say somebody is planning to use it, but most
people on the DeltaV list are proprietary vendors and barely even say what
they're building, let alone their design).

If you have a versioning system which stores a new file for each change,
then you'd just use the internal pathname to that file.

If you had a database of files, with a file per row (nodes!), then you could
use the node id. (we're throwing acls into the mix, which bungles up this
approach, but I bet that is about the data modelling rather than a statement
about the validity of the requirement)

> For example, what if I'm implementing DAV on CVS --- what
> kind of DAV-ID would you use there? You could use (PATH, RCS-NUMBER),
> but only because CVS doesn't really allow you to delete directories,
> or support renaming.

That is exactly what I'd use because it works very well. "but only" is moot.
It's the CVS design, and we can take advantage of it.

> I guess I'm beginning to wonder if this isn't a
> half-baked idea in general, not just a problematic case for
> Subversion.

It is emphatically NOT half-baked.

The DAV-ID concept is really just another name for a version resource URL.
These URLs are intended to be "stable" forever. That is, they don't
disappear and their contents don't change. Ever. I can send you one of these
URLs, and I don't need to worry about later updates blowing over what I
sent. The part about keeping them the same is to prevent the implication
that changes have occurred (you make value judgements given VR1 and VR2
without needing to fetch both of them).

The network performance implications are tremendous, if we can meet
condition 1. And I mean BIG.

[ just ask me how happy SourceForge would be, to be able to use a farm of
caching reverse proxies in front of their SVN servers; or a group of
people using transatlantic links via a caching proxy ]

As Karl points out, if you have (R1, PATH) and (R2, PATH), and you want to
know that they are equivalent, then how would you do it? We're answering it
using DAV-ID, but that is just a means to an end. IOW, the comparison is the
important part (ie. comparison leads to dav-id, not the other way), and it
has happened to be exposed through our DAV operation.

[ that said: I'm not entirely sure of the "proper" way for the server to
expose the concept; for example, how does it work over ra_local ]

It should not be hard to compute OLDEST-REV for any given node. Just record
it in the node when it is created. The obvious problem is that we only have
TXN when we're creating the node, not REV. This would imply the need for
retaining a TXN -> REV mapping; to compute OLDEST-REV, you'd take the nodes'
TXN and pass it thru the mapping to get a REV. The mappings could sit in the
transactions table (possibly by storing a skel such as ("committed" REV)),
or a fourth table could be used.

Let's take a different approach here. We don't need to answer the question
of "what is the oldest revision this node appears in?", but we need to
answer "what are the ACLs that apply to this node?" We posited this in (4),
but kept trying to answer the former question.

How about this: when a node is created, we record the ROOT-ID (of the txn)
in the node. To determine the ACLs that apply to any given node (and PATH!),
we bounce up to the node's ROOT-ID, then traverse the PATH for ACLs.

Storing the ROOT-ID should allow us to use (ID, PATH) as the DAV-ID, which
meets every condition that we've specified.

-------- Sidetrack Start

Now, there is one more, very subtle "gotcha". Let's say that I have a node
that is unchanged between R1 and R2. The delta for R1 -> R2 is simply an ACL
change on the root directory to disable reads. We now have a problem in that
the DAV-ID for (R2, PATH) is effectively (R1, PATH), so the ACL change won't
take effect. Oops! :-)

Answer: the DAV-ID becomes (ID, PATH, ACL-MD5), where ACL-MD5 represents an
MD5 checksum of all ACLs on PATH. Effectively, we're saying that a node's
state is more than just its contents: it also includes all the ACLs on its
parents.

Of course, the above "answer" doesn't work. We don't have a way to recover
the appropriate ROOT-ID for a given DAV-ID (i.e. we can't recover R2).

Answer #2: if the ACL-MD5 for (REV, PATH) is the same as the ACL-MD5 for
(ROOT-ID, PATH), then we use (ID, PATH) as the DAV-ID. If they are unequal,
then we use (ID, PATH, REV). Of course, this bungs things up because once an
ACL changes, then the DAV-ID tracks REV, which breaks condition 1.1.

Answer #3: this one gets a bit more complicated, and is not O(depth), but
needs to scan some nodes during DAV-ID construction. Essentially, the DAV-ID
becomes (ID-1, ID-2, ..., ID-n) (one for each component in PATH) such that
ID-i is the oldest revision where the ACL-MD5 for the tail-path matches the
ACL-MD5 for the current revision.

-------- Sidetrack End

The problem is that a given resource state varies based on external factors.
If I'm referring to some node ID, its "value" changes based on things
happening elsewhere in the tree.

If you stop and consider that the R1 -> R2 change imposed a different set of
ACLs on the same (unchanged) resource FOO, then the *effective* ACL is the
less restrictive of the ACLs on FOO's PATH in R1 and R2. Let's say that I
added restrictions. No big deal, I can back up to R1 and fetch the file; the
additional restrictions in R2 are useless. Let's say that I relax the
restrictions, then I can move forward to R2 to get the resource, where I
couldn't thru R1.

I believe that to section in "Sidetrack" needs to wait on an answer to
whether/how we want to solve the application of changing ACLs on the
parent(s) of a given resource. For example, we could state the policy as
"access to (ID, PATH) is based on the least restrictive set of ACLs on PATH
for each REV that selects ID at PATH." Or we could say "access is based on
the ACLs on PATH for the oldest REV that selects ID at PATH." Or the latest
REV.

The basic issue is one of:

1) if the ACL applied is based on the ("desired"-REV, PATH) pair, thus
   allowing success/failure even though (OTHER-REV, PATH) is different, then
   we need to distinguish every single (REV, PATH) pair so that each one can
   have is own success/failure conditions.
   [ and including REV busts 1.1 ]

2) if the ACL applied is unified/specified in some way to be "the ACL from
THIS-REV", then we just need a way to calculate THIS-REV (based on our
policy assertion) at DAV-ID construction time.

When we form a policy, then we go back to solving DAV-ID.

I'm not sure on what I'd suggest for an ACL application policy. I think we
mentioned the different-ACL-via-different-paths possibility before, but I
never *really* thought about the full implications until now. At this point,
I don't have a feeling for the "right" user model here. I believe that I'm
tending towards "apply the ACL from the latest REV that selects (ID, PATH)"
on the basis that the later ACLs are your true intent for the tree.

Hmm. Actually, I'm pretty solid in that camp. Consider doing a "tag"
operation for a release. You copy /main to /tags/1.0b1 and commit (as 1704).
Then you decide to make it read-only, so you plop a read-only ACL on the
directory (and commit as 1705). We want everything in that directory to use
the new ACL. No changing the release via 1704.

Assuming we use this policy, then I can craft an algorithm for DAV-ID
computation.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Received on Sat Oct 21 14:36:25 2006

This message: [ Message body ]
Next message: Greg Stein: "pool usage (was: Re: CVS update: subversion/subversion/libsvn_fs rev-table.c)"
Previous message: Greg Stein: "Re: delta metric"
In reply to: Jim Blandy: "getting nodes by ID"
Next in thread: Jim Blandy: "Re: getting nodes by ID"
Reply: Jim Blandy: "Re: getting nodes by ID"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]