getting nodes by ID

From: Jim Blandy <jimb_at_zwingli.cygnus.com>
Date: 2001-03-05 23:13:49 CET

DAV wants to be able to retrieve nodes by DAV-ID. This DAV-ID doesn't
have to be a Subversion FS node revision ID --- it can be any
arbitrary string of bytes such that:
1) if two nodes' DAV-ID's are equal, their contents are equal, and
2) it's reasonably fast to retrieve a node given a DAV-ID.

The idea, I think, is that when you ask for the contents of a
directory, you'd get back a list of (name, DAV-ID) pairs, and then you
could retrieve the contents you were interested in by DAV-ID. A
caching proxy would use DAV-ID's to recognize text it had seen before.
So we should add the following property:

3) As often as possible, if a node's contents have not changed, its
DAV-ID shouldn't change.

The first idea was that we'd simply use filesystem node revision ID's
as DAV-ID's. That certainly meets the requirements above. But that
won't work, because the filesystem can't tell whether you are
authorized to access a file without knowing the path through which you
reached it --- if you don't have permission to read some parent
directory, you shouldn't be able to read its contents, no matter what
the permissions are on them. So we need to add:

4) Given a DAV-ID, there must be a quick way to determine whether the user
is authorized to access the node.

The next suggestion was to use a pair (PATH, NODE-REV-ID). But PATH
is useless without a revision or transaction whose root we can start
from. The database isn't indexed to quickly answer questions like "in
what revisions does PATH refer to NODE-REV_ID?" (I think such an
index would completely duplicate the directory structure, unless
someone thinks of something clever.)

Another possibility would be to use a pair (REV, PATH), where REV is
the oldest revision in which PATH refers to the node we want. This
satisfies all four properties, but there's no quick way to choose the
right REV, without trying them all. Since the "oldest" part is only
necessary to satisfy condition 3), we can relax it a bit to "the
oldest revision we can find". But this is still a mess. We're doing
an unbounded number of path traversals in the database. So we need to
clarify:

5) It should be quick to compute a suitable DAV-ID for each node as we
traverse the filesystem.

Would it be possible to change DAV to remove condition 2)? That is,
can we use DAV-ID's to notice cache hits, but supply extra information
to get the real thing when we miss? That way, we could use node
revision ID's for DAV-ID's, but also supply a (ROOT, PATH) pair that
could be looked up easily.

Also, it seems to me like the problems we're having in implementing
this aren't really unique to Subversion. Can you show us a system
where DAV-ID's are easy to implement, with all the desireable
properties? For example, what if I'm implementing DAV on CVS --- what
kind of DAV-ID would you use there? You could use (PATH, RCS-NUMBER),
but only because CVS doesn't really allow you to delete directories,
or support renaming. I guess I'm beginning to wonder if this isn't a
half-baked idea in general, not just a problematic case for
Subversion.
Received on Sat Oct 21 14:36:25 2006

This message: [ Message body ]
Next message: Karl Fogel: "Re: delta metric"
Previous message: Jim Blandy: "Re: CVS update: subversion/subversion/tests/libsvn_fs fs-test.c"
Next in thread: Karl Fogel: "Re: getting nodes by ID"
Maybe reply: Karl Fogel: "Re: getting nodes by ID"
Reply: Greg Stein: "Re: getting nodes by ID"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]