architectural and project plan improvements

From: Tom Lord <lord_at_regexps.com>
Date: 2002-12-14 22:25:52 CET

[Arch-dev list readers: this sketch suggests a clean architectural
approach to the "revision library drives repository" idea that is
sometimes kicked around. It arises out of some talk on svn's dev list
and is proposed there as a cleaner interface for svn-style storage
mgt.]

jimb:

> wrinkles increase the initial investment of mind needed to
> use the thing to the point where the project just isn't as
> exciting any more to the hacking public. They raise the
> hacktivation energy needed.

Yes, excessive intertwingling subverts good software and yadda yadda
yadda. So, that's the useful software mysticism.

Ok, let's give ourselves something more concrete than "hacktivation
energy" to talk about. In this message, I'll just sketch a revised
spec for the svn storage manager and how it relates to a revision
control client. It may seem a bit foreign or "far off the current
path" at first glance -- but I think closer examination will show that
this is a practical alternative path that leads (roughly speaking) to
a superset of your 1.0 target, superior in functionality,
simpler/cleaner in implementation -- it will _save_ you work in the
short-to-medium term.

Let's suppose I want to define an FSDB -- a "file system database".
FSDB is in the same general category as OODB or RDBMS, but it differs
from those other systems in terms of how storage is organized and
accessed.

An FSDB has (something close to) the following access methods:

        GET <txn> <path> [<start> <length>]
        PUT <txn> <path> [<start> <length>] <data>

           Retrieve (store) the contents (or partial contents) of a
           file.

        RENAM <txn> <from-path> <to-path>
        RNMDIR <txn> <from-path> <to-path>

           Rename a file (directory).

        COPY <txn> <from-path> <to-path>

           Copy a file. This could in principle be synthesized from
           GET and PUT, but this method does not require transfer of
           data back and forth from storage manager to client.

        CLNDIR <txn> <from-path> <to-path>

           Clone a directory tree. Semantically, a `cp -r' operation.

        LIST <txn> <dir-path> [<options>]

           Retrieve a list of files in a directory.

        REMOVE <txn> <dir-path>
        RMDIR <txn> <dir-path>
        RMDIRR <txn> <dir-path>

           Remove a file (empty directory, non-empty directory).
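
As a rough illustration (not part of the proposal itself), the
file-level methods above could be modeled in a few lines of Python.
This is a minimal in-memory sketch under loud assumptions: the class
name `FSDB`, the flat path-to-bytes dictionary, and the omission of the
<txn> argument are all hypothetical simplifications, and PUT takes only
a start offset because the length is implied by the data.

```python
class FSDB:
    """Toy in-memory store: a flat map from path to bytes.

    Hypothetical sketch; directories are implicit path prefixes and
    the <txn> argument of the real methods is left out.
    """

    def __init__(self):
        self.files = {}  # path -> bytes

    def put(self, path, data, start=None):
        if start is None:
            self.files[path] = bytes(data)
            return
        old = self.files.get(path, b"")
        old = old.ljust(start, b"\0")  # zero-pad when writing past the end
        self.files[path] = old[:start] + data + old[start + len(data):]

    def get(self, path, start=0, length=None):
        data = self.files[path]
        end = len(data) if length is None else start + length
        return data[start:end]

    def copy(self, src, dst):
        # bytes are immutable, so both paths can share one object:
        # no data travels between "storage manager" and "client".
        self.files[dst] = self.files[src]

    def clndir(self, src, dst):
        prefix = src.rstrip("/") + "/"
        for path in list(self.files):
            if path.startswith(prefix):
                new_path = dst.rstrip("/") + "/" + path[len(prefix):]
                self.files[new_path] = self.files[path]

    def list(self, dir_path):
        prefix = dir_path.rstrip("/") + "/"
        return sorted({p[len(prefix):].split("/", 1)[0]
                       for p in self.files if p.startswith(prefix)})

    def remove(self, path):
        del self.files[path]


db = FSDB()
db.put("trunk/README", b"hello world")
db.put("trunk/README", b"HELLO", start=0)   # partial overwrite
db.clndir("trunk", "tags/1.0")              # the `cp -r' analogue
db.remove("trunk/README")                   # the clone is unaffected
```

After these calls the clone under tags/1.0 still holds the file even
though the original was removed, which is the property that lets
cloned directories stand in for revisions later on.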

My (unverified) understanding is that some of the BSD file systems and
some Linux file systems and recent NFS RFCs have limited forms of file
properties. The following methods capture a least common denominator
of those systems:

        GETPRP <txn> <dir-path> <prop>
        PUTPRP <txn> <dir-path> <prop> <data>

           Retrieve (store) a file property value. This _might_
           subsume STAT functionality (retrieve file size, inode
           number, etc. -- or maybe STAT is separate). Properties
           are (most likely) simple length-limited strings -- 0
           terminated for property names, binary for data.

Symbolic links and permissions:

        SYMLNK <txn> <from-path> <to-path>
        RDLINK <txn> <link-path>

           Make (read) a symbolic link.

        CHMOD <txn> <path> <mod-changes>
        GETMOD <txn> <path>

           Change (query) ugo file permissions. Some native file
           systems now have access lists, and these methods might
           be able to handle those as well.

And the tricky ones:

        MKTXN <authdata> [r|w]
        ENDTXN <txn>
        KILTXN <txn>

           Begin (end, kill) a transaction. All other methods may
           only be invoked within a transaction.
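
The message deliberately leaves transaction semantics open, so the
following is only one possible reading, sketched in Python. Everything
here is an assumption: the class name `TxnStore`, the choice to give
each transaction a private snapshot, and the "last committer wins"
rule with no conflict detection are illustrative, not part of the
proposal.

```python
import itertools


class TxnStore:
    """Hypothetical snapshot-per-transaction store (illustration only)."""

    def __init__(self):
        self.committed = {}             # path -> bytes
        self.open = {}                  # txn id -> (mode, private view)
        self._ids = itertools.count(1)

    def mktxn(self, authdata, mode="r"):
        # authdata is accepted but ignored; auth is out of scope here.
        txn = next(self._ids)
        self.open[txn] = (mode, dict(self.committed))
        return txn

    def put(self, txn, path, data):
        mode, view = self.open[txn]     # KeyError outside a transaction
        if mode != "w":
            raise PermissionError("read-only transaction")
        view[path] = data

    def get(self, txn, path):
        return self.open[txn][1][path]

    def endtxn(self, txn):
        mode, view = self.open.pop(txn)
        if mode == "w":
            self.committed = view       # assumed: last committer wins

    def kiltxn(self, txn):
        self.open.pop(txn)              # discard the private view


store = TxnStore()
t1 = store.mktxn("alice", "w")
store.put(t1, "notes", b"draft")
store.kiltxn(t1)                        # killed: nothing becomes visible

t2 = store.mktxn("alice", "w")
store.put(t2, "notes", b"final")
store.endtxn(t2)                        # ended: changes become visible

t3 = store.mktxn("bob")                 # read-only by default
```

A real storage manager would need to say what happens when two write
transactions overlap; this sketch sidesteps that question entirely.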

Finally, let's add two operations that are critical to keeping network
traffic down:

        DOPTCH <txn> <dir-path> <changeset>
        MKPTCH <txn> <orig-dir-path> <mod-dir-path>

           Apply (retrieve) a changeset in the format of RFC????.
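
To make the round trip concrete, here is a small Python sketch of the
MKPTCH/DOPTCH pairing. The changeset format is not specified in this
message, so the dictionary layout below ("removed" and "written" keys)
is purely hypothetical, as are the function names; it also stores
whole file contents where a real format would presumably carry deltas.

```python
def mkptch(orig, mod):
    """Compute a changeset that turns tree `orig` into tree `mod`.

    Trees are modeled as dicts from relative path to bytes.
    The changeset layout is an invented placeholder format.
    """
    return {
        "removed": sorted(p for p in orig if p not in mod),
        "written": {p: d for p, d in mod.items() if orig.get(p) != d},
    }


def doptch(tree, changeset):
    """Return a new tree with `changeset` applied to `tree`."""
    removed = set(changeset["removed"])
    result = {p: d for p, d in tree.items() if p not in removed}
    result.update(changeset["written"])
    return result


orig = {"Makefile": b"all:\n", "src/a.c": b"int a;\n", "old.txt": b"x"}
mod = {"Makefile": b"all:\n", "src/a.c": b"int a = 1;\n", "src/b.c": b"int b;\n"}

cs = mkptch(orig, mod)
patched = doptch(orig, cs)
```

Only the changed and added files appear in the changeset; the
untouched Makefile does not, which is the point about keeping network
traffic down.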

I'm not going to speak, in this note, about how deadlocks are resolved
and so forth. Similarly, I'm not going to say anything about user
authentication.

Some things to note:

        1) No (client visible) "repository version number".

        2) Destructive operations (e.g. PUT, RMDIR).

        3) No log messages.

        4) Minimalist approaches to file properties and access lists.

        5) COPY and CLNDIR provide hints to the server about when to
           use delta-compressed storage. No (client visible)
           history-independent file ids.

        6) The changeset format is strictly orthogonal to everything
           else. It applies equally well to native file systems, for
           example.

        7) Structure and access methods modeled after native file
           systems. Indeed, this access protocol admits a very simple
           implementation that uses native file systems, hard links,
           and a few simple control files (moderately efficient,
           probably not making any use of delta-compressed storage,
           possibly making use of compressed storage).

        8) No leakage to server-side of the concept of a working
           directory.
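
Point 7 can be illustrated with a short Python sketch of CLNDIR on a
native file system using hard links. This is an assumption-laden toy,
not a description of any existing implementation: the function names
are hypothetical, control files are omitted, and PUT is modeled as
"write a new file, then rename over the old name" so that clones
sharing the old inode are not disturbed.

```python
import os
import tempfile


def clndir(src, dst):
    """Clone tree `src` to `dst`, hard-linking files instead of copying."""
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))


def put(path, data):
    """Replace a file's contents without touching other links to it."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)   # `path` now names a new inode


base = tempfile.mkdtemp()
trunk = os.path.join(base, "trunk")
os.makedirs(os.path.join(trunk, "sub"))
trunk_file = os.path.join(trunk, "sub", "f")
put(trunk_file, b"v1")

tag = os.path.join(base, "tag")
clndir(trunk, tag)                       # cheap: no file data is copied
tag_file = os.path.join(tag, "sub", "f")
shared_before = os.path.samefile(trunk_file, tag_file)

put(trunk_file, b"v2")                   # the clone keeps the old contents
```

The clone costs one directory entry per file, and a later PUT on the
original leaves the cloned tree unchanged.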

Did I leave anything out? A system roughly like that should
characterize the repository in a C API and CLI. Both APIs should be
"network transparent". I believe that this is CLOSE to what you have
-- a nice target that could be hit via slight refactorings, clean-ups,
simplifications, and some careful thinking about auth and txn
semantics. It's a target that you _could_ conceivably hit by layering
_over_ the existing svn client libraries -- but that would be putting
the cart before the horse by doing far more work overall than is
needed. The end result will be a far more useful storage manager, a
cleaner architecture overall, a simpler implementation, a more readily
explainable rev. ctl client, and yadda yadda yadda. (At least, IMO.)

Now, guess what: that's _all you need_ to build a strictly layered
(client side) revision control system (and many other handy apps,
besides). For example, where you currently need the repository
version number, you can instead use information embedded in paths to
cloned directories. For revision control, I'd hope that you would do
this in two parts:

        1) A set of formally speced conventions that describe how
           revision control data is mapped into the file system.

        2) Clients, in various styles, that put a user interface
           on those conventions.
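
As a sketch of what such a convention might look like, the Python
below encodes revisions in the paths of cloned directories instead of
a repository version number. The layout
`<project>/<branch>/rev-<n>` and all function names are invented for
illustration; the message does not prescribe any particular naming
scheme.

```python
import re

# Hypothetical convention: each revision of a branch is a cloned
# directory named rev-<n> under <project>/<branch>/.
REV_RE = re.compile(r"^rev-(\d+)$")


def revision_path(project, branch, n):
    return f"{project}/{branch}/rev-{n}"


def latest_revision(names):
    """Given LIST output for a branch directory, return the highest
    revision number, or None if there are no revisions yet."""
    revs = [int(m.group(1)) for m in map(REV_RE.match, names) if m]
    return max(revs, default=None)


def next_revision_path(project, branch, names):
    """Path a client would CLNDIR the latest revision into."""
    latest = latest_revision(names)
    return revision_path(project, branch, 0 if latest is None else latest + 1)


names = ["rev-0", "rev-1", "rev-10", "README"]   # pretend LIST result
latest = latest_revision(names)
target = next_revision_path("hello", "main", names)
```

A client committing a change would then CLNDIR the latest revision
directory to `target` and apply its changeset there, all inside one
transaction.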

Arch has demonstrated (and given good design hints) that distributed
revision control can be implemented client-side -- mostly by picking
good, global names. A server-side MKPTCH method complements the
arch-style design beautifully.

Finally, if you want to get really serious about taking on OODB and
RDBMS systems, you can add something along the lines of:

        MKINDX <txn> <path> [<params>]
        IXPUT <txn> <path> (<key> <value>)*
        IXGET <txn> <path> <key>*

           Manage special files (not accessible with GET or PUT) that
           provide a low-level indexing facility. In good
           implementations, cloning one of these special files implies
           a space-efficient representation of the two resulting
           indexes.

        MKPGFL <txn> <path> [<params>]
        PFGET <txn> <path> <page-id>
        PFPUT <txn> <path> <page-id> <page-data>

           Manage special files that are optimized for page-oriented
           access.
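
The remark about cloning an index file being space-efficient can be
sketched as follows in Python. This is one assumed approach (a chain
of shared, frozen layers with a private overlay per index), not
something the message specifies; the class and method names are
hypothetical and deletion is not modeled.

```python
class Index:
    """Hypothetical key/value index whose clones share storage."""

    def __init__(self, parent=None):
        self.parent = parent   # shared layer, never mutated once shared
        self.local = {}        # private overlay for this index

    def ixput(self, *pairs):
        for key, value in pairs:
            self.local[key] = value

    def ixget(self, *keys):
        return [self._lookup(k) for k in keys]

    def _lookup(self, key):
        node = self
        while node is not None:
            if key in node.local:
                return node.local[key]
            node = node.parent
        return None

    def clone(self):
        # Freeze the current contents into a shared layer; the original
        # and the clone each continue with an empty private overlay.
        shared = Index(self.parent)
        shared.local = self.local
        self.parent, self.local = shared, {}
        return Index(shared)


ix = Index()
ix.ixput(("a", 1), ("b", 2))
ix2 = ix.clone()            # no entries are copied
ix2.ixput(("b", 20))        # visible only in the clone
ix.ixput(("c", 3))          # visible only in the original
```

Both indexes read the shared entries through the same layer, so the
cost of a clone is independent of the index size in this sketch.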

-t

Received on Sat Dec 14 22:14:43 2002

In reply to: Jim Blandy: "Re: gcc source management requirements"