
Towards a native-FS-backed filesystem

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2004-03-30 05:51:46 CEST

Here are my design ideas for a native-FS-backed filesystem. Feel free
to ask me for elaboration where I've been too concise; I didn't want
to write a book.

I don't think it's possible for one FS design to satisfy everyone.
You can't have a single design with low overhead, good scalability,
zippy speed, simplicity, flexibility with regard to access
permissions and underlying file store, and robustness. So, my goal is
to come up with something that satisfies a different (and still
reasonable) set of priorities from the one we currently have.
Specifically, I want to provide:

  * Low overhead. If you have three 10K files in a repository, the
    repository shouldn't be much larger than 30K.

  * Low administration. No need to worry about running out of locks,
    for instance.

  * Scalability in the face of high read throughput and moderate write
    throughput.

  * Flexibility. You should be able to host the repository on a
    network filesystem; all you need is the ability to read all the
    files in the repository and (for commits) the ability to insert a
    new file.

  * Simplicity and robustness.

The sacrifices will be:

  * It might not be as zippy as the current back end. Because of my
    plan to reverse the direction of delta arrows (see below),
    checkouts of recent revisions may have to do more work than they
    do in BDB, although we won't know for sure unless we implement and
    measure. We could stick a big plaintext cache in front of the
    repository to speed up checkouts, but that adds complexity and
    can't be a universal cure.

  * It might not scale as well in the face of high write throughput.

Okay, enough about requirements. Here is my current design plan, from
a thousand feet up:

  * Finished revisions are represented as a single file.

  * Node-revisions are located by a revision number and an offset in
    the revision file. (However, the revision file format will be
    plain text.)

  * All delta arrows point backwards in time, as opposed to our
    current ones which point forwards. Thus, a revision file never
    needs to be changed once it has been put into place. Skip-deltas
    and the delta-combiner ensure that we only need to read O(lg(n))
    deltas and combine them together in order to regenerate any given
    plaintext.

  * Transactions which have yet to be committed are represented as one
    file for the changed-file data (appended to as changed files come
    in) and a directory tree mirroring the changed part of the
    transaction.

  * To prepare for a commit, we perform the auto-merge, grab a write
    lock on the repository, marshal the changed-directory data onto
    the end of the changed-file file, and move the new file into
    place. (Alternatively, we could grab the write lock only at the
    very end. In this case, if someone else has slid in a commit
    while we were busy marshalling the directory data, we'll have to
    truncate the changed-file file, re-perform the auto-merge, and
    redo the marshalling. That would penalize large commits while
    benefiting small ones. It would also be more complicated, though
    not tremendously so. Perhaps it could be an option, though it
    would be such an arcane one that I wouldn't expect anyone who
    needs it to ever find it.)
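
To make the (revision, offset) addressing concrete, here's a little
Python sketch of reading one node-revision record out of a revision
file. The record format here (RFC-822-ish "Key: value" header lines
ended by a blank line) is purely a guess for illustration; all I've
committed to above is that the file is plain text.

```python
import io

def read_node_rev(rev_file, offset):
    """Read one node-revision record from an open revision file.

    Hypothetical record format: "Key: value" header lines,
    terminated by a blank line."""
    rev_file.seek(offset)
    headers = {}
    for line in rev_file:
        line = line.rstrip("\n")
        if not line:          # blank line ends the record
            break
        key, _, value = line.partition(": ")
        headers[key] = value
    return headers

# A toy revision file: file data first, node-rev records after it.
# "hello\n" occupies bytes 0-5, so the record starts at offset 6.
rev1 = "hello\nid: 0.0.r1/6\ntype: file\n\n"
node = read_node_rev(io.StringIO(rev1), 6)
```

The point is that a reader never parses the whole revision file; it
seeks straight to the record it was told about and stops at the
record's end.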
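
On the O(lg(n)) claim: one standard skip-delta rule (the exact
base-selection rule is still open, so treat this as a sketch) is to
delta the count-th node-revision in a predecessor chain against the
position you get by clearing the lowest set bit of count. The number
of deltas to combine is then the number of set bits in count, which
is at most lg(count) + 1:

```python
def skip_delta_base(count):
    # Delta against the chain position obtained by clearing the
    # lowest set bit of `count`.  Position 0 (the node's first
    # revision) is stored as plaintext or a self-delta.
    return count & (count - 1)

def deltas_to_combine(count):
    # Deltas read and fed to the delta-combiner to rebuild the
    # plaintext at chain position `count`: one per set bit of
    # `count`, so at most lg(count) + 1.
    steps = 0
    while count > 0:
        count = skip_delta_base(count)
        steps += 1
    return steps
```

So even a node with a thousand revisions needs only about ten deltas
combined, at the cost of each delta being computed against an older,
more distant base than a simple predecessor delta would use.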
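
And here is the commit-finalization sequence as a Python sketch. The
file names ("write-lock", "revs/<N>") are made up for illustration,
and fcntl.flock stands in for whatever write-lock mechanism we end up
with (so this is POSIX-only). The key property is that readers, who
never take the lock, only ever see complete revision files, because
the rename is atomic:

```python
import fcntl
import os
import tempfile

def finalize_commit(repo, txn_file, dir_data, new_rev):
    """Sketch of the commit sequence described above.

    With the repository write lock held: marshal the
    changed-directory data onto the end of the changed-file file,
    make it durable, then move the finished file into place."""
    with open(os.path.join(repo, "write-lock"), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # repository write lock
        with open(txn_file, "ab") as f:
            f.write(dir_data)              # append directory data
            f.flush()
            os.fsync(f.fileno())           # durable before publish
        final = os.path.join(repo, "revs", str(new_rev))
        os.rename(txn_file, final)         # atomic move into place
    return final

# Toy usage: a one-file "repository" in a temp directory.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "revs"))
txn = os.path.join(repo, "txn-1")
with open(txn, "wb") as f:
    f.write(b"changed file data\n")
rev_path = finalize_commit(repo, txn, b"changed dir data\n", 1)
```

Note that the changed-file data was appended to txn-1 incrementally
before finalization; only the directory marshalling and the rename
happen under the lock.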

I have, of course, thought about some alternative designs. Here is my
thinking about two of them:

  * We could have one file per node-rev. This would make it easier to
    do redeltification, but only at the cost of using lots and lots of
    inodes to represent a repository. The overhead would be
    particularly brutal for repositories full of many small files.

  * We could have one file per node, or per node-copy. That's sort of
    CVS-like. This also allows redeltification, but the process of
    committing becomes much more complicated, because you have to
    atomically replace a whole bunch of different node files. That
    means implementing transactions, which is a lot of complexity.

Anyway, comments appreciated. I'm also interested in comments about
how we ought to structure this work so that it doesn't disrupt the
main thrust of Subversion development.

Received on Tue Mar 30 05:52:03 2004

This is an archived mail posted to the Subversion Dev mailing list.
