Ben Collins-Sussman wrote:
>
>A proposal for an svn filesystem dump/restore format. Questions are
>at the bottom.
>
>Two problems we want to solve
>=============================
>
> 1. When we change our node-id schema, we need to migrate all of our
> data (by dumping and restoring).
>
> 2. Serves as a backup format. Could be read by other software tools
> someday.
>
>
>Design Goals
>============
>
> A. Written as two new public functions in svn_fs.h. To be invoked
> by new 'svnadmin' subcommands.
>
> B. Format uses only timeless fs concepts.
>
> The dump format needs to reference concepts that we *know* are
> general enough to never change. These concepts must exist
> independently of any internal node-id schema, or any DB storage
> backend. In other words, we're talking about the basic ideas in
> our original "design spec" from May 2000.
>
>
>Format Semantics
>================
>
>Here are the timeless semantics of our fs design -- the things that
>would be stored in our dump format.
>
> - A filesystem is an array of trees.
> Each tree is called a "revision" and has unversioned properties attached.
>
I'd generalize this: a revision is a DAG, not a tree (think reference
nodes), maybe even a general directed graph. Although we can only create
trees at the moment, we know that limitation won't last.
> - A revision has a tree of "nodes" hanging off of it.
>
> - The majority of a tree's nodes are hard-links (references) to
> nodes that were created in earlier trees.
>
> - A node contains
>
> - versioned text
> - versioned properties
> - predecessor history: "which node am I a variant of?"
> - copy history: "which node am I a copy of?"
>
- type
Very necessary, imho. We have files and directories today, and can
predict (internal) references and (external) symlinks.
Predecessor and copy history should be merged, and generalized. Any node
can have any number of ancestors and descendants. The fact that we're
storing copy history separately right now is an artifact of the current
node id scheme, which encodes single-predecessor revision history.
(Which is nuts, but we're all aware of that now. :-) Node history is a
DAG, too, and should be represented as such.
> The history values can be non-existent (meaning the node is
> completely new), or can have a value of {revision, path}.
>
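To make the merged history idea concrete, here is a rough sketch in
Python. Nothing like this exists in the tree; NodeRef, Node and
history() are names invented for illustration. Each node carries a type
and a list of ancestors addressed as {revision, path}. An empty list
means the node is completely new, and more than one entry covers copies
and (someday) merges.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class NodeRef:
    """Hypothetical schema-independent node address: {revision, path}."""
    revision: int
    path: str


@dataclass
class Node:
    """Illustrative node record; field names are assumptions, not svn_fs API."""
    kind: str                                   # "file" or "dir" today
    text: bytes = b""                           # versioned text
    props: Dict[str, str] = field(default_factory=dict)      # versioned props
    ancestors: List[NodeRef] = field(default_factory=list)   # empty => new node


def history(nodes: Dict[NodeRef, Node], start: NodeRef) -> List[NodeRef]:
    """Walk the ancestry DAG from `start`, visiting each node only once."""
    seen, order, stack = set(), [], [start]
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue                            # shared ancestor, already visited
        seen.add(ref)
        order.append(ref)
        stack.extend(nodes[ref].ancestors)
    return order
```

Note that nothing here distinguishes "predecessor" from "copy source";
both are just edges in the same graph.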
>
>Implementation (Questions)
>==============
>
> * file format
>
> Although it's tempting to use XML (easy to output, easy to write a
> parser), gstein pointed out that it may create more problems in
> the long run. Storing binary data (and escaping) in XML can be
> painful; scanning for the escape characters can really slow down
> an import; just imagine trying to store an XML file! Even though
> XML may be more convenient at the outset, we'll probably end up
> burning lots of time trying to work around these other issues
> later on.
>
> For this reason, we're thinking some kind of simple binary format.
>
What, tar? Zip? Shar, maybe? :-)
My vote would be to keep contents separate from structure, and encode
structure in XML. Similar to what we do in the tree diff, except without
the inline base64'd content data.
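A minimal sketch of what I mean, in Python. The element and attribute
names and the dump_revision() helper are made up for this example, and
keying contents by SHA-1 is just one assumed way of pointing from
structure to data. The XML only describes the tree; the bytes live in a
separate store and never need escaping.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def dump_revision(revnum: int, files: Dict[str, bytes]) -> Tuple[str, Dict[str, bytes]]:
    """Return (XML structure, content store) for one revision.

    Element and attribute names are invented for this sketch.
    """
    blobs: Dict[str, bytes] = {}
    rev = ET.Element("revision", {"number": str(revnum)})
    for path in sorted(files):
        data = files[path]
        key = hashlib.sha1(data).hexdigest()    # contents stay outside the XML
        blobs[key] = data
        ET.SubElement(rev, "node", {"path": path, "kind": "file", "content": key})
    return ET.tostring(rev, encoding="unicode"), blobs
```

Binary files, or files that are themselves XML, cost nothing extra this
way, since only the key ever appears in the structure document.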
> * should we bother to implement 'diffy' storage of texts in our
> format? My instinct is "no". Dumping and restoring a filesystem
> is a rare operation, so we don't need to be so paranoid about disk
> space usage. It would be extra work to implement diffy-storage,
> and imports would probably be safer (and faster) if we had nothing
> but fulltexts in our dump.
>
+1. KISS.
> * Reading through our 'libsvn_fs/structure' document, it seems that
> the only data we're not saving in the dump is a node's "Created
> Revision" (CR). It's not clear to me that this is a timeless
> concept. It certainly has no relevance in the new, impending fs
> schema. It seems more like an optimization for our current
> node-id schema. Do others agree?
>
+1. There's no reason to store stuff that can be generated on import.
What's more, requiring that the import be made into an empty repository
(therefore keeping the same global revision numbers) is a limitation we
don't really need.
> Let me be clear here: the CR is still a useful concept. For
> example, I like very much that this value is cached in my working
> copy and shows up in 'svn status -v'. I will always want to know
> "in what revision did foo.c last change?" When we switch to the
> new schema, we'll still want to keep this concept around -- it
> will simply have a different implementation under the hood.
>
> But still, I don't think it needs to be saved in a dump format.
> When we re-import a filesystem, the information can be *derived*
> by whatever schema exists under the hood -- possibly on-the-fly,
> as we're importing. Am I making sense?
>
Right. No CRs in the dump format, they'd just cramp our style. :-)
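For what it's worth, deriving them on the fly is cheap. A rough sketch
in Python, assuming the dump is a sequence of fulltext snapshots (path
-> contents); derive_created_revs() and its first_rev parameter are
hypothetical, and the sketch looks only at text, not properties:

```python
from typing import Dict, List


def derive_created_revs(revisions: List[Dict[str, bytes]],
                        first_rev: int = 1) -> List[Dict[str, int]]:
    """For each imported revision, map path -> revision where it last changed.

    `first_rev` lets the import land in a non-empty repository.
    """
    result: List[Dict[str, int]] = []
    prev_tree: Dict[str, bytes] = {}
    prev_cr: Dict[str, int] = {}
    for offset, tree in enumerate(revisions):
        revnum = first_rev + offset
        cr: Dict[str, int] = {}
        for path, data in tree.items():
            if path in prev_tree and prev_tree[path] == data:
                cr[path] = prev_cr[path]        # unchanged: carry the old CR
            else:
                cr[path] = revnum               # new or modified in this revision
        result.append(cr)
        prev_tree, prev_cr = tree, cr
    return result
```

Because the numbers are computed relative to first_rev, the same dump
loads into an empty repository or on top of existing revisions.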
--
Brane Čibej <brane_at_xbc.nu> http://www.xbc.nu/brane/
Received on Wed Apr 24 00:12:27 2002