= How to interpret Subversion dumpfiles =
Eric S. Raymond
v0.2, 2011-12-14

== Introduction ==

The Subversion dumpfile format was first documented at

https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt

but there are a number of points on which that document is incomplete or vague. This document incorporates those notes, expands them, and is intended to replace them. The goal is that it be sufficient for people writing dumpfile interpreters to emulate the actions the dumpfile describes on a versioned filesystem-like store, such as another version-control system.

Open questions:

1. Is delete on a directory with children expected to succeed or fail?

2. Is replace on a nonexistent path expected to fail?

3. Is add on an existing path expected to fail?

4. What are the detailed semantics of "change" with copyfrom? I understand the no-copyfrom case: on a file it's an ordinary text modification, on a directory a pure property change.

== Syntax ==

=== Encoding and delimiters ===

Subversion dumpfiles are plain byte streams. The structural parts are ASCII. Text sections and property key/value pairs may be interpreted as binary data in any encoding by client tools.

A dumpfile consists of four kinds of records. A record is a group of RFC822-style header lines (each consisting of a key, followed by a colon, followed by text data to end of line), followed by an empty spacer line, followed optionally by a body section. If the body section is present, another empty spacer line separates it from the following record.

For forward compatibility, unrecognized headers are ignored.

=== Record types ===

Dumpfiles include four record types. Two, the version stamp and UUID record, consist of single header lines. The bulk of a dumpfile consists of Revision and Node records.

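The record layout described under "Encoding and delimiters" can be sketched in Python roughly as follows. This is a minimal illustration, not a reference reader: the function names `read_headers` and `read_record` are hypothetical, header values are assumed to decode as UTF-8, and the body is taken to be exactly Content-length bytes when that header is present.

```python
import io


def read_headers(stream):
    """Read one header block; return a dict, or None at end of stream."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # end of stream
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:
                return headers             # empty spacer line ends the block
            continue                       # skip spacer lines between records
        key, _, value = line.partition(b":")
        if value.startswith(b" "):
            value = value[1:]              # drop the single space after the colon
        # Header names are ASCII; decoding values as UTF-8 is an assumption.
        headers[key.decode("ascii")] = value.decode("utf-8", "replace")


def read_record(stream):
    """Return (headers, body) for the next record, or None at end of stream."""
    headers = read_headers(stream)
    if headers is None:
        return None
    # A record without a Content-length header is treated as having no body.
    body = stream.read(int(headers.get("Content-length", 0)))
    return headers, body


sample = io.BytesIO(
    b"SVN-fs-dump-format-version: 2\n\n"
    b"UUID: 7bf7a5ef-cabf-0310-b7d4-93df341afa7e\n\n"
    b"Revision-number: 0\nProp-content-length: 10\nContent-length: 10\n\n"
    b"PROPS-END\n\n"
)
records = list(iter(lambda: read_record(sample), None))
```

Because unrecognized headers simply land in the dictionary and are never consulted, a reader built this way ignores them without special handling.
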
A version stamp record is always the first line of the file and looks like this:

-------------------------------------------------------------------
SVN-fs-dump-format-version: <N>\n
-------------------------------------------------------------------

where <N> is replaced by the dump format version. Except where specified, the descriptions in this document apply to all versions of the format.

Versions 2 and later may have a UUID record following the version stamp. It is of the form

-------------------------------------------------------------------
UUID: <uuid>
-------------------------------------------------------------------

where <uuid> is the UUID of the originating repository. An example UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e".

A Revision record has three headers and is always followed by a property section. Expect the following form and sequence:

-------------------------------------------------------------------
Revision-number: <N>
Prop-content-length: <P>
Content-length: <L>
!
-------------------------------------------------------------------

with the Revision-number header always first and the '!' indicating a mandatory empty spacer line.

<P> gives the length in bytes of the following property section. <L> gives the body length of the entire Revision record. These two numbers will be *identical* for a Revision record; the Content-length header is added for the benefit of software that can parse RFC-822 messages.

A Revision record is followed by zero or more Node records (see below).

=== Property sections ===

A Revision record *must* have a property section, and a Node record *may* have a property section. Every record with a property section has a Prop-content-length header.

A property section consists of pairs of key and value records and is ended by a fixed trailer. Here is an example attached to a Revision record:

-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END
-------------------------------------------------------------------

The fixed trailer is "PROPS-END\n" and its length is included in the Prop-content-length. Before it, each K and V record consists of a header line giving the length of the key or value content in bytes. The content follows. The content is itself always followed by \n, which is not counted in that length.

In version 3 of the format, a third type 'D' of property record is introduced to describe property deletion. This feature will be described later, in the specification of delta dumps.

=== Node records ===

Each Revision record is followed by zero or more Node records. Node records have the following sequence of header lines:

-------------------------------------------------------------------
Node-path: path/to/node/in/filesystem
[Node-kind: {file | dir}]
Node-action: {change | add | delete | replace}
[Node-copyfrom-rev: <rev>]
[Node-copyfrom-path: <path>]
[Text-copy-source-md5: <blob>]
[Text-content-md5: <blob>]
[Text-content-length: <T>]
[Prop-content-length: <P>]
[Content-length: Y]
!
-------------------------------------------------------------------

Dump decoders should be prepared for the optional lines after Node-action to be in any order, except that Content-length is always last if it is present.

A Node record describes an action on a path relative to the repository root, and always begins with the Node-path specification.

The Node-kind line indicates whether the path is a file or directory. It may be (and usually is) absent if the node action is a delete.

The Node-action line is always present and specifies the type of operation for this node. The operations will be described in detail later in this document.

Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be present, or neither will be. They pair to describe a copy source for the node; copy-source semantics will be described in detail later in this document.

The Text-content-md5 and Text-copy-source-md5 lines are hash integrity checks and will be present only if Text-content-length and the copyfrom pair (respectively) are also present. A decoder may use them to verify that the source content they refer to has not been corrupted.

Text-content-length will be present only when there is a text section. Zero is a legal value for this length, indicating an empty file.

Prop-content-length will be present only when there is a properties section.

Content-length will be present if there is either a text or a properties section. When both are present, its value is the sum of Prop-content-length and Text-content-length. A node does not always have a body. In particular, a delete operation cannot have either section. Some other operations that use copyfrom sources may also not have either.

Again, the '!' stands in for a mandatory empty line following the RFC822-style headers. A body may follow.

== Semantics ==

=== The kinds of things ===

There are four kinds of things described by a dumpfile: paths, properties, content, and flows. The distinctions among content, paths, and flows matter for understanding some operations.

A path is a filesystem location (a file or directory). There are two kinds of paths in a dumpfile: node paths and copy sources.

Properties are key-value pairs associated with revisions or paths. Subversion interprets and reserves some properties, those beginning with "svn:". Others are not interpreted by Subversion; they may be set and read for the convenience of other applications, such as repository browsers or translators.

A flow is a sequence of actions on a file or directory path that is considered to be a single history for change-tracking purposes. Creating a flow tells Subversion that you want to track the history of the path or paths it contains. Destroying a flow breaks the chain of history; changes will not be tracked across the break, even if another flow is created at the same path.

Content is what file paths point at (one timewise slice of a flow). It is the payload of program source code, documents, images, and so forth that a version control system actually manages.

A node describes a change in properties, the addition or deletion of a flow, or a change in content. It must do at least one of these things, otherwise it would be a no-op and omitted.

=== The kinds of operations ===

.File operations
|======================================================================
|                           | add      | delete | replace  | change   |
|Can have text section?     | optional | no     | optional | optional |
|Can have property section? | optional | no     | optional | optional |
|Can have copy source?      | optional | no     | optional | optional |
|======================================================================

.Directory operations
|======================================================================
|                           | add      | delete | replace  | change   |
|Can have text section?     | no       | no     | no       | no       |
|Can have property section? | optional | no     | optional | required |
|Can have copy source?      | optional | no     | optional | no       |
|======================================================================

A node represents an operation that does one of four things: add, delete, change, or replace.

Nodes can carry content in one (or both!) of two ways: from a text section or from a copy source (that is, a copy-path and copy-revision pair).

Giving a copy source appends the node to the flow of which that source is part; when you 'add', or 'replace' with a copy source, the content at the path becomes a copy of the source (but see below for a qualification about directories). [See the open question about the semantics of "change"]

Giving a text section also changes the content of the flow. In the (unusual) case that a node has both a copy source and a text section, the correct semantics is to attach the path to the source flow and then change the content.

An add operation creates a new flow for a file or directory. If a flow already exists at that path, the operation fails. The initial content for the new flow may come from a text section attached to the node, or from a copy source. A node representing an add operation may have a property section. Directory adds never have text content, but may have a copyfrom source; file adds always have one or the other, but not both.

A delete operation deletes a flow and its content. If a flow does not exist at that path, the operation fails. A delete node may not have text, properties, or a copy source. A subsequent add at the same path will create a new and different flow with its own history.

A change operation modifies an existing file or directory path: on a file it changes text content, properties, or both; on a directory it changes properties only. If a flow does not exist at that path, the operation fails. A directory change node must have properties, and may not have text or a copy source. (For file change nodes with a copy source, see the open question in the introduction.)

A replace operation behaves exactly like a delete followed by an add (destroying an old flow, producing a new one) when it has no copy source. When a replace has a copy source, it produces a new flow with history extending back through the copy source. A node representing a replace operation may have a property section.

The main reason "replace" exists is that it helps sequential processors of the dump stream avoid possibly notifying about multiple actions on the same path.

It is even possible to have a replace with a copyfrom source *and* text, such as would result from this on the client side:

-------------------------------------------------------------------
$ svn rm dir/file.txt
$ svn cp otherdir/otherfile.txt dir/file.txt
$ echo "Replacement text" > dir/file.txt
$ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too."
-------------------------------------------------------------------

=== Some details about copyfroms ===

Interpreting Node-copyfrom-path for file copies is straightforward; the target pathname gets the contents of the source pathname.

Directory copies (the primitive beneath branching and tagging) are tricky. For each source path under the source directory, a new path is generated by removing the head segment of the pathname that is the source directory. That new path under the target directory gets the content of the source path.

After this operation:

-------------------------------------------------------------------
Node-path: x/y/z
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 10
Node-copyfrom-path: a/b/c
-------------------------------------------------------------------

the file a/b/c/d will have been copied to x/y/z/d.

A single revision may include multiple copyfrom nodes, even multiple copyfroms to the same directory, even mixed directory and file copies to the same directory.

=== Properties and persistence ===

The properties section of a Revision record consists of some subset of the three reserved per-commit properties: svn:author, svn:date, and svn:log. These properties do not persist to later revisions.

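A property section laid out as described under "Property sections" can be decoded with a short loop. The following Python sketch assumes the non-delta layout (alternating K and V records ended by PROPS-END); the function name `parse_props` is hypothetical, and keys and values are kept as bytes because the format does not fix their encoding.

```python
def parse_props(section):
    """Parse a property section (bytes) into a dict of key -> value (bytes)."""
    props = {}
    pos = 0
    key = None
    while True:
        eol = section.index(b"\n", pos)
        header = section[pos:eol]          # e.g. b"K 6", b"V 7" or b"PROPS-END"
        pos = eol + 1
        if header == b"PROPS-END":
            return props
        tag, length = header.split()
        length = int(length)
        content = section[pos:pos + length]
        pos += length + 1                  # skip the content and its trailing \n
        if tag == b"K":
            key = content                  # remember the key until its V arrives
        elif tag == b"V":
            props[key] = content
        else:
            raise ValueError("unexpected property record type: %r" % (tag,))


revprops = parse_props(
    b"K 6\nauthor\nV 7\nsussman\nK 3\nlog\nV 33\n"
    b"Added two files, changed a third.\nPROPS-END\n"
)
```

The input here is the 80-byte property section of the revision 1422 example shown earlier. Reading each content by its declared length, rather than by searching for a newline, matters because values such as log messages may themselves contain newlines.
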
The key thing to know about Node properties is that they are persistent, once set, until modified by a future property section on the same path.

Normally, a dumpfile re-lists the entire property set for a directory or file in every node record that changes any part of it. (But see the material on delta dumps for an exception.) This implies that to delete a given property from a path, a dumpfile generator will issue a node with all other properties listed in it; to delete all properties from a path, the dumpfile generator will simply issue a node with an empty properties section. Note that this is different from an *absent* properties section, which will change no properties and will be associated with a change to content!

== An example ==

Here's an example of revision 1422, which added a new directory "baz", added a new file "bop" inside it, and modified the file "foo.c":

-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END

Node-path: bar/baz
Node-kind: dir
Node-action: add
Prop-content-length: 35
Content-length: 35

K 10
svn:ignore
V 4
TAGS
PROPS-END

Node-path: bar/baz/bop
Node-kind: file
Node-action: add
Prop-content-length: 76
Text-content-length: 54
Content-length: 130

K 14
svn:executable
V 2
on
K 12
svn:keywords
V 15
LastChangedDate
PROPS-END
Here is the text of the newly added 'bop' file.
Whee.

Node-path: bar/foo.c
Node-kind: file
Node-action: change
Text-content-length: 102
Content-length: 102

Here is the fulltext of my change to an
existing /bar/foo.c.  Notice that this file has
no properties.
-------------------------------------------------------------------

== Format variants ==

=== Version 3 format ===

Version 3 format is a delta dump; text changes are represented as diffs against the original file, and properties as incremental changes to a persistent set (that is, a property section does not necessarily implicitly clear the property set on a path before the new property settings are evaluated).

This change is a space optimization. It requires additional computing time to integrate the diff history.

Version 3 is generated by SVN versions 1.1.0-present, if requested by the user.

This format is equivalent to the version 2 format except for the following:

1. The format starts with the new version number of the dump format ("SVN-fs-dump-format-version: 3\n").

2. There are several new optional headers for node changes:

-------------------------------------------------------------------
[Text-delta: true|false]
[Prop-delta: true|false]
[Text-delta-base-md5: blob]
[Text-delta-base-sha1: blob]
[Text-copy-source-sha1: blob]
[Text-content-sha1: blob]
-------------------------------------------------------------------

The default value for the boolean headers is "false". If the value is set to "true", then the text and property contents will be treated as deltas against the previous contents of the node (as determined by copy history for adds with history, or by the value in the previous revision for changes, just as with commits).

Property deltas have the same format as regular property lists except that (1) properties with the same value as in the previous contents of the node are not printed, and (2) deleted properties will be written out as a D record, just as a regular property is printed, but with the "K " changed to a "D " and with no value part.

Text deltas are written out as a series of svndiff0 windows.

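The difference between a full property section and a property delta can be sketched as a small Python function. This is an illustration under stated assumptions rather than the loader's actual logic: `apply_prop_records` is a hypothetical name, and the property records are assumed to have been parsed already into `(tag, key, value)` tuples, with tag "K" for a key/value pair and "D" for a deletion (whose value is None).

```python
def apply_prop_records(previous, records, prop_delta=False):
    """Return the property set of a path after a node's property section.

    previous   -- dict of the properties the path had before this node
    records    -- iterable of (tag, key, value) tuples (hypothetical pre-parsed form)
    prop_delta -- True when the node carried "Prop-delta: true"
    """
    # A delta starts from the previous set; a full section starts from empty.
    new = dict(previous) if prop_delta else {}
    for tag, key, value in records:
        if tag == "K":
            new[key] = value               # set or overwrite a property
        elif tag == "D":
            new.pop(key, None)             # delete; only meaningful in delta dumps
        else:
            raise ValueError("unknown property record type: %r" % (tag,))
    return new


before = {"svn:ignore": "TAGS", "svn:keywords": "LastChangedDate"}
delta_records = [("D", "svn:ignore", None), ("K", "svn:executable", "on")]

after_delta = apply_prop_records(before, delta_records, prop_delta=True)
after_full = apply_prop_records(before, [("K", "svn:executable", "on")])
```

With `prop_delta=True` the untouched svn:keywords property survives; without it the section is treated as the complete new property set, so only svn:executable remains.
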
If Text-delta-base-md5 is provided, it is the checksum of the base to which the text delta is applied; note that older versions (pre-1.5) of 'svnadmin load' may ignore the checksum.

Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are not currently used by the loader. They are written by 1.6-and-later versions of Subversion so that future loaders can optionally choose which checksum to use for checking for corruption.

=== Archaic version 1 format ===

There are actually two types of version 1 dump streams. The regular ones have been generated since r2634 (svn 0.14.0). Older ones also claim to be version 1, but lack the Prop-content-length and Text-content-length fields in the block header. In those days there *always* was a properties block.

This note is included for historical completeness only, as it is highly unlikely that any Subversion instances that old remain in production.

== Credits ==

This derives in part from an earlier document by Ben Sussman.

[Designers of the format should be credited here.]

== Ancient history ==

Old discussion:

(This file started as a proposal, preserved here for posterity.)

A proposal for an svn filesystem dump/restore format.

=== Two problems we want to solve ===

1. When we change our node-id schema, we need to migrate all of our data (by dumping and restoring).

2. Serves as a backup format. Could be read by other software tools someday.

=== Design Goals ===

A. Written as two new public functions in svn_fs.h. To be invoked by new 'svnadmin' subcommands.

B. Format uses only timeless fs concepts.

The dump format needs to reference concepts that we *know* are general enough to never change. These concepts must exist independently of any internal node-id schema, or any DB storage backend. In other words, we're talking about the basic ideas in our original "design spec" from May 2000.

=== Format Semantics ===

Here are the timeless semantics of our fs design, the things that would be stored in our dump format.

- A filesystem is an array of trees. Each tree is called a "revision" and has unversioned properties attached.

- A revision has a tree of "nodes" hanging off of it. Actually, the nodes in the filesystem form a DAG. A revision always points to an initial node that represents the 'root' of some tree.

- The majority of a tree's nodes are hard-links (references) to nodes that were created in earlier trees.

- A node contains
  * versioned text
  * versioned properties
  * predecessor history: "which node am I a variant of?"
  * copy history: "which node am I a copy of?"

The history values can be non-existent (meaning the node is completely new), or can have a value of {revision, path}.

=== Refinement of proposal #2: ===

(after discussion with gstein)

Each node starts with RFC822-style headers at the top. The final header is a 'Content-length:', followed by the content, so record boundaries can be inferred.

The content section has two implicit parts: a property hash, and the fulltext. The division between these two sections is implied by the "PROPS-END\n" tag at the end of the prophash. In the case of a directory node or a revision, only the prophash is present.

//End of document.