I have just finished writing a full parser for Subversion dumpfiles.
The next release of reposurgeon will have the ability to read them
directly, though not to write them.
In the process, I've looked very closely at the file
https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
and discovered a number of problems with it. I have commit
privileges on the Subversion repo; I was given them in connection
with svncutter. I'm willing to fix up that file, but want to check
that I wouldn't be stepping on any toes by doing so.
My notes on the format follow for review by whoever is the responsible
maintainer. Please look in particular at the sections bracketed with
[? and ?].
# The Subversion dumpfile format is documented at
#
# https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
#
# but there are a number of points on which that document is incomplete or
# vague. The following notes fill in some gaps and document the assumptions
# on which reposurgeon's code is based.
#
# Below, [? ?] flags assertions I am relying on but am relatively
# unsure of. These need to be checked further.
#
# First, syntax. It is implied, but not expressed, that Revision and Node
# records are in an RFC822-like format - headers followed by a spacer line
# followed by a body. A Revision record begins with a Revision-number
# line; a Node record begins with a Node-path line.
#
# Each header normally ends with a Content-length line giving the
# length of the record body in bytes *excluding the spacer line*.
# But some records can never have a body and thus have no
# Content-length line. A node describing a copy operation ends with a
# Node-copyfrom-path line and has no content. A node describing a delete
# action ends with the Node-action line and has no content. Each of these
# records must still be followed by a spacer line. [?These are the only
# records that can end without a Content-length line.?]
#
# The body of a Revision record consists entirely of a property
# section. The body of a Node record consists of an optional property
# section followed by an optional text section (one of the two will
# always be present; otherwise the node would be a no-op). When a
# property section is present, its portion of the record length is
# given by a Prop-content-length header. When a text section is
# present, its portion of the record length is given by a
# Text-content-length header. A property section is always terminated
# with PROPS-END\n; the length of that terminator is *included* in the
# Prop-content-length.
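#
# As an illustration of the record layout just described, here is a
# minimal sketch of a record reader. It is not reposurgeon's actual
# code; the name read_record and the choice to return None for an
# absent section are assumptions made for this example, and header
# values are assumed to decode as UTF-8.

```python
import io


def read_record(stream):
    """Read one record; return (headers, props, text).

    props or text is None when the matching length header is absent,
    which keeps "absent" distinct from "present but empty".
    """
    line = stream.readline()
    while line == b"\n":                # skip spacer lines before the header
        line = stream.readline()
    headers = {}
    while line not in (b"", b"\n"):     # a spacer line (or EOF) ends the header
        name, _, value = line.rstrip(b"\n").partition(b": ")
        headers[name.decode("utf-8")] = value.decode("utf-8")
        line = stream.readline()
    # Records such as deletes carry no Content-length and no body.
    body = stream.read(int(headers.get("Content-length", 0)))
    plen = int(headers.get("Prop-content-length", 0))
    tlen = int(headers.get("Text-content-length", 0))
    if plen + tlen != len(body):
        raise ValueError("section lengths do not add up to Content-length")
    props = body[:plen] if "Prop-content-length" in headers else None
    text = body[plen:] if "Text-content-length" in headers else None
    return headers, props, text


# A Revision record with an empty property section, then a delete node.
SAMPLE = (
    b"Revision-number: 1\n"
    b"Prop-content-length: 10\n"
    b"Content-length: 10\n"
    b"\n"
    b"PROPS-END\n"
    b"\n"
    b"Node-path: trunk/README\n"
    b"Node-action: delete\n"
    b"\n"
)

stream = io.BytesIO(SAMPLE)
revision = read_record(stream)
delete_node = read_record(stream)
```

# Note that the 10 bytes of PROPS-END\n are counted in both
# Prop-content-length and Content-length, and that the delete node
# is read correctly even though it has no Content-length line.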
#
# A properties section consists of a sequence of paired K and V (key and
# value) records. The header of each record is a body length. The body
# begins on the next line and is an uninterpreted byte stream of the
# specified length. A spacer \n is always inserted after the body
# so the next K or V record (or the terminator) will begin at the start
# of a text line. The last line is always PROPS-END\n.
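#
# The K/V layout above can be sketched as a small parser. This is an
# illustrative sketch only, not reposurgeon's code; the function name
# parse_props is hypothetical, and it assumes that K and V are the
# only record types that appear before PROPS-END.

```python
def parse_props(data):
    """Parse a property section (bytes ending in PROPS-END) into a dict."""
    props = {}
    key = None
    pos = 0
    while True:
        eol = data.index(b"\n", pos)
        header = data[pos:eol]
        if header == b"PROPS-END":
            break
        tag, length = header.split()
        length = int(length)
        # The body is an uninterpreted byte stream of the given length;
        # it may itself contain newlines.
        body = data[eol + 1:eol + 1 + length]
        # Step over the body and the spacer \n that follows it.
        pos = eol + 1 + length + 1
        if tag == b"K":
            key = body
        elif tag == b"V":
            props[key] = body
        else:
            raise ValueError("unexpected property record type: %r" % tag)
    return props


section = (
    b"K 10\nsvn:author\n"
    b"V 3\nesr\n"
    b"K 7\nsvn:log\n"
    b"V 11\nfirst\ndraft\n"
    b"PROPS-END\n"
)
props = parse_props(section)
```

# Reading bodies by length rather than by line is what allows a value
# such as the two-line log message above to pass through intact.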
#
# The property section of a Revision record consists of some subset
# of the three reserved per-commit properties: svn:author, svn:date,
# and svn:log. Because a Revision record has no text section, it
# follows that the lengths given in Prop-content-length and
# Content-length are always the same.
#
# Then, semantics. The three areas where the existing documentation
# is somewhat vague are (a) the persistence of properties, and in
# particular how to delete them, (b) the meaning of the actions ("change",
# "add", "delete", "replace"), and (c) the interpretation of the
# copypath/copyrev properties.
#
# The key thing to know about properties is that the format re-lists
# the entire property set (after modification) for a directory or file
# in every node record that changes either property or text for that
# file.
#
# This implies that to delete a given property from a path, a dumpfile
# generator will issue a node with all other properties listed in it;
# to delete all properties from a path, the dumpfile generator will
# simply issue a node with an empty properties section. Note that this
# is different from an *absent* properties section, which will change
# no properties and will be associated with a change to content!
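#
# The distinction between an absent and an empty properties section
# can be made concrete with a small sketch. The function name
# apply_node_props is hypothetical, and representing an absent
# section as None is an assumption of this example.

```python
def apply_node_props(current, node_props):
    """Return the property set of a path after a node is applied.

    node_props is None when the node has no properties section at all,
    and a (possibly empty) dict when a section is present.
    """
    if node_props is None:
        # Absent section: properties are left untouched.
        return dict(current)
    # Present section: it re-lists the entire set, so it replaces
    # whatever was there before. An empty dict deletes everything.
    return dict(node_props)


before = {"svn:eol-style": "native", "svn:executable": "*"}
unchanged = apply_node_props(before, None)
one_deleted = apply_node_props(before, {"svn:eol-style": "native"})
all_deleted = apply_node_props(before, {})
```

# Deleting svn:executable here is expressed purely by omission: the
# node lists svn:eol-style and nothing else.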
#
# Text sections work the same way. When present, a text section on a
# file node changes the contents of the file; an absent text section
# means only the file properties change.
#
# The "add" action is used to add new directories and file content.
# Directory adds never have text content; file adds always do. Either type
# may have properties [?but the Subversion client tools never generate
# an add node with properties?].
#
# The "change" action changes text or properties or both. It may also
# be used on a directory copy, meaning that the contents of the copy
# should add to and not replace the contents of the target directory.
#
# The "delete" action removes the path and never has properties, as
# they would vanish along with the path.
#
# The "replace" action [?is only issued with directory copies, and?]
# signifies that the existing contents of the directory should be
# removed before the copy.
#
# Interpreting copyfrom_path for file copies is straightforward; the
# target pathname gets the contents of the source pathname.
#
# Directory copies (the primitive beneath branching and tagging) are
# tricky. For each source path under the source directory, a new path
# is generated by removing the head segment of the pathname that is
# the source directory. That new path under the target directory gets
# the content of the source path.
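#
# The path rewriting for a directory copy can be sketched as follows.
# This is an illustration under simplifying assumptions, not
# reposurgeon's implementation: the repository state at the source
# revision is modeled as a plain dict from file path to content, and
# the name copy_directory is hypothetical.

```python
def copy_directory(paths, source_dir, target_dir):
    """Map every path under source_dir to its new path under target_dir."""
    prefix = source_dir.rstrip("/") + "/"
    target = target_dir.rstrip("/")
    copied = {}
    for path, content in paths.items():
        if path.startswith(prefix):
            # Remove the head segment that is the source directory and
            # graft the remainder onto the target directory.
            copied[target + "/" + path[len(prefix):]] = content
    return copied


tree = {
    "trunk/README": "readme text",
    "trunk/src/main.c": "int main(void) { return 0; }",
    "branches/old/README": "stale text",
}
tag = copy_directory(tree, "trunk", "tags/1.0")
```

# Matching on the prefix with its trailing slash keeps a sibling such
# as "trunk-old/README" from being swept into a copy of "trunk".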
#
# A single revision may include multiple copyfrom nodes, even multiple
# copyfroms to the same directory, even mixed directory and file copies
# to the same directory; [?Subversion client tools never generate such
# mixed copies, but?] I have seen the results of cvs2svn doing it.
#
# Note: The Subversion notes show a Node record always ending with
# a Content-length header. This is erroneous (node records can end with
# a Node-copyfrom-path or Node-action line) and may represent a bug.
--
Eric S. Raymond
The spirit of resistance to government is so valuable on certain occasions,
that I wish it always to be kept alive. It will often be exercised when
wrong, but better so than not to be exercised at all. I like a little
rebellion now and then. -- Thomas Jefferson, letter to Abigail Adams, 1787
Received on 2011-12-13 08:14:08 CET