This is a backgrounder on why I'm updating notes/dump-load-format.txt,
because I think the Subversion crew ought to know.
Some of you may remember svncutter, the Python tool I wrote for slicing and
dicing Subversion dump streams that used to live in your contrib
directory.
Back in 2010 it begat reposurgeon, which is how I ended up trying to
fully document dump streams - I needed that as a spec for
reposurgeon's dump stream reader, which was much more ambitious than
svncutter's. It's still the only one outside of Subversion itself
that handles branch and tag semantics in full generality.
I eventually yanked svncutter out of your contrib directory and added
it to the reposurgeon distribution under the name "repocutter". I had
previously thought that reposurgeon made svncutter obsolete, but it
turns out that a specialized tool for slicing projects out of a
multi-project svn repository still has a use case: *very large*
multiproject repositories. Repocutter only processes them one commit
at a time, so it gets away with a much smaller working set than
reposurgeon, which has to deserialize the whole repository prior to
slicing it up.
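
To illustrate the one-commit-at-a-time idea, here is a minimal Python
sketch. It is not repocutter's actual code, and the names iter_records
and nodes_per_revision are hypothetical. It assumes the usual dump
stream layout: blocks of "Key: value" header lines ended by a blank
line, with an optional Content-length header giving the size in bytes
of the body that follows. Only one record is held in memory at a time.

```python
import io


def iter_records(stream):
    """Yield (headers, body) for each record of a binary dump stream.

    Hypothetical helper: it holds only the current record in memory.
    """
    while True:
        line = stream.readline()
        # Skip the blank separator lines between records.
        while line == b"\n":
            line = stream.readline()
        if not line:
            return  # end of stream
        headers = {}
        # Collect "Key: value" lines up to the blank line ending the block.
        while line not in (b"\n", b""):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode("utf-8")] = value.decode("utf-8")
            line = stream.readline()
        # Content-length, when present, is the body size in bytes, so the
        # body is read by count and never parsed as header lines.
        length = int(headers.get("Content-length", 0))
        body = stream.read(length)
        yield headers, body


def nodes_per_revision(stream):
    """Count node records per revision without loading the whole stream."""
    counts = {}
    current = None
    for headers, _body in iter_records(stream):
        if "Revision-number" in headers:
            current = int(headers["Revision-number"])
            counts[current] = 0
        elif "Node-path" in headers and current is not None:
            counts[current] += 1
    return counts
```

A call such as nodes_per_revision(open("repo.dump", "rb")), where
repo.dump is a placeholder filename, walks the file front to back. The
per-record working set is one header block plus one body, in contrast
to building an in-memory model of every revision first.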
More recently I hit a performance wall while trying to convert the GCC
repository, which is monstrously huge - 359K commits. This brought
even my semi-specialized Great Beast hardware to its knees; 9-hour
test cycles really suck. And I'd already done the hunt-down-and-kill
on O(n**2) internal algorithms during the Emacs repository conversion
back around 2013.
With no good alternatives left, I began moving the reposurgeon suite
from Python to Go. The minor tools, including repocutter, are now
done and verified; reposurgeon itself is about 75% done. While
the semantic gap between Python and Go is much smaller
than you might expect given the taxonomic differences between the
languages, translating 14KLOC of algorithmically dense code would be
rather an epic under even the best circumstances.
As expected, Go's tight machine code is good for at least an order of
magnitude speedup over Python's notoriously high interpretive overhead
- probably more on larger repos, but I don't have actual figures on
There's a wrinkle, though. Two, actually.
One is that I've lost one crucial piece of Python reposurgeon, an
implementation of copy-on-write storage that proved impossible to
translate out of a duck-typed, late-binding language into a
statically-typed early-binding one. And, you guessed it, that
hole is smack in the middle of my dump-stream reader.
The other is that my stream reader still has obscure bugs where
its interpretation of stream files does not quite match that of the
black-box code inside Subversion. These correspond exactly to the
cases where the intended stream semantics is still poorly documented,
around directory copies and flow boundaries.
What it comes down to is that after I get the rest of the Go
translation done and verified (I have a *really good* test suite) I'm
going to have to tear apart and rebuild the dump stream reader.
That's when you'll get updates nailing down the vague bits and most of
the unanswered questions in notes/dump-load-format.txt. Because the
easiest and best way for me to understand what I learn by experiment
is to write it down there.
Eric S. Raymond
Received on 2018-10-29 02:02:20 CET