
Re: Problems with the documentation of Subversion dump format

From: Eric S. Raymond <esr_at_thyrsus.com>
Date: Tue, 13 Dec 2011 16:24:00 -0500

Daniel Shahaf <d.s_at_daniel.shahaf.name>:
> Eric S. Raymond wrote on Tue, Dec 13, 2011 at 13:34:03 -0500:
> > Self-defense, I assure you. I'm attempting to build a better SVN-to-DVCS
> > converter than exists anywhere now, and the best way to understand the
> > dump format well enough to do that is to document it in detail.
> Curious if you also intend to support X-to-SVN conversion in your tool.

Probably not, though I have considered it and might change my mind.

Here's what's going on. I've explained reposurgeon here before; to
recap, it uses the fact that lots of DVCSes can speak git import
streams to act as a common editor for all of them. Essentially, it
says to the VCS's exporter "give me a stream dump", deserializes the
result, lets you edit it, then re-serializes the new state and feeds
it to an importer.
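
To make the deserialize/edit/re-serialize loop concrete, here is a minimal Python sketch. It is not reposurgeon's code; the names parse_stream, edit_data and serialize are hypothetical, and it assumes the stream uses only the exact-byte-count `data <count>` form (the delimited `data <<EOF` form is passed through as ordinary lines and is not editable here).

```python
import io


def parse_stream(raw: bytes):
    """Split an import stream into ('line', bytes) and ('data', bytes) events.

    Assumption: payloads use the exact-byte-count form 'data <count>'.
    """
    events = []
    buf = io.BytesIO(raw)
    while True:
        line = buf.readline()
        if not line:
            break
        if line.startswith(b"data ") and not line.startswith(b"data <<"):
            count = int(line[5:].strip().decode("ascii"))
            # The payload is exactly `count` bytes and may contain newlines.
            events.append(("data", buf.read(count)))
        else:
            events.append(("line", line))
    return events


def edit_data(events, fn):
    """Apply fn to every data payload; command lines are left untouched."""
    return [(kind, fn(payload) if kind == "data" else payload)
            for kind, payload in events]


def serialize(events) -> bytes:
    """Re-emit the stream, recomputing each byte count from its payload."""
    out = []
    for kind, payload in events:
        if kind == "data":
            out.append(b"data %d\n" % len(payload))
        out.append(payload)
    return b"".join(out)
```

In practice the input would come from something like `git fast-export --all` and the output would be piped to `git fast-import`. Recomputing the counts in serialize is what lets an edit change a payload's length without corrupting the stream.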

This works nicely for git and hg; a little less well, though
acceptably, for bzr. bzr's problem is that it's confused
about whether its unit of work is a whole repo or a sort of
detached branch thingy; its importer thinks one thing, its
exporter thinks another, and some irritations ensue.

I've wanted to teach reposurgeon to speak svn for a while now. The
problem is, I've looked at a half-dozen exporters from svn to git
import streams, and they all *suck*. It seems like everybody gets to
about the same point - just before doing the analysis to map svn
branches to git/hg-like branches - and gives up. And you can't do
that - in gitspace, if you don't have a theory of what branches are,
you can't get the parent/child relationships right. In svnspace, you
can't even detect tags properly.

Most of these tools (David Barr's svn-fe, Gustavo Niemeyer's
svn2git, Chris Lee's svn-fast-export.py, a couple random svn2gits on
github and gitorious, some others I've forgotten now) *only work on
linear repos* - they've all got shamefaced comments saying branches
aren't handled yet.

There are only two exceptions I know of to the lossage. One has you
writing complicated rules in a minilanguage to define the branch
mapping: equal lossage, other direction. The other is git-svn, which
does a reasonable job on repositories close to standard layout if you
hint at it right, but is really designed for live gatewaying rather than
conversions. Among other things it doesn't lift tags.

When I first shipped reposurgeon, I prodded this list to solve the
problem - have an official exporter. That didn't happen, so I decided
to solve the problem from my end. Wrote a zero-configuration
branch-mapping algorithm that should work for 99% of cases and punts
to something usable on the other 1%. Got it to lift svn tags to real
git tags. I'm ahead of the pack already.

The only reason I haven't shipped yet is that some weird things
cvs2svn generates give my dumpfile importer indigestion; I'm working
on that, and it's the exact reason I need to understand the format.

(I should mention here that I tried a different approach first -
before I wrote the dumpfile parser I was scraping svn repos with a
harness wrapped around the Subversion CLI tools, sort of a replay
attack. Had to abandon that because it was *hideously* slow - over 8
hours to suck in a repo with around 3K commits. And yes, that was the
CLI tools being poky; the stream parser takes about 8 minutes on the
same repo.)

There is really only one even moderately hard problem here, and that
is the branch mapping. Once you beat that, up-conversion from svn to
a DVCS works very nicely. You're adding information, not losing it.
(There's one minor exception; Subversion's user-set properties don't
map well to plain git-import streams. You need the bzr properties
extension for that, which git itself chokes on.)

On the other hand, when you *start* in gitspace, mapping back down to
the set of abstractions svn can handle is really lossy. You have to
throw away the domains on committer names, all the author fields, real
(annotated) tags, and branch merges. DVCS merges don't really map to
Subversion merges at all well; the svn version is more like what git/hg
folks call cherry-picking.
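
A rough Python sketch of that down-conversion shows where the information goes. Everything in it is an assumption made for illustration: the GitCommit and SvnRevision classes, their field names, and the down_convert function are hypothetical, identities are assumed to look like "Name <user@host>", and annotated tags are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class GitCommit:
    author: str      # "Name <user@host>"
    committer: str   # "Name <user@host>"
    parents: list    # more than one parent means a real merge
    message: str


@dataclass
class SvnRevision:
    svn_author: str  # bare user name, as stored in svn:author
    log: str
    dropped: list = field(default_factory=list)


def local_part(ident: str) -> str:
    """Return 'user' from 'Name <user@host>'."""
    addr = ident[ident.index("<") + 1:ident.index(">")]
    return addr.split("@", 1)[0]


def down_convert(commit: GitCommit) -> SvnRevision:
    """Map one commit to one revision, recording what had to be discarded."""
    dropped = []
    if commit.author != commit.committer:
        dropped.append("author field")
    if "@" in commit.committer:
        dropped.append("committer domain")
    if len(commit.parents) > 1:
        dropped.append("merge parents")
    return SvnRevision(svn_author=local_part(commit.committer),
                       log=commit.message,
                       dropped=dropped)
```

For a merge commit authored by one person and committed by another, all three entries end up in `dropped`, and only the committer's bare user name and the log message survive.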

So, if I were to support writing svn dumpfiles, it would throw away so
much information from import streams that the result would be pathetic.
Functionally, the worst loss would be real branch merges. That is
a showstopper, right there.

There's only one use case for which the capability to write svn repos from
reposurgeon would make sense, and it's not conversions. It's
Subversion-to-Subversion repository editing. Which does tempt me a

		Eric S. Raymond
Received on 2011-12-13 22:25:06 CET

This is an archived mail posted to the Subversion Dev mailing list.