
Case study: Mono switches to Subversion

From: Ben Collins-Sussman <sussman_at_collab.net>
Date: 2004-11-17 18:16:38 CET

Case Study: the Mono project adopts Subversion
===============================================

This month, the Mono project switched to Subversion. It's an
important project, because (other than the ASF and Samba) it's
probably the biggest, highest-profile project to switch so far. My
sources tell me that Mono is also being viewed as a guinea pig for a
possible switchover of the entire GNOME project. So my eyes are wide
open.

Unfortunately, it seems like very few Mono folks had any experience
with Subversion at all, and very little reading or planning was done
in preparation. (Some Mono developers confirmed this.) The migration
was botched a couple of times, and there are some complaints now that
they've made the switch. I've been helping them quite a bit the last
couple of weeks.

I'm writing this case study because I want our own community to be
aware of the problems they had (and are having), so we can improve
Subversion, and be better prepared to help GNOME or other large
projects make the switch.

Why did they want to switch?
----------------------------

I'm told that the main attraction for Mono folks was offline diffing
and reverting, with some possible interest in rename support. This
project seems interested in working offline as much as possible.

Ultimately, they decided on an FSFS repository, and also to use
svn+ssh:// to fit in with their existing ssh infrastructure. They
also plan to use one of the many available mirror scripts to create an
anonymous read-only repository on a separate box, available via
svn://.

Migration problems
------------------

Apparently Mono has been fighting EOL-style problems for years in CVS.
They tell me that every once in a while, a win32 developer will
accidentally check in CRLF line endings. But apparently this isn't
the problem we've all heard of: it's not the entire file being
flip-flopped from LF to CRLF, but sometimes it's just selected *hunks*
of the file being changed to CRLF. The developer is spanked, but it's
not clear to me how (if at all) they remedied the problem in CVS when
it happened.

It's a bit of a mystery to me, since I thought that CVS effectively
treated all files as eol-style=native by default. I don't know how or
why they had this problem at all, either on entire files or on chunks
of files. (?)

In any case, their first migration attempt was a no-go, because

     - they ran a dos2unix eol script over all their RCS files
         (thinking this would help the flip-flop problem)

     - they used an old cvs2svn script, which didn't set svn:eol-style
       on anything.

This resulted in a lot of strange, spurious working-copy diffs.

So I asked Miguel to re-do the conversion, and this time he used
cvs2svn 1.1 on the original RCS files, and things came out better.
Instead of getting weird diffs on everything, they now get weird
diffs on only a few files. These are the ones with CRLF 'chunks'
buried in them.

I suggested that they fix the problem by just correcting the chunks
and committing; but they don't want to do this, because it will "mess
up the ability to annotate." Apparently the idea of running 'svn
blame' twice instead of once is far, far worse than seeing spurious
diffs. This is probably because of their intense dependence on 'svn
blame' (see next section).

Still, there's a meta-problem here with cvs2svn:

When you set 'svn:eol-style=native' in an svn working copy and commit,
svn will -refuse- the commit if the file has mixed EOLs. That's
because in order for the 'native' feature to function properly, the
repository (and text-base) file must have LF line endings. So it
makes sense that the svn client enforces this.

But cvs2svn does no such enforcement. It happily sets the 'native'
property on anything without a -kb flag, without checking the contents
of the file. In theory, this is fine, because CVS is supposed to
store all non-kb files in LF. In practice, this obviously turns out
not to be fine. Apparently mixed line-endings are able to creep into
the RCS file anyway... somehow. And this problem isn't unique to
Mono: a converted ASF project just reported this same problem on our
users@ list yesterday.

In any case, this might be something worth fixing in cvs2svn.
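
As a rough illustration of the kind of check described above, here is
a minimal Python sketch that detects mixed line endings in a file's
contents before deciding whether 'svn:eol-style=native' is safe to
set. The function names (classify_eols, has_mixed_eols) are
hypothetical and are not part of cvs2svn; how such a check would be
wired into the converter is left open.

```python
import re

# Hypothetical helper names; nothing here is taken from cvs2svn itself.
_EOL_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def classify_eols(data: bytes) -> set:
    """Return the set of line-ending styles that occur in data."""
    styles = set()
    # CRLF is listed first so its CR and LF are not counted separately.
    for match in re.finditer(rb"\r\n|\r|\n", data):
        styles.add(_EOL_NAMES[match.group()])
    return styles


def has_mixed_eols(data: bytes) -> bool:
    """True if more than one line-ending style appears in data."""
    return len(classify_eols(data)) > 1
```

A converter could skip the 'native' property (or at least warn)
whenever has_mixed_eols() returns True for a non-binary file.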

A particular Mono developer and #svn person are currently working on a
script to run over the 4.7GB dumpfile and 'fix' all mixed line-endings
in non-binary files. Once they've gotten the bugs out of the script,
they'll probably re-do the conversion in a month. This solution is
acceptable, I'm told, because it fixes the spurious diffs throughout
all history, thus not messing up 'svn blame'.
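
I haven't seen their script, but the core normalization step might
look something like the Python sketch below. It only rewrites the
bytes of one file's contents; the function name normalize_to_lf is
made up for illustration. A real dumpfile filter would also have to
update the length and checksum headers of each node record, which
this sketch does not attempt.

```python
def normalize_to_lf(data: bytes) -> bytes:
    """Rewrite every CRLF pair in data as a single LF.

    Lone CR bytes are left untouched; whether those should also be
    converted is a policy decision this sketch does not make.
    """
    return data.replace(b"\r\n", b"\n")
```

Running it twice gives the same result as running it once, so it is
safe to apply to files that are already clean.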

Development patterns and culture clash
--------------------------------------

It took me a long time -- and many discussions -- to figure out why
there were complaints after the switchover. For speed, let me jump to
the various results of these talks.

*** Annotation dependency

The Mono community has come to rely *extremely* heavily on the 'cvs
annotate' command. Multiple developers tell me that the command is
run many times per day. And a lot of folks are unhappy with how much
slower 'svn blame' is, compared to 'cvs annotate'.

I was shocked at this idea at first -- I've run 'svn blame' maybe,
what? Three times in three years? One conversation went something
like this:

    Mono: "So, when you run across a line or chunk of code that
            doesn't make any sense, or seems wrong, what do you do?"

      Me: "What do *you* do?"

    Mono: "I run 'annotate', see who wrote the line and when, then run
           'log' and look at the commit message."

      Me: "You do this how often?"

    Mono: "Multiple times per day. What do you do?"

      Me: "This sounds strange, but I almost never see a chunk of code
            and wonder why it's there. It's a rare event. When it
            does happen, I ask my officemates about it, or the dev@
            list. Then I might run 'svn log' on the file and grep for
            the function name, to look at recent commits that might
            have changed the logic."

*** Cultures at different scales

The thing is, my response above is the result of working in a small
community where everyone knows all the code, all the committers, and
notices all the incoming changes. But it doesn't scale up to gigantic
projects like Mono or GNOME.

The Mono project is very different from our own project. We have
about 15 active developers, they have ~400 (and GNOME has ~500).
We've had perhaps 2 or 3 people leave the project, they've had dozens
and dozens. Their codebase is much, much larger than ours.

Yes, the ASF has switched many projects over -- but all of them
(including APR and httpd) still tend to be very small, focused groups
of developers, just like our own project.

Because of Mono's size, they have a very different development
culture. People aren't able to notice and understand every commit.
People join and leave the project frequently, so a significant amount
of code is written by people who aren't on the dev@ list. Ergo,
they're constantly in the position of needing to figure out "what
happened" in the code.

My prediction is that if we want Subversion to succeed with GNOME,
we'll need to figure out a way to make it much faster. A friend of
mine who is a core GNOME developer concurs.

In svn 1.0, 'blame' was about 100x slower than CVS. In svn 1.1, it's
about 10x slower. I really can't think of any way to make it faster,
other than doing what CVS does: keeping a cache of contextual diffs on
the server, so the server can instantly generate annotation. We
should probably start a different thread on this, to discuss (A) if we
want to open an issue for this, (B) if/how we want to prioritize this
enhancement at all.

*** The Changelog file

It's really common for a CVS project to keep a 'Changelog' file in the
repository. It's also common to mandate that it be prepended-to for
every commit, which is what both Mono and GNOME do. I believe that
the official Mono decree is that every developer *must* run

       cvs commit -m "`head -nX Changelog`"

At one point, I tried to explain that because Subversion has grouped
commits ("changesets" more or less, for some definition of the term),
'svn log' obviates the need for a manually-maintained Changelog.

But the two lead developers absolutely refuse to give it up; the
critical issue, I'm told, is that they want the whole Changelog to be
available offline at all times.

It's not clear to me if all the developers feel this way, and if not,
it seems odd that two developers would force 400 other people to do
extra work maintaining duplicate data in a file -- just so that they
don't have to remember to run 'svn log > Changelog' before they unplug
their network. (They already run 'svn up' before disconnecting,
right?)

Again, this may be another clash of cultures: if they're constantly in
the business of trying to figure out "what happened" in the code, then
having all log history available at all times (with minimal effort) is
critical.

The response from one of the two developers is: "until Subversion
learns to cache the whole log-history in a working copy, we're not
changing our policy."

I don't particularly view this request as a problem with Subversion; I
can't imagine how to design such a feature, or why we'd even bother.
At some point, you may as well just be using a decentralized VC
system. I suspect that this one developer won't be happy until he
starts using arch or svk (he's definitely hinted at wanting
distributed repositories).

So, no action item here -- just a mental note for future projects.
You may run into resistance trying to persuade them to drop
'Changelog'... especially really huge projects.

*** The stubborn developer

The only other "unhappiness" resulting from the conversion is that one
of the lead Mono developers refuses to change his habits.

Whenever he fixes a bug that affects N files, he does N separate
commits (or actually, a separate commit in each directory... I'm not
entirely clear on this). But the upshot is that each file essentially
gets its own log message.

Other Mono devs asked him to start doing single commits, since this
means whole changesets will have a single revision name. But his
response was that such behavior "makes the output of 'svn log foo.c'
unreadable." When he runs log on foo.c, he only wants to see
comments about foo.c and absolutely nothing else.

I sent a diplomatic response to the Mono list, trying to explain that
the extra information wasn't "noise", but rather information he'd
eventually have to fetch anyway, once he identified the interesting
changeset. I tried to explain all the goodness that comes from having
a single revision number: how easy it is to port changes, close bugs,
revert changes, etc. But he doesn't want to budge.

I asked: when a developer needs to revert or port a change, what's the
procedure? How does he know what to port? His answer was that (1)
usually the same developer who writes the change does the port, so he
already knows which files are grouped together, or (2) people read the
commit-mail archives to figure out the grouping.

So this fellow is effectively preventing the other 400 developers from
getting any benefits from Subversion's changesets, because he refuses
to 'grep' or do an incremental search for symbols in his editor.
(Ironically, the Mono project uses the exact same log-message format
that we do, so it's incredibly easy to find symbols.) In any case,
it's his own choice, and it's a political fight within the Mono
community that I don't want to get into. They need to settle it
themselves. (I don't mean to defame this developer by the way; he's
a very smart, very nice guy. I'm just in no position to lecture him
on SCM best practices!)

Last I checked, someone wrote a C# program to parse 'svn log --xml'
and show only log message info about a single symbol. Maybe that will
be the tool they need to start doing single commits.
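
I haven't looked at that C# program, but the general idea can be
sketched in a few lines of Python. The sketch below assumes the usual
'svn log --xml' layout (a <log> root holding <logentry> elements with
a revision attribute and <author> and <msg> children) and simply keeps
the entries whose message mentions a given symbol. The function name
entries_mentioning is hypothetical, and the real tool may well do
something smarter, such as trimming each message down to the part
about that symbol.

```python
import xml.etree.ElementTree as ET


def entries_mentioning(log_xml: str, symbol: str) -> list:
    """Return (revision, author, message) for entries mentioning symbol.

    log_xml is the text produced by 'svn log --xml'; the match is a
    plain substring test on the log message.
    """
    root = ET.fromstring(log_xml)
    hits = []
    for entry in root.iter("logentry"):
        msg = entry.findtext("msg", default="") or ""
        if symbol in msg:
            author = entry.findtext("author", default="") or ""
            hits.append((entry.get("revision"), author, msg))
    return hits
```

With log messages in the "* file.c (symbol): ..." style, calling
entries_mentioning(xml_text, "parse_header") would list only the
commits that touched that function, whichever files they grouped
together.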

The moral of the story here? There are people out there who refuse to
leave the "every file is an island" universe. Maybe it's a holdover
from RCS days, I'm not sure. Be prepared to meet such people.

Conclusions
-----------

1. No project should ever jump into a new version control system
     without experimenting with it first, or assessing the general
     impact it will have on development policies. Switching VC systems
     is never "just" about learning the syntax of a new program -- it
     always involves re-evaluating and re-creating all of your
     project's procedures. Do the research, and assess the impact
     before jumping off the cliff.

2. For Mono in particular, it's not clear to me that there's been any
     net benefit in switching to Subversion. Of the two commands they
     use constantly, one is much faster (diff), and one is much slower
     (blame). They're not getting any benefits from global revision
     numbers. They're still dup'ing data into a Changelog file. And
     it's not clear if copies/renames are things they care about. I
     guess we'd have to interview them to find out what they think.

3. We might want to have cvs2svn verify line endings before setting
     svn:eol-style=native.

4. In general, I think our own view of version control has become a
     bit skewed toward small communities. If we want projects like
     GNOME to switch, we may have to invest some time in
     redesigning our 'svn blame'.

Received on Wed Nov 17 18:17:08 2004

This is an archived mail posted to the Subversion Dev mailing list.
