
Speeding up cvs2svn (was Re: cvs2svn takes very long time to execute (days!))

From: Roland Dreier <roland_at_topspin.com>
Date: 2004-02-15 20:08:57 CET

Hi Martin,

I've converted an even bigger repository than yours (it created a 10GB
Subversion dump file with something like 25K revisions). I had to do
several things to get cvs2svn.py to finish in a reasonable time (my
original run looked like it would take more than a month!).

I've been meaning to send this to the svn list for a while, since it
may prove useful to others. I'd also be interested in suggestions for
further improvements.

One warning: almost everything I did is fairly Linux-specific, so it
may not help if you are running some other operating system. Also,
having a lot of RAM will help performance.

I started by looking at what was taking all the time while running
cvs2svn.py. It seemed that all the time was spent accessing the
various .db files, so I first tried creating a tmpfs filesystem. You
can use a command like

    mount -t tmpfs -o size=3000M,mode=777 tmpfs /svntmp

to create the filesystem, and then run cvs2svn.py with /svntmp as its
working directory. You will need enough RAM so that the tmpfs doesn't
start swapping (or else you're back to the same old bad performance).
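
Before starting a long run it may be worth confirming that the working
directory really is on tmpfs. The following is a minimal Python 3 sketch,
not part of cvs2svn.py; the function name filesystem_type is made up for
illustration, and it assumes the usual Linux /proc/mounts layout of
"device mountpoint fstype options ...".

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path.

    mounts_text is the content of /proc/mounts (assumed layout:
    device, mount point, filesystem type, options, ...).
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so /svntmp does not match /svntmp2.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # >= lets a later mount on the same point shadow an earlier one.
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type
```

To use it, read /proc/mounts into a string and check that
filesystem_type("/svntmp", text) returns "tmpfs".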

However, I found that this didn't help as much as I thought it would.
Surprisingly, cvs2svn.py was still only using a small fraction of the
CPU, even though it should have been almost CPU-bound (since nearly
all of the IO would be done to the tmpfs, which is in RAM). Using
strace, I found that cvs2svn.py was spending a lot of time sleeping
for 1 second using select().
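
To get a rough count of these sleeps from a saved strace log, something
like the following Python 3 sketch could be used. It is an illustration
only: the helper name is hypothetical, and the two line formats it looks
for ("{1, 0}" and "{tv_sec=1, tv_usec=0}") are assumptions about how
strace prints the select() timeout, which varies between strace versions.

```python
import re

# Matches a select() call with no file descriptors and a 1 second timeout.
# Both timeval spellings are assumptions about strace's output format.
SELECT_SLEEP = re.compile(
    r"select\(0, NULL, NULL, NULL, \{(?:tv_sec=)?1, (?:tv_usec=)?0\}\)"
)


def count_one_second_sleeps(strace_lines):
    """Count lines that look like a pure 1 second select() sleep."""
    return sum(1 for line in strace_lines if SELECT_SLEEP.search(line))
```

Since each match is a full second of idle time, the count is also a rough
estimate of the seconds lost to sleeping over the traced interval.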

I tracked this down to __memp_alloc() in the db library; when low on
memory, that function sleeps for a second to let other threads free
memory. (This sleep is present in at least versions 4.1 and 4.2 of
the db library.) Of course this is completely useless in our case,
since cvs2svn.py is single-threaded. I wrote an LD_PRELOAD-able
library (attached to this email) that turns 1 second select() sleeps
into NOPs. Once you build select-preload.so, you can use it like
"LD_PRELOAD=<full path to select-preload.so> cvs2svn.py [cvs2svn.py
parameters]".

That helped a little, but it also made me investigate how much memory
the db library was using to cache the .db files. It turns out that db
files created using the Python anydbm module use a default cache size
that is far too small for the db files that cvs2svn.py creates for large
repositories (they may be 100s of MB). When I looked at the strace
output for cvs2svn.py, it seemed that the script was spending a lot of
time just shuffling blocks in and out of the db cache, without doing
much useful work.

Unfortunately there doesn't seem to be any way to specify a
non-default cache size using anydbm. Therefore I changed my copy of
cvs2svn.py to use bsddb -- in patch form:

    -import anydbm
    +import bsddb

I deleted the anydbm check (since I'm no longer using anydbm) and
added a cache_size variable:

    # Use 500MB db file caches (this uses a lot of RAM)
    cache_size = 500 * 1024 * 1024

I ran on a machine with 2GB of RAM. You may want to adjust cache_size
down if you have less. Then I changed all the anydbm.open() calls to
bsddb.hashopen() calls. For example:

    - self.nodes_db = anydbm.open(self.nodes_db_file, 'n')
    + self.nodes_db = bsddb.hashopen(self.nodes_db_file, 'n', 0666, 4096, 0, 1, cache_size)
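
If you would rather derive cache_size from the machine than hard-code it,
a sketch along these lines might do. It is written for Python 3 and is
not part of the patch above; the function names are made up, and the 25%
fraction is only an assumption taken from the 500MB-on-2GB ratio that
worked for me. The sysconf names used are available on Linux.

```python
import os

# Upper bound matching the 500MB value used in the patch above.
MAX_CACHE = 500 * 1024 * 1024


def pick_cache_size(total_ram_bytes, fraction=0.25, ceiling=MAX_CACHE):
    """Return a db cache size: a fraction of RAM, capped at ceiling.

    The default fraction is an assumption, not a tuned value.
    """
    return min(ceiling, int(total_ram_bytes * fraction))


def physical_ram_bytes():
    """Total physical RAM in bytes, as reported by sysconf on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

With these, cache_size = pick_cache_size(physical_ram_bytes()) gives
500MB on a 2GB machine and 256MB on a 1GB machine.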

With all of these changes, things went a lot faster and the whole
conversion took only a couple of days (instead of more than a month as
I initially thought).

As a side note, I'm not sure if it's really worth using the
abstraction of anydbm in the cvs2svn.py script. Subversion proper
pretty much requires a modern BSD db library to be installed, so I
don't think it's asking too much to require the bsddb3 python package
for cvs2svn.py.

There was one more improvement that I thought of but didn't get around
to coding or trying. Pass 2 (cleaning up the revision timestamps)
took a fairly long time, and the algorithm used seems suboptimal.
Perhaps some cvs2svn hackers can tell me if the following might work
(and run faster):

    - Add a new field to the pass 1 .revs output that has the resync
      (if any) for each file revision. Don't generate .resync.
    - In pass 2, sort the pass 1 output by log+author digest. Then
      go through the output, dealing with one log+author digest at a
      time, and update the time stamps.

This avoids having to create the whole resync hash table -- in my
case, the .resync file was 47 MB, and looking up commits in the hash
table seemed to take a long time.
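
To make the idea concrete, here is a rough Python 3 sketch of the
sort-then-group pass. It is not taken from cvs2svn.py: the record layout
(digest, timestamp, resync time or None, payload), the function name, and
the 5 minute window are all assumptions for illustration, and so is the
rule that a resync applies to every revision with the same digest whose
timestamp is within that window of the resynced one.

```python
from itertools import groupby
from operator import itemgetter

# Assumed window (seconds) within which revisions count as one commit.
COMMIT_THRESHOLD = 5 * 60


def resync_sorted(records, threshold=COMMIT_THRESHOLD):
    """Apply resyncs one log+author digest at a time.

    records: iterable of (digest, timestamp, resync_to_or_None, payload).
    Returns a list of (digest, timestamp, payload) sorted by digest.
    """
    out = []
    ordered = sorted(records, key=itemgetter(0, 1))
    for digest, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # (old_time, new_time) pairs recorded for this digest in pass 1.
        fixes = [(ts, new) for _, ts, new, _ in group if new is not None]
        for _, ts, _, payload in group:
            for old, new in fixes:
                if abs(ts - old) <= threshold:
                    ts = new  # move this revision to the resynced time
                    break
            out.append((digest, ts, payload))
    return out
```

This version sorts in memory for brevity; for a .revs file of real size
the sort would presumably be done on disk (for example with sort(1)), so
that only one digest's records need to be held at a time.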

Anyway, to summarize my findings on cvs2svn.py performance:

 1. Use tmpfs for a working directory to avoid disk IO.
 2. Use an LD_PRELOAD-ed library to get rid of useless sleeps in the
    BSD db library.
 3. Increase the db cache size to avoid shuffling blocks in and out of
    cache.
 4. It may be worth changing the pass 2 algorithm to increase
    performance.

I'd be very interested to hear any reactions to these ideas.

Best,
  Roland


Received on Sun Feb 15 21:35:17 2004

This is an archived mail posted to the Subversion Dev mailing list.
