
Re: Can Subversion work well with very large data sets?

From: Stefan Sperling <stsp_at_elego.de>
Date: Sat, 9 Oct 2010 23:58:06 +0200

On Sat, Oct 09, 2010 at 02:24:00PM -0700, Robert Rohde wrote:
> Hello,
>
> I am trying to identify a reasonable version control system for an
> unusual workflow. As SVN is a major player in this space, it is one
> of the systems that I want to consider but I've run into some
> problems. It is unclear to me whether my problems are specific to the
> SVN clients I have tested or whether they are a general consequence of
> the way SVN has been designed.
>
> I would appreciate feedback on whether there are ways to make SVN work
> more effectively for my project, or in the alternative whether there
> are other version control systems that might be more suitable.
>
> Workflow Specifications:
>
> * ~1 million files under version control ( > 99% are essentially text
> files, but a few are binary )
> * Average file size 50 kB, for a total archive of 50 GB.
> * Wide range of sizes, ~50% of files are less than 10 kB, but a couple
> are greater than 1 GB.
> * Most updates occur through a batch process that changes ~10% of
> files every two weeks. (Not the same 10% every time.)
> * Typically batch changes modify only a few percent of each file, so
> total difference during batch update is only ~200 MB.
>
> Other Requirements:
>
> * Must support random file / version access.
> * Clients must run on Windows and Linux / Mac
> * Must allow for web based repository viewing.
> * Highly desirable to allow for partial checkout of subdirectories.
>
>
> In my testing, SVN clients seem to behave badly when you throw very
> large numbers of files at them. TortoiseSVN, for example, can take
> hours for a relatively simple add operation on data samples that are
> only a fraction of the total intended size. Another of the SVN clients
> I tested (but won't bother naming) crashed outright when asked to work
> with 30000 files.
>
> Are there ways to use SVN in conjunction with very large data sets
> that would improve its performance? For example alternative clients
> that might be better optimized for this workflow? I'd even consider
> recompiling a client if there was a simple way to find significant
> improvements.
>
> My worry is that SVN may be designed in such a way that it is always
> going to perform poorly on a data set like mine. For example, by
> requiring lots of additional file i/o to maintain all its records. Is
> that the case? If so, I would appreciate any recommendations for
> other version control systems that might be better tailored to working
> with very large data sets.
>
> Thank you for your assistance.
>
> -Robert A. Rohde

It sounds like your main problem is local operations in the working
copy (i.e. disk i/o).
Could you evaluate Subversion again when we start issuing beta releases
of Subversion 1.7, probably at the end of this year?
We're trying to address the scalability problems of the 1.6 working copy
implementation in 1.7 -- the entire working copy handling code has been
rewritten for this and other purposes. Further performance improvements
are planned for 1.8 and later (some have already been implemented on a
branch that will be reviewed and merged after the 1.7 release).

For 1.6, you should really consider running the client on Linux.
Subversion has been reported to be up to 10 times faster on Linux
than on Windows.

In any case, the standard "svn" command line client is the best point
of reference when it comes to performance evaluations.
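
If you want a quick way to compare clients on your own data, timing the
plain "svn" client on a representative sample of your tree is usually
enough. Below is a minimal sketch of such a timing harness in Python;
the repository URL and working copy path are hypothetical placeholders,
and only the standard svn subcommands (checkout, status, add, update)
are assumed.

  #!/usr/bin/env python
  """Rough timing harness for the plain `svn` command line client.

  A minimal sketch: REPO_URL and WC_PATH are hypothetical placeholders.
  Point them at a representative sample of your own data set before
  drawing any conclusions.
  """
  import subprocess
  import time

  REPO_URL = "http://svn.example.org/repos/sample/trunk"  # hypothetical
  WC_PATH = "wc-benchmark"                                 # hypothetical

  def timed(*cmd):
      """Run a command and report how long it took."""
      start = time.time()
      subprocess.check_call(cmd)
      print("%-60s %8.1fs" % (" ".join(cmd), time.time() - start))

  # Fresh checkout of the sample tree.
  timed("svn", "checkout", REPO_URL, WC_PATH)

  # Local-only operation: this exercises working copy disk i/o,
  # which is where large file counts tend to hurt the most.
  timed("svn", "status", WC_PATH)

  # Schedule any unversioned files for addition (no-op if there are none).
  timed("svn", "add", "--force", "--quiet", WC_PATH)

  # Network round trip against the repository.
  timed("svn", "update", WC_PATH)

Running the same script on both Windows and Linux against the same
sample would also give you a concrete number for the platform
difference mentioned above.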

You should also try Mercurial if you haven't already. It should meet
your criteria at least as well as Subversion does.

Thanks,
Stefan
Received on 2010-10-09 23:59:03 CEST
