Fwd: Huge repository

From: Troy Curtis Jr <troycurtisjr_at_gmail.com>
Date: 2006-08-23 02:24:44 CEST

---------- Forwarded message ----------
From: Garrett Rooney <rooneg@electricjellyfish.net>
Date: Aug 21, 2006 10:47 PM
Subject: Re: Huge repository
To: Troy Curtis Jr <troycurtisjr@gmail.com>

On 8/21/06, Troy Curtis Jr <troycurtisjr@gmail.com> wrote:
> On 8/21/06, Garrett Rooney <rooneg@electricjellyfish.net> wrote:
> > On 8/21/06, Zsolt <zkoppanylist@intland.com> wrote:
> > > Hi
> > >
> > > One of our customers is considering using SVN to manage a large code base
> > > (180 GB) for a medium-sized team: few people with a lot of code, and many
> > > variants of the code for different "product" variants. The commercial CMS
> > > tool providers told us that this could be a problem for SVN.
> > >
> > > Is anybody using SVN with a >=180 GB repository? Can SVN handle this
> > > effectively for a team of 50 developers?
> > >
> > > Another aspect where he needs more confidence in SVN is the question of
> > > whether SVN supports a more stringent way to develop software. The current
> > > development process looks like this: small team, lots of code, many internal
> > > deliverables. Someone checks out a file, does his work, and checks it in
> > > again; the release is managed by labels. Another project can now use this
> > > release via a "use item". A typical question for him is: is everything that
> > > was in work already checked in?
> > >
> > > Can SVN manage such a process together with a large repository?
> >
> > There is no intrinsic reason why Subversion would not be able to
> > manage that large a repository, although I would recommend avoiding
> > doing an initial import of all that data all in one commit. If you're
> > converting from some other system via a conversion utility like
> > cvs2svn you should be fine, it's simply that a single revision that
> > modifies thousands and thousands of files can be a performance
> > bottleneck due to authorization issues (i.e., if you run log over that
> > revision you have to check if users aren't allowed to see any of the
> > paths in question before you can show them the log message). A very
> > large initial import also risks problems related to very large
> > revision files (on fsfs anyway), but if you have an operating system
> > that supports large files and a new enough version of APR you should
> > be fine there.
> >
> > -garrett
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
> > For additional commands, e-mail: users-help@subversion.tigris.org
> >
> >
>
> If a good portion of that repository size is the result of lots of
> commits, then I would highly recommend using the Berkeley Database
> backend. My project has a checkout size of about 150 MB with a 2 GB
> repository (converted from RCS to SVN via cvs2svn), and a checkout
> with an fsfs backend takes almost 6 minutes! Using bdb brought that down
> to 2.5 minutes. However, if your repository size is mostly due to
> having large files, then I would not expect it to be any worse than
> the commercial ones. (Note: The problem with using fsfs with lots of
> commits is that Subversion must traverse all the diffs back to the
> original version of the file, then apply all those diffs...for every
> file. It's time consuming. bdb rewrites the repository on commit to
> include a full copy of the latest version at the "top", with diffs
> going back into history.)
>

> Uhh, you probably meant to send that to the list, and FWIW you're
> wrong about FSFS having to apply lots of diffs. It's using skip
> deltas, so it's a log(N) number of diffs to apply, and that really isn't a
> bottleneck even on files with huge numbers of revisions.

> -garrett
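
[Editorial illustration of the skip-delta claim above. This is a minimal
sketch, assuming the base-selection rule described in Subversion's
skip-deltas design notes for FSFS: the delta base of a file's n-th change
is n with its lowest set bit cleared. The function names are hypothetical
and are not Subversion APIs.]

```python
def skip_delta_base(n: int) -> int:
    """Return the assumed delta base for the n-th change of a file.

    The base is n with its lowest set bit cleared (e.g. 54 -> 52).
    """
    if n <= 0:
        raise ValueError("change 0 is stored as fulltext and has no base")
    return n & (n - 1)


def delta_chain(n: int) -> list[int]:
    """List the changes visited while reconstructing change n.

    The chain ends at change 0, which this model stores as fulltext.
    """
    chain = [n]
    while n > 0:
        n = skip_delta_base(n)
        chain.append(n)
    return chain


if __name__ == "__main__":
    for n in (3, 54, 100, 1000):
        chain = delta_chain(n)
        print(n, chain, "deltas applied:", len(chain) - 1)
```

[Under this model the number of deltas applied equals the number of 1 bits
in n, so it grows with log(N) rather than with N.]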

Oops, replied only to garrett on this message.

People keep saying that fsfs does not take a performance hit
because of the skip deltas, but it *is* a performance hit. When you have
lots of revisions (of the files themselves, not just lots of repository
revisions) on lots of files, and you have to traverse all the
diffs for *that particular file* for each file, you will take a hit!
At least compared to Berkeley DB: actual usage on my repo is almost 6
minutes (5 min 50 sec) for an fsfs checkout and 2.5 minutes using bdb. I
would say doubling the time is a bottleneck!

Obviously if you have 30000 revs with 10000 files (3 revs each), then
I expect that the whole skip deltas business does not create a
bottleneck. But when you have 500-600 files with 60-100 revs apiece,
then going through all those diffs will cost you! Of course, in order
to have a complete HEAD version, BDB must rewrite the repository at every
commit, so you increase your commit time, and that could possibly be an
issue with large commits. But if your commits are typically
reasonably sized, and you have a lot of them over a long period of time
(large repo size and lots of revs), then BDB's performance profile is
better.
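
[Editorial illustration of the scenario above. This is a rough sketch that
only counts delta applications for a HEAD checkout under three simplified
storage models; it ignores I/O, caching, and network cost, so it does not
by itself explain the measured timings. The figures of 600 files and 100
changes per file are taken from the upper end of the ranges in the message.
The function name and model labels are hypothetical.]

```python
def checkout_cost(files: int, changes_per_file: int) -> dict[str, int]:
    """Count delta applications needed to reconstruct HEAD for every file.

    Each file is assumed to have one fulltext (change 0) followed by
    `changes_per_file` changes, with HEAD being the last change.
    """
    n = changes_per_file
    per_file = {
        # Every change is a delta against its predecessor, back to change 0.
        "linear": n,
        # Skip deltas: the base of change k is k with its lowest set bit
        # cleared, so the chain length is the number of 1 bits in n.
        "skip_delta": bin(n).count("1"),
        # HEAD itself is kept as fulltext; only older versions are deltas.
        "fulltext_head": 0,
    }
    return {model: files * cost for model, cost in per_file.items()}


if __name__ == "__main__":
    for model, total in checkout_cost(600, 100).items():
        print(f"{model}: {total} delta applications")
```

[In this model, skip deltas cut the work from 60000 to 1800 delta
applications, but a store that keeps HEAD as fulltext still does less.]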

Or it may be that my understanding of the fsfs diff storage and
retrieval is flawed. Regardless, using fsfs doubles my checkout time,
and you cannot tell me that it is not related to BDB's design of having
a full HEAD rev copy and fsfs's need to travel back through each
file's individual diffs to its original version! Right?

-- 
"Beware of spyware. If you can, use the Firefox browser." - USA Today
Download now at http://getfirefox.com
Registered Linux User #354814 ( http://counter.li.org/)

Received on Wed Aug 23 02:26:04 2006
