
Severe performance issues with large directories

From: Paul Holden <paul.holden_at_gmail.com>
Date: Fri, 9 Apr 2010 12:31:30 +0100

Hello,

I’ve had a look through the issue tracker and mailing list archives and
didn’t find any references to this issue. I also assume that this is a more
appropriate mailing list than 'users'.

We’ve noticed recently that we have terrible performance when updating a
particular directory in our repository. We’ve realised that the poor
performance is related to the fact that we have 5,800 or so files in a
single directory. (I know! This is far from ideal but we’re a long way into
development and reorganising the directory structure at this stage is very
difficult.)

To give some concrete numbers, we recently re-exported about 10,000 texture
files (averaging about 40KB each, or 390MB total) and 5,800 shaders (averaging
about 4KB each, or 22MB total). Both sets of files are gzip compressed. Here
are some approximate times for ‘svn up’:

Textures: 10,000 files, 390MB, ~4 minutes

Shaders: 5,800 files, 22MB, ~10 minutes

The key point here is that the textures are nicely distributed in a
well-organised directory structure, but the shaders are dumped into a single
directory.

The problem we face now is that we're iterating a lot on the engine, which
is causing us to rebuild the shaders every day.

To cut a long story short, I ran SysInternals procmon.exe while svn was
updating, and saw two alarming behaviours:

1) .svn\entries is being read in its entirety (in 4KB chunks) for *every*
file that’s updated in the directory. As the shaders dir contains so many
files, it’s approximately 1MB in size. That’s 5,800 reads of a 1MB file
(5.8GB in total) for a single update! I know this file is likely to be
cached by the OS, but that’s still a lot of unnecessary system calls and
memory being copied around. Please excuse my ignorance if there's a
compelling reason to re-read this file multiple times, but can't subversion
cache the contents of this file when it's updating the directory? Presumably
it's locked the directory at this point, so it can be confident that the
contents of this file won't be changed externally?
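
As a back-of-envelope illustration (plain C, just arithmetic, not Subversion
code), here's roughly what that redundant I/O adds up to under the numbers
above:

#include <stdio.h>

int main(void)
{
    const long long files = 5800;                /* shader files updated */
    const long long entries_bytes = 1024 * 1024; /* ~1MB .svn\entries */
    const long long chunk = 4096;                /* observed 4KB read size */

    printf("%lld MB re-read in %lld read calls\n",
           (files * entries_bytes) >> 20,
           files * (entries_bytes / chunk));
    return 0;
}

/* prints: 5800 MB re-read in 1484800 read calls */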

2) subversion appears to generate a temporary file in .svn\prop-base\ for
every file that's being updated. It's generating filenames sequentially,
which means that when 5,800 files are being updated it ends up doing this:

file_open tempfile.tmp? Already exists!

file_open tempfile.2.tmp? Already exists!

file_open tempfile.3.tmp? Already exists!

...some time later

file_open tempfile.5800.tmp? Yes!

For n files in a directory, that means subversion ends up doing (n^2 + n)/2
calls to file_open. In our case that means it's testing for file existence
16,822,900 times (!) in order to do a full update. Even with just 100 files
in a directory that's 5,050 tests.
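
To double-check that figure, here's a quick standalone calculation (again
plain C, nothing to do with Subversion's source) of the probe count implied
by the sequential naming:

#include <stdio.h>

int main(void)
{
    const long long counts[] = { 100, 5800 };
    for (int i = 0; i < 2; i++) {
        long long n = counts[i];
        /* the k-th file probes tempfile.tmp .. tempfile.k.tmp, i.e. k names,
           so the total is 1 + 2 + ... + n = (n^2 + n) / 2 */
        printf("n = %4lld: %lld file_open probes\n", n, (n * n + n) / 2);
    }
    return 0;
}

/* prints:
   n =  100: 5050 file_open probes
   n = 5800: 16822900 file_open probes */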

Is there any inherent reason these files need to be generated sequentially?
From reading the comments in 'svn_io_open_uniquely_named' it sounds like
these files are named sequentially for the benefit of people looking at
conflicts in their working directory. As these files are being generated
within the 'magic' .svn folder, is there any reason to number them
sequentially? Just calling rand() until there were no collisions would
probably give a huge increase in performance.
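
For the sake of argument, here's a rough sketch of what I mean, written in
plain C with rand() rather than Subversion's actual I/O layer, so treat the
names and details as hypothetical:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Try random suffixes until an unused name is found: expected O(1) probes
   per file instead of the O(n) you get from numbering sequentially. */
static FILE *open_unique_tmp(char *name, size_t len, const char *dir)
{
    for (int attempt = 0; attempt < 1000; attempt++) {
        snprintf(name, len, "%s/tempfile.%08x.tmp", dir, (unsigned)rand());
        FILE *fp = fopen(name, "wbx");  /* "x": fail if it already exists (C11) */
        if (fp)
            return fp;
    }
    return NULL;
}

int main(void)
{
    char name[512];
    srand((unsigned)time(NULL));
    FILE *fp = open_unique_tmp(name, sizeof(name), ".");
    if (fp) {
        printf("created %s\n", name);
        fclose(fp);
        remove(name);
    }
    return 0;
}

In the real code you'd presumably want something stronger than rand(), but
even that would get rid of the quadratic behaviour.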

I appreciate that we're probably an edge case with ~6000 files, but it seems
that issue 2) is a relatively straightforward change which would yield clear
benefits even for more sane repositories (and across all platforms too).

In case it's relevant, I'm using the CollabNet build of subversion on
Windows 7 64bit. Here's 'svn --version':

C:\dev\CW_br2>svn --version

svn, version 1.6.6 (r40053)

   compiled Oct 19 2009, 09:36:48

Thanks,

Paul
Received on 2010-04-09 13:32:04 CEST
