> -----Original Message-----
> From: Paul Holden [mailto:paul.holden_at_gmail.com]
> Sent: vrijdag 9 april 2010 13:32
> To: dev_at_subversion.apache.org
> Subject: Severe performance issues with large directories
> I've had a look through the issue tracker and mailing list archives and
> didn't find any references to this issue. I also assume that this is a more
> appropriate mailing list than 'users'.
I think you can find a lot of issues similar to yours in our issue tracker.
For Subversion 1.7 we are rewriting the entire working copy library to use a
database, which should resolve most of your issues (and which will allow us
to resolve more issues in future versions). The issues related to this
rewrite have the 'WC-NG' name somewhere.
> We've noticed recently that we have terrible performance when updating a
> particular directory in our repository. We've realised that the poor
> performance is related to the fact that we have 5,800 or so files in a
> single directory. (I know! This is far from ideal, but we're a long way
> into development and reorganising the directory structure at this stage is
> very difficult.)
That is certainly a number of files where the entries handling in the
current wc library will be slow. (Confirming your findings later in your
mail.)
> To give some concrete numbers, we recently re-exported about 10,000
> textures (averaging about 40KB each, or 390MB total) and 5,800 shaders
> (averaging about 4KB each, or 22MB total). Both sets of files are gzip
> compressed. Here are some approximate times for 'svn up':
> Textures: 10,000 files, 390MB, ~4 minutes
> Shaders: 5,800 files, 22MB, ~10 minutes
> The key point here is that the textures are nicely distributed in a
> well-organised directory structure, but the shaders are dumped into a
> single directory. The problem we face now is that we're iterating a lot on
> the engine, which is causing us to rebuild the shaders every day.
> To cut a long story short, I ran SysInternals procmon.exe while svn was
> updating, and saw two alarming behaviours:
> 1) .svn\entries is being read in its entirety (in 4kb chunks) for *every*
> file that's updated in the directory. As the shaders dir contains so many
> files, it's approximately 1MB in size. That's 5,800 reads of a 1MB file
> (5.8GB in total) for a single update! I know this file is likely to be
> cached by the OS, but that's still a lot of unnecessary system calls and
> memory being copied around. Please excuse my ignorance if there's a
> compelling reason to re-read this file multiple times, but can't svn
> cache the contents of this file when it's updating the directory? Surely
> it's locked the directory at this point, so it can be confident that the
> contents of this file won't be changed externally?
For WC-NG we move all the entries data into a single wc.db file in a .svn
directory below the root of your working copy. This database is accessed via
SQLite, so it doesn't need the chunked rewriting or anything like that. (It
even has in-memory caching and transaction handling, so we don't have to do
that in Subversion itself any more.)
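To illustrate the difference, here is a minimal Python sketch. The `nodes`
table and its columns are invented for this example and are not the real
wc.db schema; the point is only that many updates share one transaction and
lookups hit the database cache instead of re-parsing a 1MB text file:

```python
import sqlite3

# Hypothetical, simplified stand-in for wc.db -- the real WC-NG schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, revision INTEGER)")

# Record 5,800 entries in ONE transaction: a single commit, instead of
# rewriting a growing entries file once per updated file.
with conn:
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        ((f"shader{i}.fx", 1) for i in range(5800)),
    )

# A point lookup is an indexed query against cached pages.
rev, = conn.execute(
    "SELECT revision FROM nodes WHERE name = ?", ("shader42.fx",)
).fetchone()
print(rev)  # -> 1
```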
> 2) subversion appears to generate a temporary file in .svn\prop-base\ for
> every file that's being updated. It's generating filenames sequentially,
> which means that when 5,800 files are being updated it ends up doing this:
> file_open tempfile.tmp? Already exists!
> file_open tempfile.2.tmp? Already exists!
> file_open tempfile.3.tmp? Already exists!
> ...some time later
> file_open tempfile.5800.tmp? Yes!
Are you sure that this is in prop-base, not .svn/tmp?
For 1.7 we made the temp-filename generator better at guessing new names,
but for property handling we won't be using files in 1.7. (Looking at these
numbers and those that follow later in your mail, we might have to look into
porting some of this back to 1.6.)
Properties will be moved into wc.db, removing the file accesses completely.
(We can update them together with the node information in a single
transaction, without additional file accesses.)
> For N files in a directory, that means subversion ends up doing (n^2 + n)/2
> calls to file_open. In our case that means it's testing for file existence
> 16,822,900 times (!) in order to do a full update. Even with just 100 files
> in a directory that's 5,050 tests.
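The arithmetic checks out; a quick Python sketch:

```python
def open_attempts(n):
    # The i-th file (1-based) probes i names before finding a free one,
    # so the total is 1 + 2 + ... + n = (n^2 + n) / 2.
    return n * (n + 1) // 2

print(open_attempts(5800))  # -> 16822900
print(open_attempts(100))   # -> 5050
```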
> Is there any inherent reason these files need to be generated sequentially?
> From reading the comments in 'svn_io_open_uniquely_named' it sounds like
> these files are named sequentially for the benefit of people looking at
> conflicts in their working directory. As these files are being generated
> within the 'magic' .svn folder, is there any reason to number them
> sequentially? Just calling rand() until there were no collisions would
> probably give a huge increase in performance.
In 1.7 we have a new API that uses a smarter algorithm, but we can't add
public APIs to 1.6 now.
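For illustration only, a random-suffix scheme along the lines suggested
above could look like the Python below. This is not Subversion's actual
svn_io_open_uniquely_named implementation; `open_unique` is a made-up
helper:

```python
import os
import random
import string
import tempfile

def open_unique(dirpath):
    # Probe random 8-character names until an exclusive create succeeds.
    # The expected attempt count stays near 1 no matter how many temp
    # files already exist, versus O(n) probes for sequential numbering.
    while True:
        suffix = "".join(random.choices(string.ascii_lowercase, k=8))
        path = os.path.join(dirpath, "tempfile.%s.tmp" % suffix)
        try:
            # O_CREAT | O_EXCL makes test-and-create a single atomic step.
            return path, os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # rare collision: pick another random name

with tempfile.TemporaryDirectory() as d:
    path, fd = os.path.basename((p := open_unique(d))[0]), p[1]
    os.close(fd)
    print(path.startswith("tempfile."))  # -> True
```

Note that the atomic O_EXCL create also removes the separate
"does it exist?" check that procmon was showing.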
> I appreciate that we're probably an edge case with ~6,000 files, but it
> seems that issue 2) is a relatively straightforward change which would
> yield benefits even for more sane repositories (and across all platforms
> too).
> In case it's relevant, I'm using the CollabNet build of subversion on
> Windows 7 64bit. Here's 'svn --version':
> C:\dev\CW_br2>svn --version
This issue is actually worse on Windows than on Linux, because NTFS is a
fully transactional filesystem with more advanced locking handling, and
because of this it needs to do more work to open a file. (Some tests I
performed 1.5 years ago indicated that NTFS is more than 100 times slower
than the ext3 filesystem on Linux at handling extremely small files, while
throughput within a single file is not far apart.)
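A crude way to get a feel for that small-file overhead on your own machine
(illustrative Python; absolute numbers depend heavily on hardware,
filesystem and cache state, so treat the result as a rough indication only):

```python
import os
import tempfile
import time

def seconds_per_small_file(count=2000, size=100):
    # Create many tiny files and measure the average per-file cost; on
    # NTFS this is dominated by metadata and locking work, not data I/O.
    with tempfile.TemporaryDirectory() as d:
        start = time.perf_counter()
        for i in range(count):
            with open(os.path.join(d, "f%d" % i), "wb") as f:
                f.write(b"x" * size)
        return (time.perf_counter() - start) / count

print("%.1f us per small file" % (seconds_per_small_file() * 1e6))
```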
Received on 2010-04-09 14:27:39 CEST