On Fri 2004-04-16 at 16:02:16 +0100, you wrote:
> > Wow... 20 minutes to stat 100K files? I don't know what to say.
> > That's 83 file-stats per second, which doesn't seem unreasonable
> > to me.
Hm. Well, not to pick an argument, but to me that looks slow: although
that number is on the order of the number of seeks a disk is usually
capable of, my experience tells me to expect about 10 times as many,
roughly 1,000 seeks/sec (I think that's because not every read needs a
seek of its own).
A short test (time ls -lR /usr > /dev/null; on a freshly mounted ext3
partition) shows that my current system (Linux Athlon XP 2000, IDE
disks) manages ~2000 seeks, my 4 year old Athlon 500 makes ~1400.
Just to be sure, I verified that most of the time is indeed spent on
disk seeks (by running both ls and svn a second time: both then need
only about 1/30th of the time of the first run).
Considering that cached seeks are an order of magnitude faster,
stat'ing the same file several times (which Subversion does a lot,
too) shouldn't have much impact on the overall time.
Ah, okay, now I see it... if I add -a to ls, the time "svn st" and "ls
-lRa" need (on the same Subversion WC) is about the same, so the
additional time comes from accessing the .svn-area files. The rest can
probably be explained by actually reading the .svn/entries files and so
on. (By the way, this makes a factor of 4 for me, not 10, as I
initially expected above, so, Karl, you were as right as I was...
looks like Mark's disk is *that* slow ;-)
So the reason svn stat needs much more time than other programs to
stat one file is that it additionally performs operations on N files
in the .svn area (I'm not sure what N is exactly, but it looks like 3-4).
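A quick sanity check of that multiplier, using Mark's numbers from
above (N = 4 is my assumption, not a measured value):

```python
# Rough sanity check of the ~4x stat multiplier, using Mark's figures.
# stats_per_file = 4 is an assumption (svn touches ~3-4 .svn-area
# files per WC file).
files = 100_000
stats_per_file = 4
elapsed = 20 * 60                      # 20 minutes for "svn st", in seconds

raw_stat_rate = files * stats_per_file / elapsed
print(raw_stat_rate)                   # ~333 raw stats/sec on Mark's disk

# With only one stat per file, the same disk would need:
one_stat_time = files / raw_stat_rate
print(one_stat_time / 60)              # 5 minutes, as claimed below
```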
To conclude, the problem is not really this factor of 4 or so, because
even with only one stat per file the operation would still need 5
minutes on Mark's system. If we want to support huge WCs (say,
1,000,000 files), the method for determining changed files would need
to change, as even a state-of-the-art computer will need up to half an
hour to scan them. But for what Mark wants, a faster disk or a
(software) RAID, plus some memory for the clients, should do the trick
(see below; my workstation needs 15 secs for the recurrent calls in
Mark's use case).
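The half-hour figure follows from the same assumptions (4 stats per
file, and the ~2,000 cold-cache seeks/sec I measured above; neither
was measured on an actual WC that size):

```python
# Extrapolation to a 1,000,000-file WC. The seek rate and the 4x
# stat multiplier are the assumptions discussed above.
files = 1_000_000
stats_per_file = 4
seeks_per_sec = 2_000                  # cold-cache rate from my ls -lR test

scan_seconds = files * stats_per_file / seeks_per_sec
print(scan_seconds / 60)               # ~33 minutes, i.e. "up to half an hour"
```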
> But on the smaller directory, it's managing 315 or so a second on the
> same system...
Hm. Considering that stat'ing 140,000 files on my Linux machine uses
about 100MB for the file cache and such, maybe you are simply hitting
a memory limit (with your 128MB), so that not everything can be cached
and for some files the disk has to be accessed several times instead
of only once (and afterwards using the content of the filesystem
cache).
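A back-of-the-envelope check of that guess (the 100MB for 140,000
files is my measurement above; the per-file figure is just derived
from it):

```python
# Roughly how much cache memory one stat'ed file costs.
cache_bytes = 100 * 1024 * 1024        # ~100MB observed for the file cache
files = 140_000

per_file = cache_bytes / files
print(per_file)                        # ~750 bytes per cached file/inode
```

With 128MB total, little is left for the cache after the OS and
applications, so entries keep getting evicted and re-read from disk.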
> On the upside, I've just discovered that trying the same thing on a SCSI
> system (an old E220R, 450Mhz), I get 1155 file stats a second for the
> smaller directory (it completes in 3 seconds Instead of 11). So I think
> I've found my bottleneck :)
Yes, SCSI disks generally have faster seek times. RAID1 (Hardware or
Software) will also help seek times a lot. If my guess above is right,
having more memory on the client machines should also help.
Could you measure the time for repeated (immediate) calls of svn stat,
and the time ls -lR and ls -lRa, respectively, take on the same
directories? If repeated calls (of either one) don't take far less
time, I'd say you've hit a memory limit (for the filesystem cache).
> Luckily, we're getting a "proper" server soon.
The measurements you have done are client-specific (svn stat doesn't
contact the server the way you call it), so although it sounds like
you have done your benchmarks on the Subversion server, replacing the
server - although beneficial to other operations - won't affect the
time "svn stat" needs on a client's WC (well, except if your server is
also a file server exporting the WC via NFS or such... in that case,
ignore this comment).
> It just remains to be seen how this scales - this should manage the
> entire directory in 5 minutes.
With my computer (the Athlon XP, 1 year old), I'd expect about 3.5
minutes for the first run and 15 seconds for any following call (with
1GB, caching is no problem ;)
> I'll also take a look at the working practices etc. and see if we can
> just check parts of the repository out at a time...
Or at least, you could concentrate on using the commands on a sub-tree
of the WC, most of the time.
Received on Fri Apr 16 21:49:32 2004