Mike Brenner wrote:
>> ... NTFS is slow to open many small files ...
>
> NTFS also slows down as the number of files
> in one directory grows. Above a certain number
> of files, opening them on NTFS takes longer than
> it would take to compress and decompress them,
> according to timings from our lab.
Yes, but according to the sharding tests I ran with Malcolm Rowe's test
script when he first added sharding, the effect for straight file opens
doesn't occur until a much greater number of files exist (around 400K).
(A sketch of the two directory layouts being compared follows below.)
See attached mail.
-Nathan
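For reference, "sharding" here means splitting the revision files across
fixed-size subdirectories instead of keeping them all in one flat
directory. Below is a minimal Python sketch of the two path schemes being
compared; the 1000-files-per-shard size is an assumption for illustration,
not taken from Malcolm's script.

    import os

    FILES_PER_SHARD = 1000  # assumed shard size for this sketch

    def linear_path(root, rev):
        # Linear scheme: every revision file in one flat directory.
        return os.path.join(root, str(rev))

    def sharded_path(root, rev):
        # Sharded scheme: revisions grouped into subdirectories of
        # FILES_PER_SHARD files each, e.g. root/412/412345.
        return os.path.join(root, str(rev // FILES_PER_SHARD), str(rev))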
attached mail follows:
Malcolm Rowe wrote:
> On Tue, Mar 20, 2007 at 11:51:50AM -0400, Nathan Kidd wrote:
>> I suspect, of all these file systems, NTFS would have the most drastic
>> performance difference.
>>
>
> I was surprised not to see more difference between ext2/ext3 and the
> tree-based filesystems (ReiserFS, ext3+dir_index, etc.). Then
> again, it truly can't be worse than vfat!
Well, after spending the better part of the last 24 hours thrashing my
poor hard drive, I've got some results, though not very conclusive/happy
ones. I'm not posting this to the dev@ list simply because I'm not sure
what to make of it and suspect the test is somehow flawed.
At this point I don't have any idea what factors are involved in the
performance issue, since it doesn't seem to be strictly tied to a magic
number of files.
-Nathan
------------------------------------------
Notes:
1. Performance on NTFS is fine right up until around the ~400,000-file
mark. After that it plunges. (See raw numbers below.)
2. Once performance starts to drop (>400K files), the sharded scheme
actually performs *worse* than linear. This in particular makes me
wonder what other factors are coming into play here.
3. Sharding does make browsing via Explorer possible. I could view a
1-million-file sharded scheme without problems. A linear directory with
2.25 million files still had Explorer churning at max CPU 16 *CPU* hours
later (with no visible results; Explorer had to be terminated).
4. The performance drop point is hard to pinpoint. On all tests covering
a wide range (2 -> 1,000,000, incrementally), the drop happens in the
same range (between 200K and 500K files). However, when the test cycle
is modified to work in smaller increments, things start to get strange.
Three runs; each pair of columns gives the number of revisions and the
matching time in seconds:
(2 ** cycle)           start 300K, inc 25K     start 375K, inc 5K
   2048   13           325001   18.109         380001  33
   4096   14           350001   18.438         385001  33
   8192   13           375001   20.266         390001  28
  16384   13           400001   65.188 <-- ok  395001  29
  32768   14           425001  103.062         400001  28
  65536   15           450001  141.844         405001  26
 131072   15           475001  169.406         410001  22
 262144   18           500001  201.422         415001  19
 524288  195 <-- ok    525001  233.828         420001  17
1048576  829           550001  267.438         425001  19 <-- ???
In other words, with minute changes the effect isn't reproduced. The 3rd
column could be explained by a cache warming up, but that doesn't
explain the other columns.
5. All tests were done on Windows 2003: 7200 RPM IDE drive, NTFS, 60 GB
partition with ~8 GB free space and over 1 million files / 100K
directories (before running the tests).
6. I tried both the Python and C++ versions of the test (the latter
converted to pure time.h timing) and found that for < ~400K files C++
was very slightly faster. Going over the 400K limit, C++ performance
still nose-dived, but at a lower rate.
7. More complete test run output below, followed by a rough sketch of
the test loop.
Python runs:
Sharded Scheme          Linear Scheme
2 10.156 2 10.219
4 10.094 4 10.219
8 10.046 8 10.281
16 10.047 16 10.188
32 10.109 32 10.25
64 10.266 64 10.234
128 10.109 128 10.171
256 10.203 256 9.719
512 12.891 512 12.141
1024 13.844 1024 14.094
2048 14.062 2048 15.438
4096 14.375 4096 14.828
8192 14.36 8192 15.203
16384 15.188 16384 15.422
32768 18.5 32768 15.75
65536 15.906 65536 16.094
131072 15.735 131072 17.688
262144 17.328 262144 17.141
524288 475.312 524288 252.485
[patience ends] 750001 543.047
1050001 971.922
1400001 1190.047
1800001 1370.406
Python, initially loading up 300K revisions, then incrementing by 25K:
325001 18.109 325001 17.375
350001 18.438 350001 17.422
375001 20.266 375001 17.36
400001 65.188 400001 50
425001 103.062 425001 101.782
450001 141.844 450001 194.766
475001 169.406 475001 212.047
500001 201.422 500001 206.204
525001 233.828 525001 276.406
550001 267.438 550001 291.188
575001 297.781 575001 431.594
600001 387.875 600001 342.296
625001 372.11 625001 336.125
650001 396.891 650001 364.406
C++ runs:
Sharded Scheme Linear Scheme
2 10 2 9
4 11 4 9
8 11 8 9
16 10 16 9
32 10 32 9
64 10 64 8
128 10 128 9
256 11 256 9
512 13 512 12
1024 13 1024 12
2048 13 2048 14
4096 14 4096 13
8192 13 8192 14
16384 13 16384 14
32768 14 32768 15
65536 15 65536 17
131072 15 131072 17
262144 18 262144 18
524288 195 524288 156
1048576 829 1048576 649
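For reference, here is a hedged sketch (in modern Python 3, not the
original 2007 script) of the kind of benchmark loop described in the
notes above: populate a tree up to a starting count, then grow it in
increments, timing a fixed number of file opens after each step. All
names, parameters, and the shard size are illustrative assumptions;
Malcolm Rowe's actual script may differ.

    import os
    import random
    import time

    FILES_PER_SHARD = 1000  # assumed shard size, as in the sketch above

    def make_path(root, rev, sharded):
        # Sharded layout groups revisions into subdirectories;
        # linear layout keeps them all in one flat directory.
        if sharded:
            return os.path.join(root, str(rev // FILES_PER_SHARD), str(rev))
        return os.path.join(root, str(rev))

    def create_files(root, lo, hi, sharded):
        # Create one tiny file per "revision" in [lo, hi).
        for rev in range(lo, hi):
            path = make_path(root, rev, sharded)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'w') as f:
                f.write('x')

    def time_opens(root, count, sharded, samples=10000):
        # Time opening (and reading one byte from) a random sample
        # of the existing files.
        revs = [random.randrange(count) for _ in range(samples)]
        start = time.time()
        for rev in revs:
            with open(make_path(root, rev, sharded)) as f:
                f.read(1)
        return time.time() - start

    def run(root, sharded, start=300000, increment=25000, stop=650000):
        # Mirrors the "start 300K, inc 25K" runs: grow the tree
        # incrementally and time opens after each step.
        create_files(root, 0, start, sharded)
        count = start
        while count < stop:
            create_files(root, count, count + increment, sharded)
            count += increment
            print(count, round(time_opens(root, count, sharded), 3))

A call like run('C:\\shardtest', sharded=True) would print one
(file count, seconds) pair per increment, comparable to the tables
above.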