
Re: FSFS format 6

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 24 Jan 2011 03:12:01 +0100

On Wed, Dec 29, 2010 at 8:37 PM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
> On 29.12.2010 01:58, Johan Corveleyn wrote:
>> On Sun, Dec 12, 2010 at 4:23 PM, Stefan Fuhrmann
>> <stefanfuhrmann_at_alice-dsl.de>  wrote:
>>> On 19.10.2010 15:10, Daniel Shahaf wrote:
>>>> Greg Stein wrote on Tue, Oct 19, 2010 at 04:31:42 -0400:
>>>>> Personally, I see [FSv2] as a broad swath of API changes to align our
>>>>> needs with the underlying storage. Trowbridge noted that our current
>>>>> API makes it *really* difficult to implement an effective backend. I'd
>>>>> also like to see a backend that allows for parallel PUTs during the
>>>>> commit process. Hyrum sees FSv2 as some kind of super-key-value
>>>>> storage with layers on top, allowing for various types of high-scaling
>>>>> mechanisms.
>>>> At the retreat, stefan2 also had some thoughts about this...
>>> [This is just a brain-dump for 1.8+]
>>> While working on the performance branch I made some
>>> observations concerning the way FSFS organizes data
>>> and how that could be changed for reduced I/O overhead.
>>> notes/fsfs-improvements.txt contains a summary of what
>>> could be done to improve FSFS before FS-NG. A later
>>> FS-NG implementation should then still benefit from the
>>> improvements.
>> +(number of fopen calls during a log operation)
>> I like this proposal a lot. As I mentioned before, we are running
>> our FSFS back-end on a SAN over NFS (and I suspect we're not the only
>> company doing this). In this environment, the server-side I/O of SVN
>> (especially the amount of random reads and fopen calls during e.g.
>> log) is often the major bottleneck.
>> There is one question going around in my head though: won't you have
>> to change/rearrange a lot of the FS layer code (and maybe repos
>> layer?) to benefit from this new format?
> Maybe. But as far as I understand the current
> FSFS structure, data access is mainly chasing
> pointers, i.e. reading relative or absolute byte
> offsets and moving there for the next piece of
> information. If everything goes well, none of that
> code needs to change; the revision packing
> algorithm will simply produce different offset
> values.
>> The current code is written in a certain way, not particularly
>> optimized for this new format (I seem to remember "log" does around 10
>> fopen calls for every interesting rev file, each time reading a
>> different part of it). Also, if an operation currently needs to access
>> many revisions (like log or blame), it doesn't take advantage at all
>> of the fact that they might be in a single packed rev file. The pack
>> file is opened and seeked into just as many times as the individual
>> rev files would be in total.
> The fopen() calls should be eliminated by the
> file handle cache. IOW, they should already be
> addressed on the performance branch. Please
> let me know if that is not the case.

Ok, finally got around to verifying this.

You are completely correct: the performance branch avoids the vast
majority of the repeated fopen() calls. With a simple test (a test file with 3
revisions, executing "svn log" of it) (note: this is an unpacked 1.6

- trunk: opens each rev file between 19 and 21 times.

- performance branch: opens each rev file 2 times.

(I don't know why it's not simply 1 time, but ok, 2 times is already a
factor of 10 better than trunk :-)).

I tested this simply by adding one line of printf instrumentation
inside libsvn_subr/io.c#svn_io_file_open (see patch in attachment, as
well as the output for trunk and for perf-branch).
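
For readers without the attachment, the counting idea can be sketched
roughly as follows. This is not the attached patch (which is a printf
inside the C function svn_io_file_open); it is a small Python
illustration of the same technique. The names counting_open,
read_parts and demo_counting are made up for this sketch, and the
three offsets per file are an arbitrary assumption, not what "svn log"
actually reads.

```python
import collections
import os
import tempfile

# Hypothetical stand-in for the printf instrumentation: every open is
# tallied per file name instead of being printed.
open_counts = collections.Counter()


def counting_open(path, mode="rb"):
    """Tally the open, then delegate to the built-in open()."""
    open_counts[os.path.basename(path)] += 1
    return open(path, mode)


def read_parts(path, offsets, size=4):
    """Read several parts of one file, reopening it for every part.

    This mimics the access pattern being measured: one open per
    piece of information, each followed by a seek and a read.
    """
    parts = []
    for offset in offsets:
        with counting_open(path) as f:
            f.seek(offset)
            parts.append(f.read(size))
    return parts


def demo_counting():
    """Create three fake rev files and count the opens per file."""
    open_counts.clear()
    with tempfile.TemporaryDirectory() as tmp:
        for rev in range(3):
            path = os.path.join(tmp, str(rev))
            # Setup write uses the plain open(), so it is not counted.
            with open(path, "wb") as f:
                f.write(b"0123456789abcdef")
            read_parts(path, offsets=[0, 4, 8])
    return dict(open_counts)


if __name__ == "__main__":
    print(demo_counting())
```

Running it reports one count per fake rev file, which is the kind of
per-file tally the trunk and perf-branch numbers above come from.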

Now, if only that file-handle cache could be merged to trunk :-) ...
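
To illustrate why such a cache helps, here is a minimal sketch of the
general idea: keep a handle open per path and reuse it for later
reads. This is only an assumption about the principle, not the
performance branch's implementation; HandleCache, read_at and
demo_cache are hypothetical names, and a real cache would also need to
bound the number of open handles.

```python
import os
import tempfile


class HandleCache:
    """Keep opened files around so repeated accesses reuse the handle."""

    def __init__(self):
        self._handles = {}
        self.opens = 0  # number of real open() calls performed

    def get(self, path):
        """Return a cached handle, opening the file only on first use."""
        f = self._handles.get(path)
        if f is None:
            f = open(path, "rb")
            self._handles[path] = f
            self.opens += 1
        return f

    def read_at(self, path, offset, size):
        """Seek within the cached handle and read `size` bytes."""
        f = self.get(path)
        f.seek(offset)
        return f.read(size)

    def close(self):
        """Close all cached handles and empty the cache."""
        for f in self._handles.values():
            f.close()
        self._handles.clear()


def demo_cache():
    """Read three parts of one fake rev file through the cache."""
    cache = HandleCache()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "0")
        with open(path, "wb") as f:
            f.write(b"0123456789abcdef")
        parts = [cache.read_at(path, off, 4) for off in (0, 4, 8)]
        cache.close()  # release handles before the directory is removed
    return cache.opens, parts


if __name__ == "__main__":
    print(demo_cache())
```

With the cache, three reads of different parts of the same file cost a
single open, where the reopen-per-read pattern would cost three.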



Received on 2011-01-24 03:13:00 CET

This is an archived mail posted to the Subversion Dev mailing list.