Re: Efficient and effective fsync during commit

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Fri, 12 Jun 2015 14:11:13 +0200

On Fri, May 29, 2015 at 6:23 PM, Ivan Zhakov <ivan_at_visualsvn.com> wrote:

> On 29 May 2015 at 18:55, Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
> wrote:
> > On Fri, May 29, 2015 at 4:14 PM, Ivan Zhakov <ivan_at_visualsvn.com> wrote:
> >> On 28 May 2015 at 20:47, Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
> wrote:
> >>> Hi all,
> >>>
> >>> Most of us would agree that way we fsync FS changes
> >>> in FSFS and FSX is slow (~10 commits / sec on a SSD,
> >>> YMMV) and not even all changes are fully fsync'ed
> >>> (repo creation, upgrade).
> >>>
> >> The first question is it really a problem?
> >
> > Recently, we had customers wondering why their servers
> > wouldn't serve more than 20 commits/s (even on enterprise
> > SSDs and with various OS file system tuning options).
> > With QA bots constantly creating snapshots and tags,
> > there isn't too much head room anymore.
> >
> Ack. Was it Windows or Linux?
>

I do not know. My guess would be some Unix.

> >> I mean that usually commits
> >> are not that often. They are maintenance tasks like 'svnadmin load'
> >> that perform commits very often, but it could be fixed with
> >> '--fsfs-no-sync' option to 'svnadmin load' like we had for BDB.
> >
> > That would be a workable approach. Adding a bunch
> > of if statements.
> Something like 5-15..
>

Sure. It's not impossible to do but not exactly a
local change either. I'd rather centralize the fsync
management before adding switches to it.

> > It would not help for svnsync, though.
> Agree. We could expose this option in fsfs.conf, like we had in BDB
> but it would dangerous for production use even with properly power
> loss protected hardware.
>

It's hard to say where to put that option. Could be
per operation or per storage / repo. The latter implies
that a server config option would also be a possibility.

What we would ideally want is something that is not
even generally available, IIC: Flush cache buffers
but not the HW buffers. Some overhead but little added
latency and relying on battery-backed storage.

>>> From a high-level perspective, a commit is are simple
> >>> 3-step process:
> >>>
> >>> 1. Write rev contents & props to their final location.
> >>> They are not accessible until 'current' gets bumped.
> >>> Write a the new 'current' temporary contents to a temp file.
> >>> 2. Fsync everything we wrote in step 1 to disk.
> >>> Still not visible to any other FS API user.
> >>> 3. Atomically switch to the new 'current' contents and
> >>> fsync that change.
> >>>
> >>> Today, we fsync "locally" as part of whatever copy or
> >>> move operation we need. That is inefficient because
> >>> on POSIX platforms, we tend to fsync the same folder
> >>> more than once, on Windows we would need to sync
> >>> the files twice (content before and metadata after the
> >>> rename). Basically, missing the context info, we need
> >>> to play it safe and do much more work than we would
> >>> actually have to.
> >>>
> >>> In the future, we should implement step 1 as simple
> >>> non-fsync'ing file operations. Then explicitly sync every
> >>> file, and on POSIX the folders, once. Step 2 does not
> >>> have any atomicity requirements. Finally, do the 'current'
> >>> rename. This also only requires a single fsync b/c
> >>> the temp file will be in the same folder.
> >>>
> >>> On top of that, all operations in step 2 can be run
> >>> concurrently. I did that for FSX on Linux using aio_fsync
> >>> and it got 3x as fast. Windows can do something similar.
> >>> I wrapped that functionality into a "batch_fsync" object
> >>> with a few methods on it. You simply push paths into it,
> >>> it drops duplicates, and finally you ask it to fsync all.
> >>>
> >> I didn't find any documentation that calling FlushFileBuffers() on one
> >> handle flushes changes (data and metadata) made using other handle.
> >> I'm -1 to rely on this without official documentation proof. At least
> >> for FSFS.
> >
> > If you assume / suspect that FlushFileBuffers() only
> > operates on the open handle, i.e. only flushes those
> > changes made through that thandle, then you assume
> > that our commit process is seriously broken:
> >
> > For every PUT, we open the protorev file, append the
> > respective txdelta and close the file again. Since the
> > final flush uses yet another handle, this implies that
> > most of the revision data in each rev file does not get
> > fsync'ed and may be lost upon power failure.
> >
> > You might be right. So, if you care about repository
> > integrity, you should use your MSDN subscription and
> > ask MS for clarification on FlushFileBuffers() behaviour.
> You also may request MSDN subscription and ask MS for clarification or
> keep Windows code as it was before.
>

To be clear: You are proposing that the code on Windows
is fundamentally broken (revision contents not being
committed) while I think we "only" have a persistence
issue with renames. Since your business depends on
you being wrong, it would be in your best self-interest
to go and find out ...

Of course, I could apply for an MSDN subscription, wait
for it be approved etc. but I think it would be fairer if you
could check the Windows side of things while I try to get
some answers for POSIX.

> > Things we would like to know:
> >
> > * Does FlushFileBuffers() also flush changes made to
> > the same file through different handles? For simplification
> > we may assume those other handles got closed and
> > were owned by the same process.
> >
>
>
> > * Is calling FlushFileBuffers() on the target of a rename /
> > move sufficient to flush all metadata? Does it also
> > flush outstanding file content changes?
> Calling FlushFileBuffers() on target is not sufficient due the problem
> I described before [1]: metadata changes are journaled, while data
> changes seems are not. So you may get to race condition when move
> operation is recorded in journal, while new file content is not
> written to disk. On system restart journal will be applied, resulting
> empty/old content file in place. That's why source file should be
> flushed to disk before move operation.
>

That's all good and well but when will the metadata be flushed?

Also, we only need to be aware of that content / metadata race
for the "big switch-over" that e.g. makes the next revision
visible to all. Until then, we only care about whether all changes
have been written to to disk or not. If they haven't, we don't
care about specifics because nobody will read the partially
written data.

> * Is there a way to efficiently flush multiple files, e.g.
> > through something like overlapped I/O?
> >
> > * Does passing the FILE_FLAG_WRITE_THROUGH and
> > FILE_FLAG_NO_BUFFERING flags to CreateFile()
> > guarantee that all contents has been stored on disk
> > when CloseHandle() returns? (Assuming the HW does
> > not lie about its write buffers).
> >
> FILE_FLAG_NO_BUFFERING is not related to disk caching: it's disable
> file buffer it require caller to perform only cluster aligned
> operations [2]
>

The source you are citing contradicts you:

"For more information on how *FILE_FLAG_NO_BUFFERING*
interacts with other cache-related flags, see *CreateFile*
<https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858>."

This flag does switch OS-side caching on and off.

Buffering *is* caching in Windows. For some unknown reason,
they called it "buffers" but then had to come up with some
term describing the entirety of buffers, aka "cache".

> With FILE_FLAG_WRITE_THROUGH flag disk cache is not used at all. I.e.
> changes goes directly to hardware with special bit to skip internal HW
> cache.

Yes. This controls the HW-side caching (but can be
overruled by global settings).

> Nothing is flushed to disk when CloseHandle() returns in this
> case.
>

I only added the "CloseHandle" part because that it what
we ultimately care about. Before that, we don't need data
to be persistent. So, if there was some magic to be applied,
I wanted to give as much leverage as possible.

> Disclaimer: My understanding of the fsync behaviour
> > on Windows is based on conjecture, gathered from the
> > few pieces of information that I could find online. I'm
> > happy to change my mind once new evidence shows
> > up. Right now, our implementation seems to be wasteful
> > and possibly incomplete - which is worse. I would love
> > to fix both for 1.10.
> >
>
> [1] http://svn.haxx.se/dev/archive-2013-05/0245.shtml
> [2] https://msdn.microsoft.com/en-us/library/windows/desktop/cc644950
>
>
I've been thinking about how to implement a batch
fsync feature with few platform specifics and next
to no overhead. I'll try to get the first bits committed
to FSX over the weekend.

-- Stefan^2.
Received on 2015-06-12 14:11:25 CEST

This message: [ Message body ]
Next message: Philip Martin: "Re: Blame behaviour change in 1.9"
Previous message: Stefan Fuhrmann: "Re: svn commit: r1682265 - /subversion/trunk/subversion/libsvn_fs_fs/util.c"
Maybe in reply to: Daniel KlÃma: "Re: Efficient and effective fsync during commit"
Next in thread: Ivan Zhakov: "Re: Efficient and effective fsync during commit"
Reply: Ivan Zhakov: "Re: Efficient and effective fsync during commit"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]