Re: Experiments with FlushFileBuffers on Windows

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Sun, 28 Jun 2015 16:41:10 +0200

On Tue, Jun 23, 2015 at 5:09 PM, Ivan Zhakov <ivan_at_visualsvn.com> wrote:

> On 16 June 2015 at 22:57, Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
> wrote:
> > Hey there,
> >
> > One of the links recently provided by Daniel Klima pointed
> > to a way to enable write caching even on USB devices.
> > So, I could use my Windows installation for experiments now
> > without the risk of brick-ing 2 grand worth of disks by pulling
> > the plug tens of times.
> >
> >
> > TL;DR
> > =====
> > FlushFileBuffers operates on whole files, not just the parts
> > written through the respective handle. Not calling it after rename
> > results in potential data loss. Calling it after rename eliminates
> > the problem at least in most cases.
> >
> > Setup:
> > =====
> > I used the attached program to conduct 3 different experiments,
> > each trying a specific modification / fsync sequence. All would
> > write to a USB stick which had OS write cache enabled for it
> > in Windows 7.
> >
> > All tests run an unlimited number of iterations - until there is an
> > I/O error (e.g. caused by disconnecting the drive). For each
> > run, separate files and different file contents will be written
> > ("run number xyz", repeated many times). So, we can determine
> > which file contents are complete and correct and whether all files
> > are present. Each successful iteration is logged to the console.
> > We expect the data for all these to be complete.
> >
> > The stick got yanked out at a random point in time, reconnected
> > after about a minute, chkdsk /f run on it and then the program
> > output would be compared with the USB stick's content.
>
> I've tried to repeat your tests, but I failed to do that:
> 1. Your attached program is missing the scripts needed to perform real tests.
>

That source should work with any MBCS Win32 console application.
For your convenience, I now attached the full VS solution.
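
For readers without the attachment, one iteration of such an experiment
could look roughly like the sketch below. This is not the attached program:
it is a Python approximation, the names run_iteration and verify and the
file naming are made up for illustration, and it assumes that os.fsync()
ends up in FlushFileBuffers and os.replace() in MoveFileEx on Windows.

```python
import os


def expected_payload(run, repeats):
    """Content written for a given run: "run number N", repeated many times."""
    return (f"run number {run}\n" * repeats).encode("ascii")


def run_iteration(directory, run, fsync_after_rename=True, repeats=1000):
    """Write one run's file under a temporary name, flush it, rename it
    into place and optionally flush again after the rename."""
    tmp_path = os.path.join(directory, f"run-{run}.tmp")
    final_path = os.path.join(directory, f"run-{run}.txt")

    with open(tmp_path, "wb") as f:
        f.write(expected_payload(run, repeats))
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device

    os.replace(tmp_path, final_path)

    if fsync_after_rename:
        # Assumption: the flush after rename is issued on a fresh,
        # writable handle to the renamed file.
        with open(final_path, "r+b") as f:
            os.fsync(f.fileno())


def verify(directory, logged_runs, repeats=1000):
    """Return the runs that were logged as successful but whose file is
    missing or does not contain the complete expected content."""
    bad = []
    for run in logged_runs:
        path = os.path.join(directory, f"run-{run}.txt")
        try:
            with open(path, "rb") as f:
                if f.read() != expected_payload(run, repeats):
                    bad.append(run)
        except OSError:
            bad.append(run)
    return bad
```

The caller would log each run number only after run_iteration() returns,
pull the device, and later pass the logged numbers to verify().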

> 2. I don't have the same USB stick that you used in your tests :)
>

Well, any device that can be suddenly removed should do the trick
(USB, eSATA, something networked, ...). Even an internal disk will
do if you are willing to pull the plug. A VM might work as well if it
is being killed without informing the OS beforehand.

> Also I don't think that NTFS on removable flash USB drive could be
> used to simulate powerloss scenario on Windows: removable disks are
> not available during the system boot, so Windows cannot replay NTFS
> journal during startup.
>

Windows can replay the journal upon mounting the volume - just
like any other volume that was not active during system startup.
To make really sure that issues get fixed, I ran chkdsk on the
volume before examining it. Are you suggesting that only the
volumes present at boot time get additional checking?

Also, a journal can only replay what has been written to it. In that
respect it is no different from any other data on disk. For a rename
to be permanent, it has to be recorded on disk "somehow"
"somewhere" and that requires physical I/O. Unless rename
is very slow on spinning disks, no such I/O is happening.

The only half-way option that the OS has is to write the journal
entry directly into the disk cache (without flushing it). That's still
fast while being much safer than the OS cache. Virtualized disks
then behave like battery-backed disks and will not show data loss.

> Instead of this I tweaked 'repos-test 25' to emulate concurrent
> commits of 10kb files in 4 parallel threads (see attached very dirty
> patch).
>

That looks o.k. and should be able to reproduce issues. There are
two downsides compared to my simpler example code:

* The fsync after rename is only one of multiple fsync ops during
  a commit. Your chances of hitting it are 10..20% vs. 50 to ~100%
  in my setup. So, you may need 10s of runs for good confidence.

* fsyncs in other threads might trigger metadata and journal flushes
  that effectively act like an fsync after the previous rename.
  Without further analysis of what will be sync'ed when by Windows,
  one could expect to hit a critical situation in even fewer cases.

IOW, that setup is complex enough to expose all sorts of problems
if they exist. But it may greatly reduce the incident rate for the one
problem that we are investigating.
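
To put rough numbers on "10s of runs": the sketch below assumes that each
run independently hits the critical window with the stated probability and
that a hit always produces a visible loss. Both are simplifying assumptions,
and the function names are illustrative only.

```python
import math


def runs_needed(hit_rate, confidence=0.95):
    """Smallest number of runs n such that the chance of at least one
    hit, 1 - (1 - hit_rate)**n, reaches the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


def miss_probability(hit_rate, runs):
    """Chance that none of the given number of runs hits the window."""
    return (1.0 - hit_rate) ** runs


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.5):
        print(p, runs_needed(p), round(miss_probability(p, 10), 3))
```

Under these assumptions, a 10..20% hit rate needs about 14 to 29 runs for
95% confidence, and 10 runs would still miss the problem in roughly 11 to
35% of all attempts.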

> Then I've performed several tests on Windows Server 2012 R2 running
> VMWare Workstation 9 forcing power off after 300-400 commits. I've
> performed 10 test runs and never got repository corruption, even when I
> removed the FlushFileBuffers() call *after* rename.

I assume you only ran 'svnadmin verify' or something to that
effect. Did you then verify that the last reported HEAD revision
was not lost?

Given my commentary above, 10 runs may not be enough,
while there is no need to wait for 300..400 commits (in case
that takes considerable time in your test environment).

However, assuming that you actually compared expected
HEAD vs. reported HEAD, your test demonstrates that the
incident rate is low - far lower than what you would see with
no fsync at all.
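
The comparison of expected vs. reported HEAD could be scripted roughly as
follows. This is only a sketch: it assumes the test client records the last
revision whose commit was acknowledged before the power-off, and the
function names are hypothetical. 'svnlook youngest' is used to read the
HEAD revision after recovery.

```python
import subprocess


def reported_head(repo_path):
    """Ask svnlook for the youngest revision in the recovered repository."""
    result = subprocess.run(
        ["svnlook", "youngest", repo_path],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())


def lost_revisions(last_acknowledged, head_after_recovery):
    """Revisions that a client saw committed but that are gone after
    recovery. An empty list means no acknowledged commit was lost."""
    return list(range(head_after_recovery + 1, last_acknowledged + 1))
```

A run only counts as clean if lost_revisions() returns an empty list in
addition to the repository verifying successfully.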

> During the restart the
> OS may report that it recovered volume data, but after that the
> repository data remains in a consistent state. Removing the other
> FlushFileBuffers() calls results in repository corruption after two
> runs.
>

That demonstrates that fsync is at least a meaningful operation
in your test setup and that resetting the VM can make you lose
at least some data (despite virtualized HW and specific drivers
that might change guest-OS-side buffering).

> While I agree that passing MOVEFILE_COPY_ALLOWED to MoveFileEx() is a
> bug, calling FlushFileBuffers() is not necessary, at least in the case
> of NTFS on a permanently connected disk. I suppose this is because
> MoveFileEx() is already journaled, which means that the journal is
> flushed to disk before the operation completes. We could add the
> MOVEFILE_WRITE_THROUGH flag to make sure that this operation will be
> synced on other filesystems or network shares, although this requires
> more Windows-specific code.
>

Yes, that is a useful improvement.
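
A minimal sketch of what such a rename could look like, written in Python
with ctypes rather than as the actual Subversion / APR code. The helper
name durable_rename is made up. On Windows it calls MoveFileExW with
MOVEFILE_WRITE_THROUGH and without MOVEFILE_COPY_ALLOWED; elsewhere it
falls back to a rename followed by an fsync of the containing directory,
which is an assumption about what an equivalent would be on POSIX systems.

```python
import os
import sys

# Flag values as defined in the Windows SDK headers.
MOVEFILE_REPLACE_EXISTING = 0x1
MOVEFILE_WRITE_THROUGH = 0x8


def durable_rename(src, dst):
    """Rename src to dst and ask the OS to make the rename persistent
    before returning."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.MoveFileExW.argtypes = [
            wintypes.LPCWSTR, wintypes.LPCWSTR, wintypes.DWORD]
        kernel32.MoveFileExW.restype = wintypes.BOOL

        # MOVEFILE_COPY_ALLOWED is deliberately not set, so the call
        # fails instead of silently degrading to copy + delete.
        flags = MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH
        if not kernel32.MoveFileExW(src, dst, flags):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        os.replace(src, dst)
        # Flush the directory entry that now points at the new file.
        dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Whether the write-through flag alone is sufficient on every filesystem and
network share would still need to be confirmed by experiments like the
ones discussed in this thread.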

In the meantime, I'm halfway through eliminating the need for
the final move-into-place of 'current' to be persistent in FSX.
For revprops and rev data, that already isn't a problem anymore
because they get written / completed and fsynced in their
location.

-- Stefan^2.

application/zip attachment: FSyncExperiment.zip

Received on 2015-06-28 16:42:24 CEST
