Re: Fixing the FSFS corruption bug

From: Garrett Rooney <rooneg_at_electricjellyfish.net>
Date: 2006-09-07 15:42:29 CEST

On 9/7/06, Malcolm Rowe <malcolm-svn-dev@farside.org.uk> wrote:
> On Tue, Sep 05, 2006 at 09:29:27PM +0100, Malcolm Rowe wrote:
> > + err = svn_io_file_open(file, path_txn_proto_rev_lock(fs, txn_id, pool),
> > + APR_WRITE | APR_CREATE | APR_EXCL, APR_OS_DEFAULT,
> > + pool);
>
> So I've just noticed that this method of creating a lockfile (using
> O_CREAT|O_EXCL) appears to be broken on NFSv3 mounts on Linux pre-2.6.5
> (including all 2.4 kernels), as described in [1], and as far as I can
> tell, isn't supported on NFSv2 at all ([2] backs this up, to some extent).
>
> Now, we don't support FSFS-on-NFSv2 as far as I'm aware, but I'd quite
> like to support 2.4 kernels for just a while yet :-). Also, I've no idea
> whether the same behaviour might not exist in NFS clients for other OSs.
>
> So I guess I either need to duplicate the mutex-and-fcntl dance that
> the FSFS code does with the fs-wide write lock, and work out some way of
> stashing a per-transaction mutex somewhere, or alternatively look into
> playing the NFSv2 link() dance, which I think actually works everywhere.
>
> The former looks like it should work: I can create a hash of 'known'
> transactions in fs_serialized_init, located via an APR userdata pointer,
> and then stash a per-transaction mutex into that hash, deleted on
> transaction commit/purge. The mutex will 'leak' if the caller abandons
> the transaction without committing or purging, in just the same way that
> the transaction remains on disk; I think that this should be okay,
> since I don't expect that many callers abandon transactions without
> committing them - though I can envisage convoluted scenarios where it
> would be a problem.

I think this is a reasonable approach to fixing the NFS problem.
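
To make the shape of that concrete, here is a minimal sketch of a
per-transaction mutex registry. It is an illustration only: it uses plain
pthreads and a linked list instead of APR mutexes and an APR hash so that
it stands alone, every name in it (txn_registry, txn_try_lock, txn_forget,
and so on) is hypothetical rather than anything in the FSFS code, and it
covers only the in-process half of the dance (the fcntl lock needed for
cross-process exclusion is left out).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: one entry per known transaction. */
typedef struct txn_entry {
    char *txn_id;
    pthread_mutex_t lock;        /* guards this transaction's proto-rev file */
    struct txn_entry *next;
} txn_entry;

typedef struct txn_registry {
    pthread_mutex_t guard;       /* guards the list itself */
    txn_entry *head;
} txn_registry;

void txn_registry_init(txn_registry *reg)
{
    pthread_mutex_init(&reg->guard, NULL);
    reg->head = NULL;
}

/* Look up an entry by id. Caller must hold reg->guard. */
static txn_entry *find_entry(txn_registry *reg, const char *txn_id)
{
    txn_entry *e;
    for (e = reg->head; e != NULL; e = e->next)
        if (strcmp(e->txn_id, txn_id) == 0)
            return e;
    return NULL;
}

/* Try to take the lock for txn_id, creating its entry on first use.
   Returns 0 on success, EBUSY if it is already held, ENOMEM on failure. */
int txn_try_lock(txn_registry *reg, const char *txn_id)
{
    txn_entry *e;
    int rc;

    pthread_mutex_lock(&reg->guard);
    e = find_entry(reg, txn_id);
    if (e == NULL) {
        size_t len = strlen(txn_id) + 1;
        e = malloc(sizeof(*e));
        if (e != NULL && (e->txn_id = malloc(len)) == NULL) {
            free(e);
            e = NULL;
        }
        if (e == NULL) {
            pthread_mutex_unlock(&reg->guard);
            return ENOMEM;
        }
        memcpy(e->txn_id, txn_id, len);
        pthread_mutex_init(&e->lock, NULL);
        e->next = reg->head;
        reg->head = e;
    }
    rc = pthread_mutex_trylock(&e->lock);
    pthread_mutex_unlock(&reg->guard);
    return rc;
}

/* Release the lock for txn_id; the entry stays in the registry. */
int txn_unlock(txn_registry *reg, const char *txn_id)
{
    txn_entry *e;
    pthread_mutex_lock(&reg->guard);
    e = find_entry(reg, txn_id);
    pthread_mutex_unlock(&reg->guard);
    assert(e != NULL);           /* unlocking an unknown txn is a caller bug */
    return pthread_mutex_unlock(&e->lock);
}

/* Drop the entry on commit/purge. The lock must not be held.
   Returns 1 if an entry was removed, 0 if txn_id was not known. */
int txn_forget(txn_registry *reg, const char *txn_id)
{
    txn_entry **pp, *e = NULL;
    pthread_mutex_lock(&reg->guard);
    for (pp = &reg->head; *pp != NULL; pp = &(*pp)->next) {
        if (strcmp((*pp)->txn_id, txn_id) == 0) {
            e = *pp;
            *pp = e->next;
            break;
        }
    }
    pthread_mutex_unlock(&reg->guard);
    if (e == NULL)
        return 0;
    pthread_mutex_destroy(&e->lock);
    free(e->txn_id);
    free(e);
    return 1;
}

/* Number of entries still registered, including abandoned ones. */
size_t txn_registry_count(txn_registry *reg)
{
    size_t n = 0;
    txn_entry *e;
    pthread_mutex_lock(&reg->guard);
    for (e = reg->head; e != NULL; e = e->next)
        n++;
    pthread_mutex_unlock(&reg->guard);
    return n;
}
```

An entry for a transaction that is never passed to txn_forget() simply
stays in the list, which is the 'leak' described above. The sketch also
uses a try-lock so that a second writer gets EBUSY rather than blocking,
mirroring the fail-if-exists behaviour of the exclusive create in the
current patch; whether blocking would be preferable is a separate question.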

> I'd appreciate some feedback on the current patch's concept first,
> before I go down the route of changing the implementation to work on NFS.

Considering that your recipe manages to reproduce exactly the same
kind of corruption, it seems likely that it's the actual problem we're
seeing. I do wonder what the actual underlying cause might be though.
The only thing I can come up with is some sort of transient network
error, where neon thinks it successfully closed a connection, but
apache/mod_dav_svn doesn't, so it keeps things open, then when neon
creates a new connection apache still has that file open. I have no
idea if this is really feasible, but I'm also having trouble
envisioning other ways it could occur.

So, lacking better ideas, this does seem like a worthwhile route to
take. It avoids a known way to corrupt revfiles, and with any luck
it's actually the same way that's corrupting them in the wild.

-garrett

Received on Thu Sep 7 15:43:41 2006
