Re: FSFS commit failure should release txn proto-rev lock

From: Julian Foad <julianfoad_at_apache.org>
Date: Wed, 17 Jun 2020 16:27:20 +0100

CC'ing more possible experts Dmitry, Evgeny: any thoughts about whether
this change makes sense for FSFS commit?

- Julian

Julian Foad wrote:
> TL;DR: I propose a change to the FSFS commit-transaction function, to
> release the proto-rev write lock if an error occurs while it has this lock.
>
> The practical applications of this change are rather obscure, which
> perhaps explains why it has not been needed before. In particular, it
> apparently is not needed for the way the rest of standard Subversion
> drives FSFS, but may be needed for other users of FSFS. I have come
> across this case in WANdisco's replicator, but as there are other
> peculiarities in how that drives FSFS, let us not confuse the issue by
> looking too closely at it. It appears the issue would apply to other
> users of FSFS too.
>
> In the FSFS commit-transaction code path (in svn_fs_fs__commit) there is
> a region where it acquires an exclusive write lock on the prototype
> revision (proto-rev). There are cases where code in this region can
> fail, and there is no release of the lock in the error return path. That
> means if the calling process re-tries, the "writing" flag is still set
> in the transaction object in memory, and this causes an "already locked"
> error.
>
> In regular Subversion we abandon a transaction if it fails at this
> stage, and so never hit the problem. There are failure modes where a
> re-try could not succeed, notably after we move the proto-rev file into
> its final location, breaking the transaction; this case is called out in
> comments in the function and will remain after this change. Abandoning
> the transaction is a safe and effective way to use FSFS. However, other
> users of FSFS may prefer to re-try in certain other cases.
>
> The case WANdisco encountered is where some old repository corruption
> (SVN-4858) was detected and reported at some point in this code region.
> Although the commit would not be able to succeed, it was important to
> them that the same error should be reported again during a re-try, and
> what was observed instead was that the "already locked" error was thrown
> instead.
>
> I suppose disk being temporarily inaccessible is one class of error
> where a re-try might be successful.
>
> The attached test and patch demonstrate and fix the problem.
>
> This patch does not attempt to make it possible to re-try a failed
> commit in all cases. Some remaining cases are noted in the patch log
> message which is repeated here:
>
> [[[
> Roll back the transaction lock state so re-trying a failing commit
> results in the same error again instead of an "already locked" error.
>
> The problem was that when a commit returned a failure from within the
> code region where it held a transaction proto-rev lock, it did not
> unlock it. Although the FSFSWD replicator replaces the transaction
> files on disk, the lock status remained on the transaction object in
> memory and a subsequent retry then failed with "already locked" error
> instead of the same failure mode as the first attempt.
>
> The solution here is to reset the lock status of the in-memory
> transaction object before returning a commit failure.
>
> This implementation addresses cases where the commit fails and returns
> an error (e.g. due to detecting repository corruption at this point, as
> in the case reported in NV-7983), and the lock can be successfully
> unlocked using the regular unlock code path.
>
> Cases not addressed:
>
> * There are conceivable failure modes where this regular unlocking
> would not succeed, e.g. disk files becoming inaccessible, and this patch
> would not address those cases. These could perhaps be addressed by
> adding a lock clean-up function that ignores errors in clean-up, and
> using that instead of the regular unlocking code.
>
> * This implementation does not address cases where the process
> crashes in this code region. (In such cases the in-memory 'is writable'
> flag would not be preserved anyway so that is out of scope.)
>
> ### NOT YET TESTED:
> For FSFSWD, this implementation should also work where a failure occurs
> after moving the proto-rev file to the final revision-file location.
> FSFSWD re-copies the transaction directory before re-trying, and so this
> should succeed. For regular FSFS, it does not address this case: a
> re-try in this case will fail to open the transaction proto-rev file.
>
> ### This patch includes debugging code.
>
> * subversion/libsvn_fs_fs/transaction.c
> (TEST_COMMIT_FAIL): Debug code.
> (ERR_PROTO_REV_LOCKED): New macro.
> (commit_body): Use the new macro to handle errors in the region where
> a proto-rev lock is held, and unlock it in those cases.
>
> ]]]
>
> My question is, does this change (without the debugging code) make sense
> as an improvement to FSFS?
>
> - Julian
Received on 2020-06-17 17:27:22 CEST

This message: [ Message body ]
Next message: Daniel Shahaf: "Re: svn commit: r1878918 - /subversion/trunk/build.conf"
Previous message: Tobias Bading: "Re: "svn list -vH" broken for files >1M"
In reply to: Julian Foad: "FSFS commit failure should release txn proto-rev lock"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]