CC'ing more possible experts Dmitry, Evgeny: any thoughts about whether 
this change makes sense for FSFS commit?
- Julian
Julian Foad wrote:
> TL;DR: I propose a change to the FSFS commit-transaction function, to 
> release the proto-rev write lock if an error occurs while it has this lock.
> 
> The practical applications of this change are rather obscure, which 
> perhaps explains why it has not been needed before.  In particular, it 
> apparently is not needed for the way the rest of standard Subversion 
> drives FSFS, but may be needed for other users of FSFS.  I have come 
> across this case in WANdisco's replicator, but as there are other 
> peculiarities in how that drives FSFS, let us not confuse the issue by 
> looking too closely at it.  It appears the issue would apply to other 
> users of FSFS too.
> 
> In the FSFS commit-transaction code path (in svn_fs_fs__commit) there is 
> a region where it acquires an exclusive write lock on the prototype 
> revision (proto-rev).  There are cases where code in this region can 
> fail, and there is no release of the lock in the error return path. That 
> means if the calling process re-tries, the "writing" flag is still set 
> in the transaction object in memory, and this causes an "already locked" 
> error.
> 
> In regular Subversion we abandon a transaction if it fails at this 
> stage, and so never hit the problem.  There are failure modes where a 
> re-try could not succeed, notably after we move the proto-rev file into 
> its final location, breaking the transaction; this case is called out in 
> comments in the function and will remain after this change.  Abandoning 
> the transaction is a safe and effective way to use FSFS.  However, other 
> users of FSFS may prefer to re-try in certain other cases.
> 
> The case WANdisco encountered is where some old repository corruption 
> (SVN-4858) was detected and reported at some point in this code region. 
>   Although the commit would not be able to succeed, it was important to 
> them that the same error should be reported again during a re-try, and 
> what was observed instead was that the "already locked" error was thrown 
> instead.
> 
> I suppose disk being temporarily inaccessible is one class of error 
> where a re-try might be successful.
> 
> The attached test and patch demonstrate and fix the problem.
> 
> This patch does not attempt to make it possible to re-try a failed 
> commit in all cases.  Some remaining cases are noted in the patch log 
> message which is repeated here:
> 
> [[[
> Roll back the transaction lock state so re-trying a failing commit 
> results in the same error again instead of an "already locked" error.
> 
> The problem was that when a commit returned a failure from within the 
> code region where it held a transaction proto-rev lock, it did not 
> unlock it.  Although the FSFSWD replicator replaces the transaction 
> files on disk, the lock status remained on the transaction object in 
> memory and a subsequent retry then failed with "already locked" error 
> instead of the same failure mode as the first attempt.
> 
> The solution here is to reset the lock status of the in-memory 
> transaction object before returning a commit failure.
> 
> This implementation addresses cases where the commit fails and returns 
> an error (e.g. due to detecting repository corruption at this point, as 
> in the case reported in NV-7983), and the lock can be successfully 
> unlocked using the regular unlock code path.
> 
> Cases not addressed:
> 
>    * There are conceivable failure modes where this regular unlocking 
> would not succeed, e.g. disk files becoming inaccessible, and this patch 
> would not address those cases.  These could perhaps be addressed by 
> adding a lock clean-up function that ignores errors in clean-up, and 
> using that instead of the regular unlocking code.
> 
>    * This implementation does not address cases where the process 
> crashes in this code region.  (In such cases the in-memory 'is writable' 
> flag would not be preserved anyway so that is out of scope.)
> 
> ### NOT YET TESTED:
> For FSFSWD, this implementation should also work where a failure occurs 
> after moving the proto-rev file to the final revision-file location. 
> FSFSWD re-copies the transaction directory before re-trying, and so this 
> should succeed.  For regular FSFS, it does not address this case: a 
> re-try in this case will fail to open the transaction proto-rev file.
> 
> ### This patch includes debugging code.
> 
> * subversion/libsvn_fs_fs/transaction.c
>    (TEST_COMMIT_FAIL): Debug code.
>    (ERR_PROTO_REV_LOCKED): New macro.
>    (commit_body): Use the new macro to handle errors in the region where 
> a proto-rev lock is held, and unlock it in those cases.
> 
> ]]]
> 
> My question is, does this change (without the debugging code) make sense 
> as an improvement to FSFS?
> 
> - Julian
Received on 2020-06-17 17:27:22 CEST