[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: FSFS commit failure should release txn proto-rev lock

From: Julian Foad <julianfoad_at_apache.org>
Date: Tue, 28 Jul 2020 14:44:49 +0100

Ping: can anyone comment on this proposed patch?

- Julian

Julian Foad wrote:
> CC'ing more possible experts Dmitry, Evgeny: any thoughts about whether
> this change makes sense for FSFS commit?
>
> - Julian
>
>
> Julian Foad wrote:
>> TL;DR: I propose a change to the FSFS commit-transaction function, to
>> release the proto-rev write lock if an error occurs while it has this
>> lock.
>>
>> The practical applications of this change are rather obscure, which
>> perhaps explains why it has not been needed before.  In particular, it
>> apparently is not needed for the way the rest of standard Subversion
>> drives FSFS, but may be needed for other users of FSFS.  I have come
>> across this case in WANdisco's replicator, but as there are other
>> peculiarities in how that drives FSFS, let us not confuse the issue by
>> looking too closely at it.  It appears the issue would apply to other
>> users of FSFS too.
>>
>> In the FSFS commit-transaction code path (in svn_fs_fs__commit) there
>> is a region where it acquires an exclusive write lock on the prototype
>> revision (proto-rev).  There are cases where code in this region can
>> fail, and there is no release of the lock in the error return path.
>> That means if the calling process re-tries, the "writing" flag is
>> still set in the transaction object in memory, and this causes an
>> "already locked" error.
>>
>> In regular Subversion we abandon a transaction if it fails at this
>> stage, and so never hit the problem.  There are failure modes where a
>> re-try could not succeed, notably after we move the proto-rev file
>> into its final location, breaking the transaction; this case is called
>> out in comments in the function and will remain after this change.
>> Abandoning the transaction is a safe and effective way to use FSFS.
>> However, other users of FSFS may prefer to re-try in certain other cases.
>>
>> The case WANdisco encountered is where some old repository corruption
>> (SVN-4858) was detected and reported at some point in this code
>> region.   Although the commit would not be able to succeed, it was
>> important to them that the same error should be reported again during
>> a re-try, and what was observed instead was that the "already locked"
>> error was thrown instead.
>>
>> I suppose disk being temporarily inaccessible is one class of error
>> where a re-try might be successful.
>>
>> The attached test and patch demonstrate and fix the problem.
>>
>> This patch does not attempt to make it possible to re-try a failed
>> commit in all cases.  Some remaining cases are noted in the patch log
>> message which is repeated here:
>>
>> [[[
>> Roll back the transaction lock state so re-trying a failing commit
>> results in the same error again instead of an "already locked" error.
>>
>> The problem was that when a commit returned a failure from within the
>> code region where it held a transaction proto-rev lock, it did not
>> unlock it.  Although the FSFSWD replicator replaces the transaction
>> files on disk, the lock status remained on the transaction object in
>> memory and a subsequent retry then failed with "already locked" error
>> instead of the same failure mode as the first attempt.
>>
>> The solution here is to reset the lock status of the in-memory
>> transaction object before returning a commit failure.
>>
>> This implementation addresses cases where the commit fails and returns
>> an error (e.g. due to detecting repository corruption at this point,
>> as in the case reported in NV-7983), and the lock can be successfully
>> unlocked using the regular unlock code path.
>>
>> Cases not addressed:
>>
>>    * There are conceivable failure modes where this regular unlocking
>> would not succeed, e.g. disk files becoming inaccessible, and this
>> patch would not address those cases.  These could perhaps be addressed
>> by adding a lock clean-up function that ignores errors in clean-up,
>> and using that instead of the regular unlocking code.
>>
>>    * This implementation does not address cases where the process
>> crashes in this code region.  (In such cases the in-memory 'is
>> writable' flag would not be preserved anyway so that is out of scope.)
>>
>> ### NOT YET TESTED:
>> For FSFSWD, this implementation should also work where a failure
>> occurs after moving the proto-rev file to the final revision-file
>> location. FSFSWD re-copies the transaction directory before re-trying,
>> and so this should succeed.  For regular FSFS, it does not address
>> this case: a re-try in this case will fail to open the transaction
>> proto-rev file.
>>
>> ### This patch includes debugging code.
>>
>> * subversion/libsvn_fs_fs/transaction.c
>>    (TEST_COMMIT_FAIL): Debug code.
>>    (ERR_PROTO_REV_LOCKED): New macro.
>>    (commit_body): Use the new macro to handle errors in the region
>> where a proto-rev lock is held, and unlock it in those cases.
>>
>> ]]]
>>
>> My question is, does this change (without the debugging code) make
>> sense as an improvement to FSFS?
>>
>> - Julian
>
Received on 2020-07-28 15:44:56 CEST

This is an archived mail posted to the Subversion Dev mailing list.