[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Issue #4588, part 1: FSFS access error

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Wed, 26 Aug 2015 05:30:41 +0000

(abbreviating "instance ID" to "IID")

Stefan Fuhrmann wrote on Tue, Aug 25, 2015 at 15:13:04 +0100:
> On Tue, Aug 25, 2015 at 7:15 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name>
> wrote:
>
> > Evgeny Kotkov wrote on Tue, Aug 25, 2015 at 02:47:16 +0300:
> > > Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com> writes:
> > >
> > > > My current hypothesis is that the server did not get restarted after
> > > > replacing the repository. Because we decided not to make the instance
> > ID
> > > > part of the cache key, we could easily have picked up cached format 6
> > data
> > > > for the format 7 repository.
> > >
> > > [...]
> > >
> > > > That said, are we still happy with the decision to not make the
> > instance
> > > > ID part of the cache key? The rationale has basically been "fail early"
> > > > because failure to restart or reconfigure the server after the repo
> > files
> > > > got modified might lead to any kind of unknown problems (much) further
> > down
> > > > the road.
> > >
> > > As I see it, there are two separate problems:
> > >
> >
> > I see it the same way.
> >
> > > 1) First of all, I am thinking that we should indeed revisit the
> > decision of
> > > not making instance IDs a part of the cache keys. As far as I
> > recall, this
> > > part of the change was reverted after we agreed that failing fast is
> > better
> > > than narrowing the window when the cache issues exist [1] (there are
> > still
> > > scenarios like backing up and restoring the repository with cp -a).
> > >
> > > Now I am almost sure that we should redo this change. The experience
> > of
> > > receiving errors related to stale cache entries isn't exactly
> > educating,
> > > but rather than that, it's frustrating,
>
>
> I think this is the core issue. If you do it wrong, you are
> likely to experience frustration at some later point, this
> is the very definition of doing it "wrong". So, if we want
> to spare the user this frustration without them changing
> their behaviour, we must be sure to *always* make it
> "right".

Well, yes, but another reasonable option would be to throw a clear and
unambiguous error as soon as the user has did something he shouldn't
have. I imagine an error such as:

svn: E123456: Repository replaced without server restart; restart the server

... although I realize it might not be easy to detect that situation.
Could we re-read the IID under the commit write lock and compare it to
the cached value? I.e., check that either the IID was unset at
svn_fs_open2() time and is still unset at the start of commit_body(), or
that it was set at svn_fs_open2() time and is still set to the same
value on disk? That wouldn't be bulletproof,¹ but it should narrow the
window for the race.

¹ For example, someone could replace the repository while commit_body()
is running after it checked the IID.

> > > and the procedures describing dump,
> > > load and hotcopy rarely say something about a server restart.
>
>
> Yes, the SVN book is not very explicit about that and
> the release notes have been amended only yesterday.
>
> In the past, our thinking has been along the lines of "It's a
> DB server and who in their right mind would fiddle with
> the disk data and expect everything to work just fine?"
> To an occasional user, it may be much less clear that e.g.
> a tool called "svnadmin" does NOT interact with running
> servers whenever necessary.
>

(See below.)

> > What are the downsides to having instance IDs as part of the cache key?
> > I haven't been able to understand that from your post or from Ben's post
> > you link to (53F4C92B.2010206_at_reser.org), which only mentions a "failure
> > to clear the cache race that was discussed offlist".
> >
> > Is this the problem scenario? ---
> >
> > 1. Open an svn_fs_t handle
> >
> > 2. Replace the repository with another repository having the same UUID
> > but different contents
> >
> > 3. Make a commit from the svn_fs_t opened in (1)
> >
> > 4. The commit creates a corrupted revision due to stale (false positive)
> > cache hits
> >
>
> That is the worst-case scenario. By adding the instance ID
> to the cache key, the only race I can think of is:
>
> * open svn_fs_t, read repo format (sharding!) and config
> * replace repo on disk
> * begin txn
> * ...
> * commit
>
> SVN 1.9 rereads the format upon commit, *probably* making
> the commit fail if things are inconsistent now (txn and rev paths).
> If the repo gets replaced earlier that it's obviously no problem.
> If it was replaced later, the TXN directory will very likely be
> missing.
>
> However, there is more state than caches. So, here are two
> counter-arguments to relying on the instance ID:
>
> * Client connections still become invalid, i.e. reports in flight
> are likely to error out, future requests may or may not result
> in an error.

I don't see a way we can make it work in the general case. For example,
even without IIDs, we would still error out if the repository is
replaced with an 'svndumpfilter exclude'ed copy of itself, and somebody
is checking out the excluded path.

I don't think there's an easy solution here. The FSFS code is designed
around the assumption that history is immutable, and replacing the
repository under a server's feet breaks that assumption. We can make
the cache staleness issues harder to trigger, but we cannot prevent them
entirely.

Perhaps we can do something similar to the 'revprop generation' trick?
I.e., have all the revision data caches keyed by a 'revision generation'
number as well, which 'svnadmin create' and 'svnadmin load' increment?
(I'm not entirely sure that makes sense; I'm just saying, the problem of
replacing revision data is comparable to the problem of detecting
staleness of the in-memory revprop cache.)

>

Another thought: opening revision files using openat(2) instead of
open()-by-name might handle a few more repository-replacement scenarios.
(Those in which the original REPOS_DIR is moved or deleted, rather than
overwritten in-place.) It's another way to make the "revision data
caches are stale" issue harder to hit.

> * The instance ID is only available in format 7 repositories.
> The problem exists in any dump/load scenario - even if
> the format number does not change.
>
> The first one is an inconvenience, the second one is somewhat
> stronger.
>

Cheers,

Daniel

> -- Stefan^2.
Received on 2015-08-26 07:31:08 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.