Re: Issue #4588, part 1: FSFS access error

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Tue, 25 Aug 2015 15:13:04 +0100

On Tue, Aug 25, 2015 at 7:15 AM, Daniel Shahaf <d.s_at_daniel.shahaf.name>
wrote:

> Evgeny Kotkov wrote on Tue, Aug 25, 2015 at 02:47:16 +0300:
> > Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com> writes:
> >
> > > My current hypothesis is that the server did not get restarted after
> > > replacing the repository. Because we decided not to make the instance
> ID
> > > part of the cache key, we could easily have picked up cached format 6
> data
> > > for the format 7 repository.
> >
> > [...]
> >
> > > That said, are we still happy with the decision to not make the
> instance
> > > ID part of the cache key? The rationale has basically been "fail early"
> > > because failure to restart or reconfigure the server after the repo
> files
> > > got modified might lead to any kind of unknown problems (much) further
> down
> > > the road.
> >
> > As I see it, there are two separate problems:
> >
>
> I see it the same way.
>
> > 1) First of all, I am thinking that we should indeed revisit the
> decision of
> > not making instance IDs a part of the cache keys. As far as I
> recall, this
> > part of the change was reverted after we agreed that failing fast is
> better
> > than narrowing the window when the cache issues exist [1] (there are
> still
> > scenarios like backing up and restoring the repository with cp -a).
> >
> > Now I am almost sure that we should redo this change. The experience
> of
> > receiving errors related to stale cache entries isn't exactly
> educating,
> > but rather than that, it's frustrating,

I think this is the core issue. If you do it wrong, you are
likely to experience frustration at some later point, this
is the very definition of doing it "wrong". So, if we want
to spare the user this frustration without them changing
their behaviour, we must be sure to *always* make it
"right".

> > and the procedures describing dump,
> > load and hotcopy rarely say something about a server restart.

Yes, the SVN book is not very explicit about that and
the release notes have been amended only yesterday.

In the past, our thinking has been along the lines of "It's a
DB server and who in their right mind would fiddle with
the disk data and expect everything to work just fine?"
To an occasional user, it may be much less clear that e.g.
a tool called "svnadmin" does NOT interact with running
servers whenever necessary.

> > As we have
> > the mechanism to make a huge set of cases work without the necessity
> of
> > performing additional actions, I think that we should do it, leaving
> the
> > edge cases as they are. In other words, people who are used to
> svnadmin
> > dump/load/hotcopy shouldn't struggle, because we think that they also
> could
> > be doing something like the aforementioned cp -a.
>

Well, getting all cases that involve svnadmin right would
be enough in my book. People actually fiddling with
repository details on disk need to know what they are
doing. Those are the same kind of people that need to
know about the rep-cache.db.

We would also need an explicit way to bump the
instance ID to support "restore" scenarios to cover the
full range of admin activity. Maybe some cache reset
UI on the server would be useful as well.

> What are the downsides to having instance IDs as part of the cache key?
> I haven't been able to understand that from your post or from Ben's post
> you link to (53F4C92B.2010206_at_reser.org), which only mentions a "failure
> to clear the cache race that was discussed offlist".
>
> Is this the problem scenario? ---
>
> 1. Open an svn_fs_t handle
>
> 2. Replace the repository with another repository having the same UUID
> but different contents
>
> 3. Make a commit from the svn_fs_t opened in (1)
>
> 4. The commit creates a corrupted revision due to stale (false positive)
> cache hits
>

That is the worst-case scenario. By adding the instance ID
to the cache key, the only race I can think of is:

* open svn_fs_t, read repo format (sharding!) and config
* replace repo on disk
* begin txn
* ...
* commit

SVN 1.9 rereads the format upon commit, *probably* making
the commit fail if things are inconsistent now (txn and rev paths).
If the repo gets replaced earlier that it's obviously no problem.
If it was replaced later, the TXN directory will very likely be
missing.

However, there is more state than caches. So, here are two
counter-arguments to relying on the instance ID:

* Client connections still become invalid, i.e. reports in flight
  are likely to error out, future requests may or may not result
  in an error.
* The instance ID is only available in format 7 repositories.
  The problem exists in any dump/load scenario - even if
  the format number does not change.

The first one is an inconvenience, the second one is somewhat
stronger.

-- Stefan^2.
Received on 2015-08-25 17:06:52 CEST

This message: [ Message body ]
Next message: Thomas Singer: "Re: JavaHL, 1.9: "Bad file descriptor", "Stream doesn't support thiscapability" errors"
Previous message: C. Michael Pilato: "New 'svn cp --pin-externals' feature compat question."
In reply to: Daniel Shahaf: "Re: Issue #4588, part 1: FSFS access error"
Next in thread: Daniel Shahaf: "Re: Issue #4588, part 1: FSFS access error"
Reply: Daniel Shahaf: "Re: Issue #4588, part 1: FSFS access error"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]