Re: Handling assertions and malfunctions in mod_dav_svn

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Tue, 24 Feb 2015 13:20:01 +0100

On Tue, Feb 17, 2015 at 3:44 AM, Ben Reser <ben_at_reser.org> wrote:

>
> I think a potentially bigger problem might be what to do about cache state.
> Cache reads are obviously not a problem. Theoretically, a cache write
> shouldn't happen until after a repository modification so that if we fail
> the
> repository write then we shouldn't end up with stale cache data. The cache
> design to be thread safe should be the same as what we need for this. The
> only
> thing I'm fuzzy on is if it's possible to have an assertion in the middle
> of a
> cache write that creates invalid state. This sort of scenario is the type
> of
> thing that using something like memcached for caching would illuminate
> already
> but I don't think very many people use that support and instead almost all
> of
> our caches are using per process caches, that might have such problems.
> Solving this would be a lot harder than simply auditing the clearing of
> errors,
> it'd be a lot of careful review of cache writes and you have to ensure this
> behavior gets maintained forever (the error clearing requires that too but
> it's
> a fairly simple rule to follow).
>

I just went through the cache_membuffer code looking
for failure modes and this is what I found. Other caches
may behave differently.

* The assert()s within the cache code itself should not
  be in effect in a production (NDEBUG) environment.
* If they are and get triggered, the cache meta data is
  corrupt. A call to the new svn_cache__membuffer_clear
  will reset all meta data.

* A failure in the APR locking code may leave the
  cache in an inaccessible state. That state may be
  transient or permanent. Failing to return the write
  lock will cause an infinite wait for all readers.
  Clearing the cache will not work as it also needs
  to acquire the write lock first.

* Reads and writes use (de-)serialization callbacks
  that might fail with svn_error_t* != 0. Locks will
  still be released orderly.
* svn_cache__set serializes the data first into a temp.
  buffer and only then copies it into the cache. Callback
  failures will not corrupt the cache.
* svn_cach__set_partial modifies the data in-situ.
  If the callback fails, the entry gets removed from
  cache.

* Only svn_cach__get_partial and svn_cach__set_partial
  will execute their callbacks while holding a cache lock.
  Those callbacks may fail but must never abort().
* Some partial getters call into other svn libs which might
  use assert() / abort(). On systems that use the rwlock,
  aborting those will turn (segments of) the cache r/o but
  otherwise functional.
* There is only one partial setter, ATM, and that one does
  not use assert() / abort() nor SVN_ERR_ASSERT.

IOW, binaries that have assert() enabled may leave the
cache in an unusable state if the assertion is triggered
within the cache context. IIUC, there is no way to test
for that context, so we must assume that the cache is
dead.

svn_error_t* returns, OTOH, should be transient and will
not affect cache consistency and usability. Therefore,
SVN_ERR_ASSERT should behave like a normal error
rather than abort, i.e svn_error_malfunction needs to
return and svn_error_t.

Finally, svn_error_malfunction _cannot_ clear the cache
because that may require recursive locks. Instead the
cache clear should happen in the cache error handler
(svn_cache__set_error_handler).

This ultimately is something we have to solve, if we want to use a central
> server model in the future in order to have a central cache, we can't be
> aborting. So while I think doing this right might be slightly painful,
> long
> term it'll pay off.
>

Again, only speaking for cache_membuffer here, there are
multiple options. One would be clearing the cache. Another
is adding another key prefix such that new sessions will
use a separate key space. And there are more.

It all depends on how requests / sessions that have not
failed, yet, are supposed to be treated (best effort, forcibly
abort, ...). Another consideration would be the handling of
repeated failures. A central server could decide to "degrade"
service on a "frequently failing" repository by only using
minimal, session local caches for it. We should have that
on our radar once we start development on a central server.

-- Stefan^2.
Received on 2015-02-24 13:22:03 CET

This message: [ Message body ]
Next message: Branko ÄŒibej: "Re: svn info --show-item [was: svn info --detail]"
Previous message: Branko ÄŒibej: "Re: svn commit: r1661797 - /subversion/branches/1.9.x/STATUS"
In reply to: Ben Reser: "Re: Handling assertions and malfunctions in mod_dav_svn"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]