Re: svn commit: r1040832 - Port a fix for a FSFS packing race

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Tue, 07 Dec 2010 15:27:05 +0000

(Quoting and replying to two emails at once.)

First: I'm assuming that no process will ever do packing without getting
the exclusive write lock. Can you confirm that for me? If that's
false, all my reasoning would be bogus.

On Tue, 2010-12-07 at 12:13 +0200, Daniel Shahaf wrote:
> Philip Martin wrote on Tue, Dec 07, 2010 at 09:49:40 +0000:
> > Julian Foad <julian.foad_at_wandisco.com> writes:
> >
> > > On Mon, 2010-12-06 at 18:44 +0000, Philip Martin wrote:
> > >> Julian Foad <julian.foad_at_wandisco.com> writes:
> > >>
> > >> > I'm going to drop this "Remove the re-try logic from
> > >> > svn_fs_fs__path_rev_absolute()" follow-up patch, as I don't want to get
> > >> > into checking or messing with the txn-correctness of FSFS now. If
> > >> > Daniel or anyone else wants to pursue it, I'd be glad to help.
> > >>
> > >> I thought the patch was an improvement. The existing retry in
> > >> svn_fs_fs__path_rev_absolute doesn't fix the race, it simply makes the
> > >> window smaller. As the patch makes the window larger we are more likely
> > >> to see and fix any places where the caller doesn't handle the race.
> > >
> > > I *think* the problem is that a caller may use this function within a
> > > transaction and depend on this internal re-try logic simply to update a
> > > stale cached min-unpacked-foo value. "Stale" in the sense that *this*
> > > transaction has changed the real value and hasn't updated the cached
> > > value.
>
> Yes, it's conceivable that a caller might do:
>
> for (int i = 0; i < 2; i++)
> {
> ...
> SVN_ERR(svn_fs_fs__path_rev_absolute());
> ...
> if (i == 0)
> /* other process runs svn_fs_pack(); */;
> [...]

No, that's not what I mean. I mean a caller might do:

  with a write lock:
    {
      svn_fs_pack();
      SVN_ERR(svn_fs_fs__path_rev_absolute(&path, ...));
      foo(path);
    }

where neither this code nor foo() re-tries. With the current version of
svn_fs_fs__path_rev_absolute(), foo() will get the correct path, whereas
if we remove the internal re-try then foo() will get the wrong path.

> > That's not really a race, it's more of a logic error.

That's right.

> If there are any
> > such cases then your patch should expose them and we will fix them. Any
> > such errors are easy to handle, compared to the difficulty of fixing
> > races.
> >
> > The real problem with the existing code is that it partially hides
> > races, by making them harder to trigger, but doesn't fix the race. If
> > we want to fix the race properly making it easier to trigger is the
> > right thing to do.

Agreed.

What I meant to indicate is that, before making this change, I would
like to see clearly that a caller with the write lock cannot use an
out-of-date value. One way to ensure that would be:

  update the cached value whenever we get the write lock (which we do);
  update the cached value whenever we update the value on disk (which we
don't);
  anything not using a write lock must accept that the cached value may
be stale, and re-try if necessary.

Another way would be:

  don't update the cached value when we get the write lock;
  don't update the cached value when we update the value on disk;
  every time we use the cached value within a write lock, either update
before using it or try and re-try if necessary;
  every time we use the cached value without a write lock, try and
re-try if necessary.

I don't have a strong preference between those, but the current
inconsistency between updating it when we get the write lock and not
updating it when we update the value on disk seems bad.

This doesn't mean we should keep the re-try loop that we're talking
about removing. I agree it should go. It's just that I'd like to
resolve this inconsistency, or at least agree about resolving it, before
feeling comfortable enough to commit this myself.

> As I said, I think all current callers are safe. However, I agree that
> applying the patch will make it easier to detect the mistake if in the
> future we add callers to svn_fs_fs__path_rev_absolute() that do not
> retry as they should.

OK, that's great.

> > > Daniel, you wrote:
> > >> But is it strictly *necessary* to do so? I think not: by not calling
> > >> update_min_unpacked_rev(), we cause ffd->min_unpacked_rev to become
> > >> stale --- a situation which the code has to account for anyway.
> > >
> > > That was true, then, and it meant that every place that used this value
> > > had to account for it possibly being stale. But now we have realized
> > > that the code shouldn't need to account for the possibility of the value
> > > being stale if it is running within a transaction (with the write lock).
> > >
> > > Now, I think it would be better to change this. Whenever the code
> > > updates the min-foo on disk, it should also update the cached value.
> > > That style of coding would be easier to reason about (to use an in-vogue
> > > phrase), and then we would be able to remove the re-try in
> > > svn_fs_fs__path_rev_absolute() without a concern.

> Making ffd->min_unpacked_rev stale less often does not mean that we will
> have less places to account for it *possibly* being stale:

I think it *does*, because every piece of code that knows it has a write
lock would not have to bother about that possibility.

> it could
> become stale due to other processes who write to the repository.
> (except for "pack runs under the write lock" considerations)

It could become stale due to another process writing to the repository,
except if *this* code is running under a write lock. If *this* code has
a write lock, it doesn't matter whether it's doing packing or not;
another process can't make this process's cache get stale because
another process can't get the write lock.

> To borrow Philip's point: making ffd->min_unpacked_rev stale less often
> might hide bugs.

That is one way of looking at it, when thinking about the asynchronous
(no write lock) cases, but another way of looking at the same thing is
it can help to *totally avoid* potential bugs in synchronous cases.

- Julian
Received on 2010-12-07 16:27:49 CET

This message: [ Message body ]
Next message: David Dyer-Bennet: "Re: Bug report -- space in env. var. VISUAL causes commits to fail (needs confirmation)"
Previous message: Peter Samuelson: "Re: Bug report -- space in env. var. VISUAL causes commits to fail (needs confirmation)"
In reply to: Daniel Shahaf: "Re: svn commit: r1040832 - Port a fix for a FSFS packing race"
Next in thread: Daniel Shahaf: "Re: svn commit: r1040832 - Port a fix for a FSFS packing race"
Reply: Daniel Shahaf: "Re: svn commit: r1040832 - Port a fix for a FSFS packing race"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]