[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: svn commit: r1040832 - Port a fix for a FSFS packing race

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Wed, 08 Dec 2010 12:51:35 +0000

On Wed, 2010-12-08, Daniel Shahaf wrote:
> [ continuing the non-linear replies ]
>
> Daniel Shahaf wrote on Tue, Dec 07, 2010 at 18:51:12 +0200:
> > Julian Foad wrote on Tue, Dec 07, 2010 at 15:27:05 +0000:
> > > On Tue, 2010-12-07 at 12:13 +0200, Daniel Shahaf wrote:
> > > > Philip Martin wrote on Tue, Dec 07, 2010 at 09:49:40 +0000:
> > > > > Julian Foad <julian.foad_at_wandisco.com> writes:
> > > > > > On Mon, 2010-12-06 at 18:44 +0000, Philip Martin wrote:
> > > > > >> Julian Foad <julian.foad_at_wandisco.com> writes:
> > > > > >>
> > > > > >> > I'm going to drop this "Remove the re-try logic from
> > > > > >> > svn_fs_fs__path_rev_absolute()" follow-up patch, as I don't want to get
> > > > > >> > into checking or messing with the txn-correctness of FSFS now. If
> > > > > >> > Daniel or anyone else wants to pursue it, I'd be glad to help.
> > > > > >>
> > > > > >> I thought the patch was an improvement. The existing retry in
> > > > > >> svn_fs_fs__path_rev_absolute doesn't fix the race, it simply makes the
> > > > > >> window smaller. As the patch makes the window larger we are more likely
> > > > > >> to see and fix any places where the caller doesn't handle the race.
> > > > > >
> > > > > > I *think* the problem is that a caller may use this function within a
> > > > > > transaction and depend on this internal re-try logic simply to update a
> > > > > > stale cached min-unpacked-foo value. "Stale" in the sense that *this*
> > > > > > transaction has changed the real value and hasn't updated the cached
> > > > > > value.
> > > >
> > > > Yes, it's conceivable that a caller might do:
> > > >
> > > > for (int i = 0; i < 2; i++)
> > > > {
> > > > ...
> > > > SVN_ERR(svn_fs_fs__path_rev_absolute());
> > > > ...
> > > > if (i == 0)
> > > > /* other process runs svn_fs_pack(); */;
> > > > [...]
> > >
> > > No, that's not what I mean. I mean a caller might do:
> > >
> > > with a write lock:
> > > {
> > > svn_fs_pack();
> > > SVN_ERR(svn_fs_fs__path_rev_absolute(&path, ...));
> > > foo(path);
> > > }
> > >
> > > where neither this code nor foo() re-tries. With the current version of
> > > svn_fs_fs__path_rev_absolute(), foo() will get the correct path,
> >
> > Unless a commit occurs after svn_fs_fs__path_rev_absolute() returns but
> > before foo() is called. (This is the condition Stefan2 fixed.)

I've already clarified this on IRC, but my example code was bad. I
meant to indicate that the whole sequence is inside a write lock, but it
should say pack_body() where I wrote svn_fs_pack(), since I now
understand that svn_fs_fs__pack() takes a write lock itself and
therefore it's wrong to try to call svn_fs_pack() while holding a write
lock.

> > > whereas if we remove the internal re-try then foo() will get the wrong
> > > path.
> > >
> > > > If there are any
> > > > > such cases then your patch should expose them and we will fix them. Any
> > > > > such errors are easy to handle, compared to the difficulty of fixing
> > > > > races.
> > > > >
> > > > > The real problem with the existing code is that it partially hides
> > > > > races, by making them harder to trigger, but doesn't fix the race. If
> > > > > we want to fix the race properly making it easier to trigger is the
> > > > > right thing to do.
> > >
> > > Agreed.
> > >
> > > What I meant to indicate is that, before making this change, I would
> > > like to see clearly that a caller with the write lock cannot use an
> > > out-of-date value. One way to ensure that would be:
> > >
> > > update the cached value whenever we get the write lock (which we do);
> > > update the cached value whenever we update the value on disk (which we
> > > don't);
> > > anything not using a write lock must accept that the cached value may
> > > be stale, and re-try if necessary.
> > >
> > > Another way would be:
> > >
> > > don't update the cached value when we get the write lock;
> > > don't update the cached value when we update the value on disk;
> > > every time we use the cached value within a write lock, either update
> > > before using it or try and re-try if necessary;
> > > every time we use the cached value without a write lock, try and
> > > re-try if necessary.
> > >
> > > I don't have a strong preference between those, but the current
> > > inconsistency between updating it when we get the write lock and not
> > > updating it when we update the value on disk seems bad.
> > >
> >
> > What is the cost of this inconsistency?
> >
> > The stale cache will result in a failed stat() call on the path where
> > a revision file existed before it was packed. Then, either we will
> > update the cache and retry (meaning the only cost was a stat() call), or
> > due to a bug we will forget to retry and, instead, send a bogus
> > error up the stack.
> >
> > So, we pay one stat(), but in exchange we hope that, by letting the
> > cache *actually* become stale, concurrency bugs that reproduce only
> > when the cache is stale will be unearthed earlier.
>
> In more words: bugs that only trigger under asynchronous conditions are
> more likely to escape our detection. (For example, many of them weren't
> detected by 'make check'.) By forcing the code to exercise the 'cache is
> stale' situation, we attempt to cause some of those bugs to be discovered
> during development, rather than during RC testing or after release.
>
> Makes sense?

One part of me says "yes that sounds logical", another part is still not
sure.

> >
> > In fact, how about this patch?
> >
> > @@ -7638,6 +7639,9 @@ pack_shard(const char *revs_dir,
> > SVN_ERR(write_revnum_file(fs_path, PATH_MIN_UNPACKED_REV,
> > (svn_revnum_t)((shard + 1) * max_files_per_dir),
> > iterpool));
> > +#ifndef SVN_DEBUG
> > + SVN_ERR(update_min_unpacked_rev(fs, iterpool));
> > +#endif
> > svn_pool_destroy(iterpool);
> >
> > /* Finally, remove the existing shard directory. */
> > @@ -7701,6 +7705,9 @@ pack_revprop_shard(svn_fs_t *fs,
> > SVN_ERR(write_revnum_file(fs_path, PATH_MIN_UNPACKED_REVPROP,
> > (svn_revnum_t)((shard + 1) * max_files_per_dir),
> > iterpool));
> > +#ifndef SVN_DEBUG
> > + SVN_ERR(update_min_unpacked_revprop(fs, iterpool));
> > +#endif
> > svn_pool_destroy(iterpool);
> >
> > /* Finally, remove the existing shard directory. */

My intuition says no, making it differ in debug mode versus release mode
is more likely to cause us pain than gain, overall.

Studying the FSFS source code for the issues raised in this thread has
given me confidence that it seems to be doing the right thing, in
practice, at the moment.

In 1043360 I added some comments about how to remove one source of
fragility in using the cached value. Again I'll say this looks
perfectly correct in its current usage.

I'll go ahead and apply the remove-retry-logic-from-path-rev-absolute
patch now.

- Julian
Received on 2010-12-08 13:52:17 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.