----- Malcolm Rowe <firstname.lastname@example.org> wrote:
> On Tue, May 30, 2006 at 08:34:46AM -0400, Greg Hudson wrote:
> > On Tue, 2006-05-30 at 02:21 +0100, Malcolm Rowe wrote:
> > > As far as fixing it goes, the right fix is probably to only use a
> > > file handle per revprop file, open for the lifetime of the
> > > (or transaction root, actually).
> > I don't necessarily agree with this, incidentally. The right fix may be
> > to determine when mod_dav_svn isn't properly aborting a commit on a
> > write failure and to fix that.
> That might fix this particular caller, but it won't fix the same
> problem occurring in the future, and this problem is
> destructive, and also insidious (it's typically only visible once you
> get to a revision that wants to use the corrupt revision as a delta
> I'd rather fix the problem definitively in FSFS, especially since we
> have no clue as to what the cause is (it may not even be in
Once again, I agree. :-) This problem, while not widespread, does affect those who see it fairly often. Almost everyone I have helped has several instances of the problem (3-5 occurrences). They're still too infrequent to be useful in determining a reproduction recipe, though.
> > Alternatively, if it's desired to make
> > FSFS more resilient against malfunctioning callers, the right fix may be
> > to mark transactions as dead when write errors occur, and refuse to
> > complete them.
> That would be my preferred approach except for two problems. The first
> is that there's no clean way of doing it - we'd need to check for errors
> (and which errors?) at some boundary - exported functions in fs_fs.c,
> perhaps. The second problem is that APR < 0.9.7 doesn't report write
> errors _at all_ for buffered files, making this check less than useful.
> I'm loath to implement this until we have a report with APR >= 0.9.7.
Without the write errors, I'm not sure how we'd go about doing this cleanly. Perhaps the best answer is that on APR < 0.9.6, we just open the files in unbuffered mode? This, of course, doesn't help the potential issue with ra_serf.
What if we punt on APR < 0.9.7? Meaning, you can't use FSFS on anything less than APR 0.9.7. It doesn't help those suffering from the problem now, but not getting errors during the write operation is obviously a terrible thing.
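To make the "mark the transaction dead" idea concrete, here is a minimal sketch in Python rather than C. Every name in it (Transaction, TransactionDead, FailingStream) is hypothetical and none of it is Subversion or APR API. It only helps if the underlying write actually reports the failure, which is exactly the APR < 0.9.7 concern above.

```python
import io


class TransactionDead(Exception):
    """Raised when a commit is attempted after a write error."""


class Transaction:
    """Hypothetical stand-in for an FSFS transaction; not Subversion's API."""

    def __init__(self, stream):
        self._stream = stream   # any object with write() and flush()
        self.dead = False       # sticky: once set, never cleared

    def write(self, data):
        if self.dead:
            raise TransactionDead("transaction already marked dead")
        try:
            self._stream.write(data)
            # Flush right away so a failure is reported here, not at close.
            self._stream.flush()
        except OSError:
            self.dead = True    # remember the failure even if the caller ignores it
            raise

    def commit(self):
        if self.dead:
            raise TransactionDead("refusing to commit after a write error")
        return True


class FailingStream:
    """Test double whose writes always fail, like a full disk."""

    def write(self, data):
        raise OSError("no space left on device")

    def flush(self):
        pass


def demo():
    good = Transaction(io.BytesIO())
    good.write(b"props")
    bad = Transaction(FailingStream())
    try:
        bad.write(b"props")
    except OSError:
        pass                    # a careless caller swallowing the error
    try:
        bad.commit()
        refused = False
    except TransactionDead:
        refused = True
    return good.commit(), bad.dead, refused
```

The point of the sticky flag is that a caller which ignores the write error still cannot complete the commit.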
> There doesn't really seem to be a 'good' solution to this problem, so
> I'm currently looking for the least-worst solution. The least-worst
> approach that I can currently think of is to hold a reference-counted
> map of root objects to globally-allocated 'root object shared data',
> in which the file handle for the revision is stashed.
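A rough sketch of the reference-counted map described above, in Python rather than C and with hypothetical names throughout (SharedHandles, acquire, release, the "txn-1" key, the file name). It is not FSFS code, just the shape of the idea: one handle per key, closed when the last user lets go.

```python
import os
import tempfile
import threading


class SharedHandles:
    """Hypothetical registry: one open file per key, shared by reference count."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}      # key -> [file object, reference count]

    def acquire(self, key, path):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                # First user of this key opens the file; later users share it.
                entry = [open(path, "ab"), 0]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key):
        with self._lock:
            entry = self._entries[key]
            entry[1] -= 1
            if entry[1] == 0:
                entry[0].close()    # last user gone: close and forget the handle
                del self._entries[key]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "revprops")      # hypothetical file name
        reg = SharedHandles()
        a = reg.acquire("txn-1", path)
        b = reg.acquire("txn-1", path)          # same key: same handle back
        reg.release("txn-1")
        still_open = not a.closed               # one reference remains
        reg.release("txn-1")
        return a is b, still_open, a.closed
```

This only guarantees a single handle within one process; keeping other processes out is a separate matter.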
> That's clearly not trivial, especially because we're actually opening
> _more_ chances for the problem to occur, unless we ensure that only
> one handle to a proto-revprop file can ever be open. I'd therefore
> also suggest that we open the revprop file exclusively, preventing two
> different processes from opening a root object to the same
Just double checking, but this involves using flock() semantics, which means it would also work for the threaded MPM, right?
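For reference on that question: plain flock(2) locks on Linux and the BSDs belong to the open file description rather than to the process, so a second open() of the same file in the same process is refused as well, whereas fcntl()-style record locks are per-process and would not keep threads apart. Whether APR's locking maps to one or the other on a given platform is not settled here. A small sketch in Python, assuming a local POSIX filesystem and a hypothetical file name:

```python
import fcntl
import os
import tempfile


def try_exclusive(path):
    """Open path and take a non-blocking exclusive flock(); return fd or None."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)            # someone else holds the lock
        return None
    return fd


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "proto-revprops")    # hypothetical file name
        first = try_exclusive(path)
        second = try_exclusive(path)    # same process, separate open()
        os.close(first)                 # closing the descriptor drops the lock
        third = try_exclusive(path)
        os.close(third)
        return first is not None, second is None, third is not None
```

If the second attempt is refused inside a single process, the same would hold between threads that each open the file themselves.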
> It's also possible that the problems that were being discussed earlier
> (in the context of ra_serf) are somehow manifesting here. If so, this
> would prevent them.
> > (This is assuming the problem really is bad logic in mod_dav_svn. We
> > don't have a complete understanding of what's going wrong in these
> > corruption cases because we can't reproduce them.)
Absolutely. There is nothing definitive except that mod_dav_svn has been the only one to tickle the problem. Whenever I've alluded to there being an issue with mod_dav_svn, I've really meant that there appears to be an issue with FSFS and mod_dav_svn together. That could be because it's the only one that can operate in a fashion to cause the problem, or it could be a function of mod_dav_svn. I lean towards the former, but definitely would not rule out the latter.
Unfortunately, I don't think we're going to ever see a reproduction recipe. The problem is infrequent, and I believe a series of events needs to fall into place in order for the problem to present itself. OTOH, ra_serf and pipelining bring about similar issues (in that there may be more than one writer to the prototype revprop file), and that problem is a little more reproducible. :-)
Received on Wed May 31 10:06:25 2006