On Thu, Oct 12, 2006 at 12:56:11PM -0400, Paul Burba wrote:
> 2) Somehow the problem is tied to the basic_corruption test. If that test
> is not run all the other tests pass.
>
This is what I'm fairly sure is happening:
1. Whatever the basic_corruption test ends up doing on the server side
with the first commit is leaving us with a transaction that we've
started to write to, but then neither finished writing nor aborted.
That is exactly the kind of transaction the FSFS fix was designed to
prevent further updates to.
2. There does seem to be a bug in the transaction-locking code that
basic_corruption is hitting, since the second attempt to commit is
getting killed because we think that a transaction with the same ID is
still active (it may well have an open file handle to the proto-rev file
from the first attempt, but I bet we've blown away the transaction
directory by this point, so it is supposed to have forgotten about the
transaction).
3. Having said that, the thing that's really causing a problem (and
causing all the remaining tests to fail) is that the peculiar method we're
using to generate our repositories in the test suite means that all the
repositories we open have the same UUID. This is actually quite bad -
the FSFS code at least uses UUID uniqueness to track per-repository
information (like the fs-wide write lock and the per-txn lock you've
encountered here), so transactions with the same ID in different
repositories all end up with the same UUID/txnid key.
And why doesn't this trigger on Cygwin or Linux? Because svnserve on
those OSs uses the forking model rather than the threaded model by default,
so the fcntl() inter-process lock is coming into play rather than the
intra-process lock (keyed on UUID/txnid) - and the fcntl() lock works
just fine in this situation.
I can reproduce this now by running svnserve in threaded mode on Linux.
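
To illustrate the collision described above, here is a minimal sketch in
Python rather than the real FSFS C code. The registry, the function names
and the UUID value are all hypothetical; the only thing taken from the
discussion is that the in-process lock is keyed on UUID/txnid.

```python
import threading

# Hypothetical in-process registry of per-transaction locks, keyed as
# described above: (repository UUID, transaction ID).
_txn_locks = {}
_registry_guard = threading.Lock()


def try_lock_txn(repo_uuid, txn_id):
    """Return True if the caller now holds the txn, False if it looks active."""
    key = (repo_uuid, txn_id)
    with _registry_guard:
        if key in _txn_locks:
            # Something in this process already holds this key.
            return False
        _txn_locks[key] = threading.current_thread().name
        return True


def unlock_txn(repo_uuid, txn_id):
    """Forget the transaction so that the key can be taken again."""
    with _registry_guard:
        _txn_locks.pop((repo_uuid, txn_id), None)


# Placeholder UUID, standing in for the one every copied test repo shares.
SHARED_UUID = "00000000-0000-0000-0000-000000000000"

# Two different test repositories, same UUID, same txn ID.
first = try_lock_txn(SHARED_UUID, "0-1")   # repository A takes the key
second = try_lock_txn(SHARED_UUID, "0-1")  # repository B is refused

# With a distinct UUID the same txn ID does not collide.
third = try_lock_txn("uuid-of-repository-b", "0-1")
```

In a forking server each connection gets its own copy of such a table, so
the second call would never see the first, which is consistent with the
problem only showing up in threaded mode.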
So:
1. We should look at why the second commit in basic_corruption is
failing (or rather, why the transaction hasn't been forgotten when we
purged the transaction directory from the first, aborted, commit). I
might be able to do that sometime before the end of the Summit.
2. We should re-set the UUID on the new repositories we generate in the
test suite immediately after we hotcopy them. Though offhand, I'm not
sure how we'd achieve this.
3. I think we should actively look to prevent opening 'different'
repositories with the same UUID, since we are (and have always been)
using the UUID to determine filesystem identity in FSFS. Is there a
reliable way to determine if two paths point to the same file in APR?
(it exposes an inode, doesn't it? does anyone know if that works on
Windows and for NFS-mounted files?) If so, we could fail attempts to
open a filesystem that had the same UUID and different inode to an
already-open filesystem. I _think_ this approach is okay w.r.t.
backwards compatibility - anyone have any objections?
4. It would be nice for the client to abort the transaction for that
failing first commit in basic_corruption, if it's not already, and if
it's not too hard to do.
5. I should have tested using svnserve in threaded mode.
Regards,
Malcolm
Received on Fri Oct 13 12:50:38 2006