>> I'm in favor of committing this change. I even volunteer to test it.
>> Without it, my ra_svn tests frequently hang.
> isn't that just masking whatever the real bug is? i mean
> checkpointing more often shouldn't be causing a problem, and if it is,
> we need to figure out why, not ignore it and hope it goes away.
I've been tracking down this issue for about 3 months now, and here is my
guess at what's happening.
Pretty much every svn operation touches the database in some way or
another, even an svn update when nothing has changed in either your
working copy or the repository. So every operation leaves the
repository in a state where txn_checkpoint() has something to do.
Therefore, txn_checkpoint() gets run after every single operation
(in ra_dav mode this includes every PUT).
Normally this isn't too bad, but as your repository grows, the
checkpoint times get longer and longer. Eventually you can reach
the point my 15GB repository is at, where a txn_checkpoint() takes
5 minutes or more.
Any operations on the database after this point will wait in
__os_yield() for a short period of time until the checkpoint has
released its lock on the shared memory for the last log file, which is
needed for quite a few operations. This is why the subversion call
stack appears to get stuck in __os_yield(). If it takes
more than 90 seconds for txn_checkpoint() to release its locks, that's
when you see the neon timeouts over ra_dav.
Because a lot of small operations run on the database in ra_dav mode,
the repository can get into a state where it needs to run 2 or 3
txn_checkpoint() calls in a row. This will easily cause the 90 second neon
timeout. The txn limiting patch attempts to limit the number of
checkpoints that will run in a row under these circumstances, although
it is still very possible to get timeouts if it takes your machine more
than 90 seconds for a single txn_checkpoint() to release its locks.
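
As a rough illustration of the idea of limiting back-to-back checkpoints,
here is a small Python sketch. It is not the patch discussed in this
thread and it does not call Berkeley DB; the class name, the
min_interval_seconds parameter and the injectable clock are all
hypothetical, chosen only to show one way such a limit could work.

```python
import time


class CheckpointThrottle:
    """Skip a checkpoint if another one finished too recently.

    Hypothetical helper for illustration only; not the actual patch.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.last_finished = None     # when the last checkpoint completed

    def maybe_checkpoint(self, do_checkpoint):
        """Run do_checkpoint() unless one completed within min_interval.

        Returns True if the checkpoint ran, False if it was skipped.
        """
        now = self.clock()
        recently_done = (
            self.last_finished is not None
            and now - self.last_finished < self.min_interval
        )
        if recently_done:
            return False
        do_checkpoint()
        # Record the completion time, not the start time, so that a slow
        # checkpoint is not immediately followed by another one.
        self.last_finished = self.clock()
        return True
```

A limit like this only reduces how many checkpoints run in a row. It does
nothing for a single checkpoint that by itself holds its locks for longer
than the client timeout.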
For a multi-user ra_dav server, another fun part of this problem is that
only one txn_checkpoint() can run at a time. Each operation wants
to run txn_checkpoint(), so if you have enough users, eventually every
apache thread will be waiting for a turn to run txn_checkpoint() and
apache will have to spawn some more processes (if it can). If your
apache server is stuck in this mode and you attempt to shut it down, it
could take several hours for apache to finish shutting
down. The txn limiting patch helps, but does not completely address
this issue (you should be able to run about 4x as many users on your
server with the patch applied).
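
To see why serialized checkpoints pile up, here is a back-of-the-envelope
model in Python. It assumes every request arrives at the same moment and
that every checkpoint takes the same time, which is a simplification; the
function names are made up for this example, and only the 90 second neon
timeout comes from the description above.

```python
NEON_TIMEOUT_SECONDS = 90  # client timeout mentioned above


def serialized_waits(num_requests, checkpoint_seconds):
    """Seconds each request waits before its own checkpoint can start,
    when checkpoints run strictly one at a time."""
    return [i * checkpoint_seconds for i in range(num_requests)]


def count_timeouts(num_requests, checkpoint_seconds,
                   timeout=NEON_TIMEOUT_SECONDS):
    """Number of requests whose wait plus own checkpoint exceeds timeout."""
    return sum(
        1
        for wait in serialized_waits(num_requests, checkpoint_seconds)
        if wait + checkpoint_seconds > timeout
    )
```

In this model, three requests queued behind 40 second checkpoints already
push the third one past the timeout, even though no single checkpoint is
anywhere near 90 seconds.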
Received on Wed Feb 19 22:46:19 2003