Thanks, Brandon, this is a very good analysis. And it confirms my
suspicions that we're using _way_ too many transactions, and issuing far
too many txn_checkpoint calls.
I think *the* major tasks for 0.19 are:
* Create a DB monitor that can detect crashed sessions and
automagically unwedge the DB.
* Stop creating transactions for read-only requests, and use
ordinary locks instead.
* Reduce the number of txn_checkpoint calls in our code, or even
eliminate them completely.
Before anyone starts wondering if I'm off my rocker, consider this: you
only really need a txn_checkpoint when you're doing a hot backup of the
database, or removing old log files. Therefore, checkpoints should be
issued by the backup/cleanup scripts, definitely not in the critical path.
I actually think moving the checkpointing out of the main code is the
simplest of the three.
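A rough sketch of what such an out-of-band maintenance pass could look
like, in Python. It assumes the stock Berkeley DB command-line utilities
db_checkpoint and db_archive are on the PATH (and that the installed
release provides db_archive's -d option), and that the environment lives
in the repository's db/ subdirectory. The function names are made up for
illustration; this is not an existing Subversion script.

```python
import subprocess


def maintenance_commands(db_home):
    """Build the Berkeley DB utility calls for one maintenance pass.

    db_home is the Berkeley DB environment directory (assumed here to be
    the repository's db/ subdirectory).
    """
    return [
        # Force a single checkpoint, then exit.
        ["db_checkpoint", "-1", "-h", db_home],
        # Remove log files that are no longer needed after the checkpoint.
        ["db_archive", "-d", "-h", db_home],
    ]


def run_maintenance(db_home, runner=subprocess.check_call):
    """Run the checkpoint first, then the log cleanup, stopping on failure."""
    for argv in maintenance_commands(db_home):
        runner(argv)
```

Something like run_maintenance("/path/to/repos/db") (a placeholder path)
could then be called from cron or from the backup script, so that no
client request ever has to wait on a checkpoint.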
Brandon Ehle wrote:
>>> I'm in favor of committing this change. I even volunteer to test it.
>>> Without it, my ra_svn tests frequently hang.
>> isn't that just masking whatever the real bug is? i mean
>> checkpointing more often shouldn't be causing a problem, and if it
>> is, we need to figure out why, not ignore it and hope it goes away.
> I've been tracking down this issue for about 3 months now and here is
> my guess on what's happening.
> Pretty much every svn operation touches the database in some way or
> another, even an svn update when nothing has changed in either your
> working copy or the repository, so every operation will put the
> repository in a state where txn_checkpoint() has something to do.
> Therefore, txn_checkpoint() will get run after every single operation
> (in ra_dav mode this includes every PUT).
> Normally this isn't too bad, but as your repository grows, the
> checkpoint times will get larger and larger and eventually you could
> get to the point where my 15GB repository is, and a txn_checkpoint()
> takes 5 minutes or more.
> Any operations on the database after this point will wait in
> __os_yield() for a short period of time until the checkpoint has
> released its lock on the shared memory for the last log file, which is
> needed for quite a few operations. This is why the subversion call
> stack appears to get stuck in __os_yield(). If it takes
> more than 90 seconds for txn_checkpoint() to release its locks, that's
> when you see the neon timeouts over ra_dav.
> As a lot of small operations are running on the database in ra_dav
> mode, the repository can get into a state where it needs to run 2 or 3
> txn_checkpoint() calls in a row. This will easily cause the 90 second
> neon timeout. The txn limiting patch attempts to limit the number of
> checkpoints that will run in a row under these circumstances, although
> it is still very possible to get timeouts if it takes your machine
> more than 90 seconds for a single txn_checkpoint() to release its locks.
> For a multi-user ra_dav server, another fun part of this problem is
> that only one txn_checkpoint() can run at a time. Each operation
> wants to run txn_checkpoint(), so if you have enough users,
> eventually every apache thread will be waiting for a turn to run
> txn_checkpoint(), and apache will have to spawn some more processes (if
> it can). If your apache server is stuck in this mode and you attempt
> to shut it down, it could take on the order of several hours until
> apache finishes shutting down. The txn limiting patch helps, but
> does not completely address this issue (you should be able to run
> about 4x as many users on your server with the patch applied).
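To put rough numbers on the queueing effect described above, here is a
toy Python model. It assumes checkpoints run strictly one after another
and that each takes the same time. The 40-second duration is invented
for the example; only the 90-second neon timeout comes from the
analysis.

```python
NEON_TIMEOUT_SECONDS = 90  # timeout quoted in the analysis


def finish_times(queued, checkpoint_seconds):
    """Completion time of each queued request when checkpoints are serialised.

    The k-th request (1-based) waits for the k-1 checkpoints ahead of it
    and then runs its own, so it finishes after k * checkpoint_seconds.
    """
    return [k * checkpoint_seconds for k in range(1, queued + 1)]


def timed_out(queued, checkpoint_seconds, timeout=NEON_TIMEOUT_SECONDS):
    """Number of queued requests that finish after the client timeout."""
    return sum(1 for t in finish_times(queued, checkpoint_seconds) if t > timeout)


# Three checkpoints in a row at a hypothetical 40 seconds each:
# the requests finish at 40, 80 and 120 seconds, so the third one times out.
print(finish_times(3, 40))  # [40, 80, 120]
print(timed_out(3, 40))     # 1
```

Once a single checkpoint takes longer than the timeout, every queued
request times out in this model, which is why limiting the number of
consecutive checkpoints can only help up to a point.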
Brane Čibej <brane_at_xbc.nu> http://www.xbc.nu/brane/
Received on Thu Feb 20 06:37:44 2003