
Re: zombie ssh processes

From: Kyle McKay <mackyle_at_gmail.com>
Date: Fri, 27 Mar 2009 11:59:25 -0700

On Mar 27, 2009, at 09:51, Hyrum K. Wright wrote:
> On Mar 27, 2009, at 11:21 AM, Kyle McKay wrote:
>
>> From the xcode-users mailing list:
>>
>>> From: Chris Espinosa <cde_at_apple.com>
>>> Date: March 26, 2009 15:39:14 PDT
>>> To: Xcode Users <xcode-users_at_lists.apple.com>
>>> Subject: Re: Xcode 3.1.2 and Subversion 1.6
>>>
>>> On Mar 25, 2009, at 3:19 PM, Chris Espinosa wrote:
>>>> On Mar 25, 2009, at 3:07 PM, Rob Lockstone wrote:
>>>>
>>>>> Has anyone tried using Xcode 3.1.2 with the subversion 1.6.0
>>>>> client? I think I recall (but may be wrong) that newer versions of
>>>>> Xcode don't make assumptions about the version of subversion
>>>>> that's installed and simply use whatever version it finds.
>>>>
>>>> We have not yet qualified any version of Xcode with Subversion
>>>> 1.6.0 and don't recommend replacing existing Subversion library or
>>>> client code with 1.6 until we've given it the green light.
>>>
>>> We've discovered in internal testing that this patch in Subversion
>>> 1.6:
>>>
>>> http://svn.collab.net/viewvc/svn?view=revision&revision=35533
>>>
>>> can cause Subversion 1.6 to leave behind zombie ssh processes every
>>> time you save a file in Xcode, and eventually exhaust your ability
>>> to spawn new processes. We don't recommend using Subversion 1.6
>>> with Xcode 3.1.x at this time.
>>>
>>> Chris
>>
>> From:
>>
>> http://svn.apache.org/repos/asf/apr/apr/tags/1.0.0/include/apr_thread_proc.h
>>
>> APR_KILL_NEVER // process is never sent any signals
>> APR_KILL_ALWAYS // process is sent SIGKILL on apr_pool_t cleanup
>> APR_KILL_AFTER_TIMEOUT // SIGTERM, wait 3 seconds, SIGKILL
>> APR_JUST_WAIT // wait forever for the process to complete
>> APR_KILL_ONLY_ONCE // send SIGTERM and then wait
>>
>> Restoring the apr_pool_note_subprocess and using APR_KILL_NEVER would
>> allow the children to be reaped provided they exit before pool
>> cleanup.
>>
>> However, that would likely not eliminate the zombie problem in Xcode
>> as pool cleanup probably happens faster than ssh cleanup and exit in
>> some cases. How about using APR_KILL_AFTER_TIMEOUT or
>> APR_KILL_ONLY_ONCE (or even APR_JUST_WAIT) ?
>
> How would this interact with ssh connection pooling? The case which
> drove r35533 was a user who uses ssh connection pooling for svn
> connections. Having svn kill the ssh connection is obviously
> hazardous to such a scheme, how would using the other APR_KILL_*
> conditions behave there (and would they fix the problem with XCode)?
>
> -Hyrum

I used this configuration in ~/.ssh/config for the tests below:

ControlMaster auto
ControlPath /tmp/sshpool-%l-%r@%h:%p

Also, the ~/.ssh/id_rsa.pub key has been added to
~/.ssh/authorized_keys so that ssh localhost works without prompting
for a password.

It seems that the master and slave ssh connections are related only
through the Unix socket. So when the master exits for whatever reason
(SIGHUP, SIGINT, SIGTERM, SIGKILL, etc.), the Unix socket goes away and
all the slave ssh connections die.
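
The dependency described above looks like ordinary end-of-file behaviour on a Unix-domain socket: once the end held by one process is closed, the peer's next read returns nothing. The sketch below only illustrates that mechanism inside a single process. It is not ssh's multiplexing protocol, and the function and variable names are made up for the example.

```python
import socket


def slave_sees_master_exit():
    """Return what the 'slave' end reads before and after the 'master' end closes."""
    # A connected pair of Unix-domain stream sockets stands in for the
    # control socket shared by the master and a slave connection.
    master_end, slave_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with slave_end:
        master_end.sendall(b"ok")
        before = slave_end.recv(16)   # data while the master end is alive
        master_end.close()            # stands in for the master exiting
        after = slave_end.recv(16)    # empty bytes: end-of-file
    return before, after
```

The second value returned is the empty byte string, which is how the surviving end learns that its peer is gone.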

------
TEST 1
------
Suppose you do the following (with ssh configured as mentioned above):

ssh localhost sleep 15

And wait 15 seconds. The ssh process exits normally.

------
TEST 2
------
Do this:

ssh localhost sleep 15

And within 15 seconds, go to another window/terminal/screen and do this:

ssh -t localhost top # The "-t" option is important here

If you go back to the first window, you'll notice that even after 15
seconds the first ssh process doesn't exit. Go back to the window
running top and press "q". You should see a message about "Shared
connection to localhost closed." and if you now go back to the first
ssh process, you'll see that it has exited normally.

------
TEST 3
------
Do this:

ssh localhost sleep 15

And within 15 seconds, go to another window/terminal/screen and do this:

ssh localhost top # DO NOT USE "-t" THIS TIME

If you go back to the first window, you'll notice that even after 15
seconds the first ssh process doesn't exit. Go back to the window
running top and press Ctrl-C. The second ssh process should exit, but
you will not see the message about closing the shared connection. If
you go back to the first ssh process, you'll see that it's still
waiting to exit. The abnormal exit of the second ssh process keeps the
first (master) ssh process from noticing that the slave has gone away,
so it will never exit unless you send it a signal such as SIGHUP,
SIGINT, SIGTERM, or SIGKILL.

-----------
CONCLUSIONS
-----------
1. For ssh connection pooling with ControlMaster/ControlPath to work,
the ssh master process must not receive any kind of hup/interrupt/quit/
terminate signal.
2. If any of the slave ssh connections fail to exit normally, the
master ssh connection will never exit without some kind of signal sent
to it.
3. ASSUMPTION: The probability of some slave ssh connection exiting
abnormally is low, but > 0.
4. ASSUMPTION: The probability of Subversion's ssh connection being
the master connection is > 0.
5. Given #1, #2, #3, and #4, Subversion will sooner or later leave
behind a running ssh master connection.

6. Omitting the apr_pool_note_subprocess call prevents Subversion from
reaping its children. This will ALWAYS result in zombies unless there
has been a signal(SIGCHLD, SIG_IGN) call previously (such a call would
be highly unfriendly to any program linked with the Subversion library).
7. When Subversion is accessed directly via the API from the
Subversion library, it may be part of a long-running process that
persists across many Subversion operations.
8. Given #6 and #7, the number of zombies created may eventually
overwhelm the system resources and exhaust your ability to spawn new
processes (this is what's happening with Xcode, but could happen to
any long-running program that links with the Subversion library).

9. The only apr_pool_note_subprocess option that neither sends a
signal nor waits indefinitely is APR_KILL_NEVER. However, it will still
reap zombie children provided they have exited by pool cleanup time.
10. ASSUMPTION: The probability of svn's ssh connection exiting after
the pool cleanup is low, but > 0.
11. Given #9 and #10, svn will probably still create a few zombies even
with APR_KILL_NEVER, but likely far fewer than without any
apr_pool_note_subprocess call at all.

12. You can't have it both ways (never create zombies or stray ssh
processes AND support ssh connection pooling) without some kind of
configuration option, as the required signaling behaviors are mutually
exclusive.
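
To make the trade-off in these conclusions concrete, here is a rough Python model of what pool cleanup does to one noted child under each kill condition. It follows the one-line descriptions quoted from apr_thread_proc.h above and is not APR's code. KillCondition, spawn, cleanup and demo are hypothetical names, and os.waitid is assumed to be available (it is on Linux).

```python
import enum
import os
import subprocess
import sys


class KillCondition(enum.Enum):
    """Stand-ins for the apr_kill_conditions_e values (illustrative only)."""
    NEVER = "APR_KILL_NEVER"
    ALWAYS = "APR_KILL_ALWAYS"
    AFTER_TIMEOUT = "APR_KILL_AFTER_TIMEOUT"
    JUST_WAIT = "APR_JUST_WAIT"
    ONLY_ONCE = "APR_KILL_ONLY_ONCE"


def spawn(code):
    """Start a child Python interpreter running `code` (hypothetical helper)."""
    return subprocess.Popen([sys.executable, "-c", code])


def cleanup(proc, how, timeout=3.0):
    """Model cleanup for one noted child; return True if it was reaped."""
    if proc.poll() is not None:      # non-blocking wait: reaps an exited child
        return True
    if how is KillCondition.NEVER:
        return False                 # still running: no signal, no wait
    if how in (KillCondition.AFTER_TIMEOUT, KillCondition.ONLY_ONCE):
        proc.terminate()             # SIGTERM
    elif how is KillCondition.ALWAYS:
        proc.kill()                  # SIGKILL
    if how is KillCondition.AFTER_TIMEOUT:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()              # SIGTERM was not enough: SIGKILL
    proc.wait()                      # blocking wait reaps the child
    return True


def demo():
    quick = spawn("pass")
    # Block until the child has exited but leave it unreaped (a zombie).
    os.waitid(os.P_PID, quick.pid, os.WEXITED | os.WNOWAIT)
    exited_reaped = cleanup(quick, KillCondition.NEVER)

    slow = spawn("import time; time.sleep(30)")
    running_reaped = cleanup(slow, KillCondition.NEVER)   # left alone
    killed_reaped = cleanup(slow, KillCondition.ALWAYS)   # tidy up
    return exited_reaped, running_reaped, killed_reaped
```

In this model demo() returns (True, False, True): a child that has already exited is reaped even under NEVER, a child that is still running is neither signalled nor reaped under NEVER, and ALWAYS kills and reaps it. The middle case is the one that can still leave a zombie behind in a long-running process.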

---------------
RECOMMENDATIONS
---------------
1. Short term. Revert change 35533 and then change the
APR_KILL_ALWAYS to an APR_KILL_NEVER. This will likely eliminate most
of the zombie problems for long-lived processes using the Subversion
library while remaining compatible with ssh connection pooling
(ControlMaster/ControlPath).
2. Longer term. Add an option to the ~/.subversion/config file
([tunnels] section?) that lets you select the apr_kill_conditions_e
value passed to apr_pool_note_subprocess, defaulting to APR_KILL_NEVER
if not given (apr-kill-condition = ... ?).
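
As a sketch of how the second recommendation might look from the configuration side, the Python below reads a hypothetical apr-kill-condition option from a [tunnels] section and falls back to APR_KILL_NEVER. No such option exists in Subversion; the option name comes from the suggestion above, and the accepted values (never, always, after-timeout, just-wait, only-once) are invented for illustration.

```python
import configparser

# Hypothetical option values mapped to the APR constant names.
KILL_CONDITIONS = {
    "never": "APR_KILL_NEVER",
    "always": "APR_KILL_ALWAYS",
    "after-timeout": "APR_KILL_AFTER_TIMEOUT",
    "just-wait": "APR_JUST_WAIT",
    "only-once": "APR_KILL_ONLY_ONCE",
}


def kill_condition_from_config(text):
    """Return the APR constant name selected by INI-style config text."""
    # interpolation=None so that '%' in other options is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Default to "never" when the section or the option is missing.
    value = parser.get("tunnels", "apr-kill-condition", fallback="never")
    key = value.strip().lower()
    if key not in KILL_CONDITIONS:
        raise ValueError("unknown apr-kill-condition: " + value)
    return KILL_CONDITIONS[key]
```

For example, a config containing "[tunnels]" followed by "apr-kill-condition = after-timeout" selects APR_KILL_AFTER_TIMEOUT, while a config with no such line keeps the pooling-friendly APR_KILL_NEVER default.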

Kyle

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1447379
Received on 2009-03-28 11:57:27 CET

This is an archived mail posted to the Subversion Dev mailing list.
