Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]

From: Stefan Sperling <stsp_at_elego.de>
Date: Sat, 24 Oct 2009 13:14:14 +0200

On Fri, Oct 23, 2009 at 07:59:23PM -0700, Bob Denny wrote:
> Stefan --
>
> I just found this up in the thread tree, and I apologize for not
> replying sooner. This discussion thread has become confusing. I wonder
> if you read my replies to Paul, in which I lay it all out in clear
> terms. But let me respond to your response to Paul:
>
> > The problem is:
> >
> > We need to terminate the tunnel agent (ssh client) by sending SIGTERM.
>
> Not on Windows. There is no ssh connection pooling.

Are you sure? OpenSSH supports it. As far as I know people can use it
on Windows e.g. with Cygwin.

> > By the way, Bob, maybe you are running a version of sshd affected by that
> > bug? If so, could you try updating sshd and see if that solves the problem,
> > without your patch applied?
>
> I don't know. The sshd is on a remote service (A2 Hosting). And I
> don't care, because the problem is the harsh killing of the
> local/client tunnel agent (e.g. PLink). That's what starts it all.

If you are testing this patch against a remote hosting service, how can
you be sure that nothing bad happens on the server side when the client
is not killed?

I can live with the idea that simply closing the file descriptor should
terminate the ssh client, and it in turn should close the TCP socket.
The whole machinery should exit gracefully if this is done. On any OS.
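
A minimal sketch of that idea, in Python rather than in Subversion's
C/APR code: the parent closes its end of the child's stdin pipe and then
simply waits. The child is a hypothetical stand-in for a tunnel agent
(it only echoes stdin until EOF), and the name close_and_wait is made up
for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a tunnel agent: echo stdin until EOF, then exit 0.
CHILD = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line)\n"

def close_and_wait(timeout=10.0):
    """Close the child's stdin and wait for it to exit on its own."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    proc.stdin.write(b"hello\n")
    proc.stdin.close()            # the child sees EOF on its stdin
    echoed = proc.stdout.read()   # returns once the child closes stdout
    proc.stdout.close()
    code = proc.wait(timeout=timeout)  # no signal is sent at any point
    return code, echoed.strip()

if __name__ == "__main__":
    print(close_and_wait())
```

A real ssh client also has a network connection to shut down, so this
shows only the local half of the chain.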

But the fact that we have been killing the ssh client with
APR_KILL_ALWAYS *for years* seems to indicate that there is a problem
with not killing it. Which problem that might be is uncertain; the only
information I have is this comment:

   * Closing the pipes and waiting for the process to die
   * was prone to mysterious hangs which are difficult to
   * diagnose (e.g. svnserve dumps core due to unrelated bug;
   * sshd goes into zombie state; ssh connection is never
   * closed; ssh never terminates).

I'd rather be careful when changing the default to something that
is known to have caused problems in the past.

But who knows, maybe the problems people were seeing back then were
phantoms and we can switch to APR_KILL_NEVER on any platform without
causing trouble.
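
To illustrate the trade-off, here is a rough Python sketch. It is not
what Subversion or APR does, and not what the patch proposes: close the
pipe, wait for a bounded time, and kill only a child that is still
running after that. Both child programs and the function name are
hypothetical.

```python
import subprocess
import sys

# Hypothetical children: one exits on EOF, one ignores EOF and lingers.
EXITS_ON_EOF = "import sys; sys.stdin.read()"
IGNORES_EOF = "import time; time.sleep(60)"

def close_then_kill_if_stuck(child_source, grace=5.0):
    """Close stdin, wait up to `grace` seconds, kill only as a last resort."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source],
        stdin=subprocess.PIPE,
    )
    proc.stdin.close()
    try:
        proc.wait(timeout=grace)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()   # forced termination, no chance for cleanup
        proc.wait()   # reap the child so it does not linger
        return "killed"

if __name__ == "__main__":
    print(close_then_kill_if_stuck(EXITS_ON_EOF))
    print(close_then_kill_if_stuck(IGNORES_EOF, grace=1.0))
```

The bounded wait avoids the "ssh never terminates" kind of hang, at the
price of falling back to exactly the harsh kill we are trying to avoid.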

> Did you read this:
>
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2410204
>
> If you instantly kill the tunnel agent, EVERYTHING runs through error
> paths, and depending on timing (client CPU and network latency) the
> remote sshd and its svnserve child can be left hanging.

Yes, that's bad. And we don't want to instantly kill the tunnel agent.
We don't do it on UNIX anymore (since 1.6.5). When I made that change I
was unaware that the situation on Windows did not change, because APR
does not (or cannot) handle SIGTERM on Windows.
From reading the APR code, it seems there is no difference at all
between APR_KILL_ONLY_ONCE and APR_KILL_ALWAYS on Windows. So the
problem you are trying to fix should already have existed before 1.6.5.
Can you confirm this, if only to help me make sure I've understood the
problem?
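
The platform difference can be shown with a small Python sketch. This
is an analogy only, not APR code. According to the Python documentation,
Popen.terminate() sends SIGTERM on POSIX and calls TerminateProcess() on
Windows. The child program and the name terminate_child are hypothetical.

```python
import subprocess
import sys

# Hypothetical child: installs a SIGTERM handler that exits cleanly.
CHILD = """
import signal, sys, time

def on_term(signum, frame):
    sys.exit(0)          # graceful exit path

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
time.sleep(60)
"""

def terminate_child(timeout=10.0):
    """Ask the child to terminate and return its exit code."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
    )
    proc.stdout.readline()   # wait until the handler is installed
    proc.terminate()         # SIGTERM on POSIX, TerminateProcess() on Windows
    code = proc.wait(timeout=timeout)
    proc.stdout.close()
    return code

if __name__ == "__main__":
    print(terminate_child())
```

On POSIX the handler runs and the child exits with status 0. On Windows
the handler never runs, so the child gets no chance to clean up.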

It seems we have no other choice on Windows than to just close the fd
and hope for the best. What I'd like to be 100% certain about is that
not killing the client will not cause any problems on the server side.
If it does, we need to consider this too and maybe amend or extend your
solution. The comment above hints at server-side problems that not
killing the client might cause ("sshd goes into zombie state").

Since I cannot reproduce any of this and have very little experience
with Windows in general, I cannot make an informed decision.
Rather than me second-guessing what's best for Windows, I'd like
one of our Windows developers (e.g. Bert or Paul) to take a look at
this problem.

> I'm reaching the end of my limits trying to explain a problem from the
> standpoint of a Windows developer to this group who are clearly rooted
> in the Linux world.

Please stop saying such things. It is not the reason why digesting your
patch takes a long time. The problem is complicated and as it stands we
need more information to make an informed decision, that's all.

Repeatedly telling people in the open source community that they are
intolerant of your choice of OS is bad form and will cause the very
divide you are trying to avoid, but not because people disagree with
your choice of OS.

> I can see absolutely no reason to "terminate the tunnel agent" on
> Windows. At least not the tunnel agents PLink.exe, TortoisePLink.exe,
> and an ssh.exe I got from "somewhere". All exit gracefully when used
> (at least) from svn.exe, from TortoiseSVN, from SVN for Dreamweaver
> (DW GUI plugin), and from VisualSVN (a Visual Studio GUI plugin).
> Furthermore, they run as children of the SVN program or GUI plugin
> anyway, and if that exits, Windows takes out all children.

OK, I believe that, and that's quite an amount of client coverage
in your testing which is very good. Any problems at the server's end?

> As it is now, subversion 1.6.6 is unusable by me.

That's bad and we need it fixed ASAP.

> My patched version
> is running nicely, no problems with my tunnels.

That's a good indicator that you're going in the right direction.

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2411017
Received on 2009-10-24 13:14:51 CEST

In reply to: Bob Denny: "RE: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]"
Next in thread: Bob Denny: "RE: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]"
Reply: Bob Denny: "RE: Re: Re: Issue #2580 revisited: Windows unclean TCP close [SEVERE]"
