RE: Issue #2580 revisited: Windows unclean TCP close [SEVERE]
From: Bob Denny <rdenny_at_dc3.com>
Date: Thu, 22 Oct 2009 05:47:00 -0700 (PDT)
> I did read this entire thread ... the problem is the linux sshd/child
I'm well aware of these issues, particularly the TCP connection termination process. I may not have been clear enough before: instantly killing the tunnel process prevents the proper termination of the TCP connection (not to mention the proper termination of the SVN protocol!). The Ethernet traces I posted show the node where the tunnel USED to be running sending a TCP RESET packet. The TCP connection was IMPROPERLY terminated because the tunnel (the client end of the TCP connection) vaporized before the process could run to clean completion.
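For contrast, here is a minimal sketch of what an orderly close looks like from the client end of a TCP connection, written in Python rather than anything from the svn or ssh sources. The names close_gracefully and demo, and the loopback "server", are made up purely for illustration. The client half-closes (which sends a FIN), reads until the peer's own FIN arrives, and only then releases the socket, so no RESET is needed.

```python
import socket
import threading


def close_gracefully(sock, timeout=3.0):
    """Half-close, drain until the peer's EOF, then close the socket.

    Returns whatever bytes the peer sent after our half-close.
    """
    sock.settimeout(timeout)
    sock.shutdown(socket.SHUT_WR)   # tell the peer we are done sending (FIN)
    leftover = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:               # empty read: the peer has closed its side too
            break
        leftover += chunk
    sock.close()
    return leftover


def demo():
    """Run a one-shot loopback server (hypothetical) and close cleanly against it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)

    def serve_one():
        conn, _ = server.accept()
        with conn:
            while conn.recv(4096):  # read until the client's FIN shows up
                pass
            conn.sendall(b"bye")    # final word, then close our side

    worker = threading.Thread(target=serve_one)
    worker.start()
    client = socket.create_connection(server.getsockname())
    client.sendall(b"last request")
    leftover = close_gracefully(client)
    worker.join()
    server.close()
    return leftover
```

A process that is killed outright never gets to run anything like close_gracefully, which is the situation the traces show.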
> Under this situation, I would not advocate changing the windows
Do I understand you to be saying that it's sshd's error? How can it tell the difference between "dead partner" and "temporarily slow or dead net connection"? All that TCP stuff is running a couple of layers down under the sockets API that sshd uses. So sshd properly sits there for "a while", with its child svnserve also sitting there, both waiting for some word from the other end. Eventually (10+ MINUTES later) sshd DOES exit and terminate its child. But that exit is a "forget it, something's really wrong, I'm outta here" path, not a normal shutdown.
Instantly killing the tunnel (clearly the WRONG way to terminate a TCP connection!) sets off a chain of events, ALL OF WHICH ARE ERRORS. The svn protocol does not complete normally, nor does the ssh protocol, nor does the TCP protocol! Do we want EVERY SINGLE svn+ssh connection to complete through error paths? That's what's happening now, to EVERYONE who is using svn+ssh from a Windows client! I've already explained why we haven't gotten more complaints.
But I repeat: instantly killing the tunnel upon receipt of the last SVN data from a remote svnserve is NOT the way to shut down the connection between the two. It leaves the remote sshd wondering what happened and sitting there waiting for the next step in ITS protocol (SSH). It has no idea what is coming next (more data or what???). So it sits there waiting...
The recent change to use APR_KILL_ONCE is (imo) the right way to do it, but it works as intended only on Linux and other Unix-like systems. There, the real KILL_ONCE sends SIGHUP to the tunnel, waits 3 seconds to allow things to wind down properly, and then kills the tunnel only if it hasn't already exited normally. I suspect the authors didn't realize that APR on Windows converts KILL_ONCE to KILL_ALWAYS and instantly kills the child.
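To make the "signal, wait, then kill" sequence concrete, here is a rough Python sketch of the behavior as I described it above. It is not APR code and not the Subversion implementation; stop_child and grace_seconds are names invented for this example, and the Windows branch is only my illustration of what "instant kill" amounts to there.

```python
import signal
import subprocess
import sys


def stop_child(proc, grace_seconds=3.0):
    """Ask a child process to exit, and kill it only if it ignores the request.

    Returns "already exited", "exited", or "killed".
    """
    if proc.poll() is not None:
        return "already exited"
    if sys.platform == "win32":
        # No SIGHUP on Windows: terminate() maps to TerminateProcess,
        # i.e. an immediate kill with no chance to wind down.
        proc.terminate()
    else:
        # Polite request first, following the description in this mail.
        proc.send_signal(signal.SIGHUP)
    try:
        proc.wait(timeout=grace_seconds)   # give it time to shut down cleanly
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()                        # still running: force it
        proc.wait()                        # reap the child
        return "killed"
```

Typical use would be something like stop_child(tunnel_process) once the last svn data has been received, where tunnel_process is whatever subprocess.Popen object holds the tunnel.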
> Also, to quote your original post -- "sshd goes into bozo mode and leaves
sshd eventually does that. But only after a long time, for the reasons given above.
This is an archived mail posted to the Subversion Dev mailing list.