On 2/17/06, Greg Hudson <ghudson@mit.edu> wrote:
> On Fri, 2006-02-17 at 02:26 -0800, Justin Erenkrantz wrote:
> > As of r18500, we now open 4 connections to the server in order to do
> > the checkout. Part of the reason for this is that as we parse the
> > REPORT response, we can start to fetch the actual content. While we
> > can keep it to one connection (and have done so until just now), we'll
> > essentially be blocked for the length of the REPORT response
>
> Aren't two connections good enough for maximum efficiency, then? If
> you're pipelining, I don't see why three connections for the GETs is
> better than one.

No. The problem with HTTP connections is that they have a very finite
lifetime. The default configuration for Apache httpd only allows 100
requests on a connection before closing it (the MaxKeepAliveRequests
directive). Server admins can tune this, and one of the factors we'll
have to analyze is "what's the optimal number of requests on a
connection?" We currently issue two HTTP requests for each file we
fetch (a PROPFIND and a GET). That means an out-of-the-box
configuration only gets us 50 files per TCP connection. (Bumping the
limit up to 1000 greatly increases the pipeline depth, which also
increases memory usage on the client.)
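
To make the arithmetic concrete, here's a tiny standalone C sketch of
the per-connection request budget (not ra_serf code; the file count is
just a made-up example):

#include <stdio.h>

int main(void)
{
    const int max_requests_per_conn = 100;  /* httpd's default keep-alive limit */
    const int requests_per_file = 2;        /* one PROPFIND + one GET per file */
    const int files_to_fetch = 2000;        /* made-up checkout size */

    int files_per_conn = max_requests_per_conn / requests_per_file;    /* 50 */
    int conns_needed = (files_to_fetch + files_per_conn - 1) / files_per_conn;

    printf("%d files per connection => ~%d connection (re)opens for %d files\n",
           files_per_conn, conns_needed, files_to_fetch);
    return 0;
}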

Therefore, since connections can go away, what we ideally want to do
is stagger our requests across connections so that the network is
always active and our 'pipeline' stays full: this is why more than one
connection is needed for maximum efficiency.
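
As a rough illustration of the staggering idea, here's a toy C sketch.
To be clear, this is not the actual ra_serf scheduler;
queue_request() and reopen_connection() are hypothetical stand-ins
that just print what they would do:

#include <stdio.h>

#define NUM_CONNS 4
#define MAX_REQS_PER_CONN 100        /* httpd's default keep-alive limit */

static int reqs_on_conn[NUM_CONNS];

/* Stand-in for queueing a pipelined request on connection c. */
static void queue_request(int c, const char *method, const char *path)
{
    printf("conn %d: %s %s\n", c, method, path);
}

/* Stand-in for closing and reopening a worn-out connection. */
static void reopen_connection(int c)
{
    printf("conn %d: reopen (request limit reached)\n", c);
}

/* Round-robin a PROPFIND+GET pair onto the next connection, reopening
   it first if the pair would blow past the server's request limit. */
static void dispatch(const char *path)
{
    static int next = 0;
    int c = next;
    next = (next + 1) % NUM_CONNS;

    if (reqs_on_conn[c] + 2 > MAX_REQS_PER_CONN) {
        reopen_connection(c);
        reqs_on_conn[c] = 0;
    }
    queue_request(c, "PROPFIND", path);   /* the file's properties */
    queue_request(c, "GET", path);        /* the file's contents */
    reqs_on_conn[c] += 2;
}

int main(void)
{
    char path[64];
    int i;

    for (i = 0; i < 10; i++) {            /* pretend the REPORT named 10 files */
        sprintf(path, "/repos/trunk/file%d.c", i);
        dispatch(path);
    }
    return 0;
}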
> I'm actually unsure why you need more than one connection total. You
> can be sending GETs to the server as you're parsing the REPORT response,
> then start handling GET responses as soon as the REPORT response is
> done. The server should be sending data full-time, with only one
> round-trip delay at the beginning to send off the first GET request.

The problem is that the client sits around idle until the server
finishes the REPORT response. The REPORT takes a *long* time to
generate completely, as there is a lot of data to send back (the
entire list of files to fetch plus some properties for each file).
(AFAICT, REPORT is largely constrained by the backend repository's I/O
speed.) Also, mod_deflate and mod_dav_svn interact in such a way that
a REPORT response that doesn't include full-text gets buffered on the
server until the response is complete (this is why ra_serf won't
request gzip compression on the REPORT - yes, it's a nasty server-side
bug; I'd like to track it down, time permitting).

The real trick here is that there is no reason to wait for the REPORT
response to finish before we start acquiring the data we already know
we need. There's no real I/O happening on the client side, so it is
more efficient to open a new connection to the server and start
pulling down the files as soon as we know we need them.
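
Roughly, the shape of it looks like the toy sketch below - again, not
the real parser; next_report_path() is a made-up stand-in for pulling
the next item off the still-streaming REPORT response, and dispatch()
would hand the path to another connection as in the earlier sketch:

#include <stdio.h>

/* Pretend these are paths pulled incrementally off the REPORT stream. */
static const char *report_stream[] = {
    "/trunk/README", "/trunk/configure.in", "/trunk/server/main.c", NULL
};

/* Stand-in for the incremental REPORT parse: returns the next path as
   soon as it has been seen, without waiting for the whole response. */
static const char *next_report_path(void)
{
    static int i = 0;
    return report_stream[i] ? report_stream[i++] : NULL;
}

/* Stand-in for queueing the fetch on a second connection. */
static void dispatch(const char *path)
{
    printf("fetching %s while the REPORT is still streaming\n", path);
}

int main(void)
{
    const char *path;

    /* No waiting for the REPORT to complete: fetch as soon as we know. */
    while ((path = next_report_path()) != NULL)
        dispatch(path);
    return 0;
}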

To put numbers behind it: for a checkout of an httpd 2.2.0 tarball, it
takes roughly 7 seconds from the time the server starts writing the
REPORT response until it is completely finished (with the client just
saying 'yah, whatever' and ignoring it). If we maintain just one
connection, ra_serf will indeed queue up all of the fetch requests -
but the server can't begin to respond to them until it's done
generating and writing the REPORT.

Without multiple connections, a checkout with ra_serf would take about
35 seconds. With multiple connections, we can currently do that same
checkout in under 25 seconds. In some cases, acquiring files as soon
as we know we need them gives a speed advantage of a factor of 2 or
more.
HTH. -- justin