
Re: Status of ra_serf

From: Justin Erenkrantz <justin_at_erenkrantz.com>
Date: 2006-02-17 21:49:25 CET

On 2/17/06, Phillip Susi <psusi@cfl.rr.com> wrote:
> Ideally you want to open one connection and send the 100 requests to
> prime the pipeline. Then JUST as the first connection finishes with the
> last file, you want the next 100 requests to hit the server on the other
> connection. Having two connections each pulling 100 requests at the
> same time causes disk thrashing and network packet collisions, which you
> want to avoid.

Those really busy servers are likely a bit beefier than the client.
For example, svn.apache.org's repository is on a RAID-5'd
dual-processor machine sitting on a big fat pipe. If we were that
gun-shy, the client would sit idle when there's no need to, since
the server can keep up.

I'm not personally worried about opening up 4 concurrent connections.
Again, that has been the common strategy of every web browser for
years and is a decent precedent for us to follow. (In return, we're
doing much simpler requests.)
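
To make that concrete, here's roughly what "priming the pipeline" and
the browser-style fan-out look like. This is only a toy Python-level
sketch with made-up host and paths (ra_serf itself does this in C on
top of serf), but the shape is the same: queue every request up front,
split the batch across a few connections, and let the responses stream
back.

    import socket

    HOST = "svn.example.org"          # placeholder server
    PATHS = ["/repos/file%d" % i for i in range(100)]
    CONNECTIONS = 4                   # browser-style fan-out

    def prime(host, paths):
        """Send every GET up front so the responses stream back-to-back."""
        sock = socket.create_connection((host, 80))
        for path in paths:
            req = ("GET %s HTTP/1.1\r\n"
                   "Host: %s\r\n"
                   "Connection: keep-alive\r\n\r\n" % (path, host))
            sock.sendall(req.encode("ascii"))    # prime the pipeline
        return sock    # caller reads the responses back in order

    # Round-robin the work across the connection pool.
    socks = [prime(HOST, PATHS[i::CONNECTIONS]) for i in range(CONNECTIONS)]

A real client still has to parse each response (honoring Content-Length
or chunked framing) before it can trust the next one; that bookkeeping
is exactly what serf handles for us.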

> It would also be nice if the server would just raise that stupid cap so
> you don't have to waste time building a new TCP connection and getting
> the window open.

Right. I don't know what the 'best' value is yet. I think it's over
100 (absolutely over 10!), but I'm beginning to think 1000 might be
too high. One of the outputs of ra_serf may very well be, "Hey, you
server admins, tune your Apache config with these values."
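
For the archives: the cap in question is httpd's MaxKeepAliveRequests
directive, which defaults to 100 (0 means no limit). Raising it is a
one-liner in httpd.conf; the value below is purely illustrative, not a
recommendation:

    # MaxKeepAliveRequests defaults to 100; 0 disables the cap entirely.
    KeepAlive On
    MaxKeepAliveRequests 1000
    KeepAliveTimeout 15

Whether 1000 (or 0) is the right number is exactly the open question.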

At the very least, I'll ensure that svn.apache.org (and hopefully
svn.collab.net) are tuned for ra_serf. So, I'll be personally happy.
;-)

> > The problem is that the client is sitting around idle until the server
> > finishes the REPORT response. The REPORT takes a *long* time to
> > completely generate as there's a lot of data that it is sending back
> > (the entire list of files to fetch + some properties for each file).
> > (AFAICT, REPORT is largely constrained by the backend repository I/O
> > speed.) Also, mod_deflate and mod_dav_svn interact in such a way that
> > for REPORT requests that don't include full-text, the response is
> > buffered on the server until it is complete (this is why
> > ra_serf won't request gzip compression on the REPORT - yes, this is a
> > nasty server-side bug; I would like to track it down,
> > time-permitting).
> >
>
> I would ask why does it take so long to generate the report? Maybe that
> has some room for improvement. Given that the report request takes so
> long to generate, and during that time the connection is blocked, yes,
> it would be a good idea to open another connection to download files
> that are more readily available.

My short-term goal with ra_serf is not to focus on server-side
changes. I want ra_serf to work with any 1.0+ server. If we can
later add code to optimize the server, all the better. But, ra_serf
will have zero appeal if it only works against 'newer' Subversion
servers.

This also drives the performance goals: ra_serf should be competitive in
most cases with ra_dav or no one will be interested in using it. If
ra_serf is a couple of percentage points behind ra_dav but the
server load drops by half, then that might be enough to convince folks
to switch our default to ra_serf. But, an average performance penalty
of more than a few percentage points is going to be a showstopper.
Certainly, if we can beat ra_dav in most cases, it'll be really easy
to convince people to switch. ;-)

Now, ra_dav is using really sneaky tricks to be as fast as possible:
one large REPORT isn't doing too badly in certain setups. There's a
fairly high bar here. I do think ra_serf is doing better than ra_dav
was before it switched to an update-report; so serf is allowing us to
go back to the original checkout techniques we used way back when
(before 0.25.x, I think) and get within shouting distance of ra_dav's
current 'bastardized' techniques for checkouts.

> You shouldn't need more than one extra connection though, and ideally it
> would be great if you could ask the server to begin generating the
> report in the background and spool it in a temp file, then download
> other files you know you need, THEN fetch the report from the temp file.
> That way you wouldn't need the extra connection.

Async responses aren't part of HTTP. It'd be a part of Waka though. ;-)

> > The real trick here is that there is no reason to wait for the REPORT
> > response to finish to start acquiring the data we already know we
> > need. There's no real I/O happening on the client-side - therefore,
> > it is more efficient to open a new connection to the server and start
> > pulling down the files as soon as we know we need them.
> >
>
> Aye... but again, one extra connection should be sufficient since only
> the report connection is blocked; the other connection can be kept
> pipelined.

Again, the 'fetching' connection won't live very long in the default
case. That's why we have to multiplex connections. -- justin
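
P.S. For anyone following along, the shape of that multiplexing is
roughly the following. It's a toy sketch with an invented host and
paths, using threads only for brevity (serf actually does this in a
single non-blocking event loop). One connection streams the slow
REPORT response while a second one pulls down full-texts we already
know we need:

    import socket
    import threading

    HOST = "svn.example.org"          # placeholder

    def stream_report():
        """Connection 1: issue the REPORT and drain its slow response."""
        sock = socket.create_connection((HOST, 80))
        # The XML body of the update-report request is omitted here.
        sock.sendall(("REPORT /repos/!svn/vcc/default HTTP/1.1\r\n"
                      "Host: %s\r\nContent-Length: 0\r\n\r\n"
                      % HOST).encode("ascii"))
        while sock.recv(65536):
            pass                      # parse the report as it arrives
        sock.close()

    def fetch_files(paths):
        """Connection 2: pipeline GETs for files we already know we need."""
        sock = socket.create_connection((HOST, 80))
        for path in paths:
            sock.sendall(("GET %s HTTP/1.1\r\nHost: %s\r\n\r\n"
                          % (path, HOST)).encode("ascii"))
        # ... read the pipelined responses here, reopening the connection
        # whenever the server hangs up at its keep-alive cap ...
        sock.close()

    reporter = threading.Thread(target=stream_report)
    reporter.start()
    fetch_files(["/repos/known/path%d" % i for i in range(10)])
    reporter.join()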

