Optimizing properties for checkout/update with ra_serf

From: Justin Erenkrantz <justin_at_erenkrantz.com>
Date: Fri, 8 Jun 2012 10:28:31 +0200

To help kick off the hackathon discussions next week in Berlin, I'd
like to nudge the collective braintrust by writing down some thoughts
about optimizing the ra_serf protocol used when checking out or
updating a working copy. (I should arrive in Berlin late on Monday.)

As Philip has pointed out, there is still a gap between ra_serf and
ra_neon in what they send on the wire. To be clear, the
differences in what they send on the wire have little to do with
the intrinsic differences between serf and neon - they have far more to
do with the RA layers themselves. In theory, we could probably make
ra_serf and ra_neon send exactly the same bits on the wire if we
really wanted to do so.

Let me recap what it is that both RA layers do right now at a high level:

ra_neon
---
ra_neon issues a REPORT call to the server indicating what local
revision it has in the body of the request and it also sets the
"send_all" flag to true.
Then, in the response, mod_dav_svn will do a number of things:
 - The server will produce an XML document listing what changes need
to happen locally on the client.  (In this way, the response is specific
to the client's version indicated in the request body.  No amount of
HTTP caching can therefore help with the REPORT request.)
 - Inside the XML response, two things of note happen due to send_all
being set in the REPORT request:
   1. For file content, the server will inline the contents of the
file via svndiff - this is regardless of whether it is fulltext or
deltas - the behavior is the same.
   2. For every property on the associated file entry, the server
sends remove-prop or set-prop with the associated key/value.

Since the response is embedded inside of an XML document, we must
base64-encode the resulting svndiff and potentially the property
values (if they are not XML-safe).  Roughly speaking, base64 will add
about 33% space overhead (4 encoded bytes for every 3 input bytes).
If you are willing to do double-compression by using mod_deflate,
you'll mitigate some of that space overhead at the cost of CPU by
running zlib again over the base64-encoded svndiff (which uses zlib
anyway).
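
To make the base64 cost concrete, here is a rough Python sketch. It is
not Subversion code: random bytes stand in for file content, the first
zlib pass stands in for the compression inside svndiff, and the second
zlib pass stands in for mod_deflate. The function name and dictionary
keys are made up for illustration.

```python
import base64
import os
import zlib


def encoding_overhead(payload: bytes) -> dict:
    """Measure sizes along a compress -> base64 -> compress-again pipeline."""
    compressed = zlib.compress(payload)     # stand-in for svndiff's zlib step
    encoded = base64.b64encode(compressed)  # needed to embed bytes in XML
    recompressed = zlib.compress(encoded)   # stand-in for mod_deflate
    return {
        "compressed": len(compressed),
        "base64": len(encoded),
        "base64_then_deflate": len(recompressed),
    }


# Random bytes are incompressible, much like already-compressed content.
sizes = encoding_overhead(os.urandom(30000))
print("base64 / compressed:", sizes["base64"] / sizes["compressed"])
print("after second deflate:", sizes["base64_then_deflate"])
```

Plain base64 always emits 4 bytes for every 3 input bytes, so the first
ratio sits at about 4/3. The second zlib pass wins back part of that
expansion, which is the trade of CPU for space described above.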

ra_serf
---
ra_serf issues a REPORT call to the server indicating what local
revision it has in the body of the request - but, unlike ra_neon, it
does not set the send_all flag.
So, in response to the request, mod_dav_svn does:
 - Just like its response to ra_neon, the server will produce an XML
document listing what changes need to happen locally on the client.
 - However, it does not inline the content or the property values.
This is left for ra_serf to handle separately.
 - ra_serf will then parse the REPORT response - which is
substantially smaller than the one ra_neon has to parse.  It then opens
up to 4 HTTP connections to the server and does the following for each
file:
   - Issue a GET with the local version number in the request headers,
along with an indication that it would prefer an svndiff response.
mod_dav_svn can then send back an svndiff version if it chooses to...or,
it will send a plaintext version.  (N.B. This request is easily
cacheable by stock HTTP edge caches and proxies.)
   - Issue a PROPFIND for the file.

Optimizing ra_serf: content and properties
---
So, ra_serf will issue 2 HTTP requests for each file in
checkout/update.  In practice, the PROPFIND requests/responses are
very small.  If your httpd's logging infrastructure isn't tuned (e.g.,
it is logging to a slow disk), you may notice a
slowdown in synthetic checks due to the increased number of responses.
There is still additional traffic here from those extra HTTP
requests...and that's where we can further optimize things.

A thread a little while back indicated that ra_serf won't compress
things by default whereas ra_neon does.  This is due to the fact that
ra_serf doesn't do the compression inline - it relies upon mod_deflate
(a standard module in httpd) to do it, while ra_neon always does the
compression as well as the base64 encoding.  So, if you don't do any
tuning whatsoever, ra_neon will always send smaller responses than
ra_serf.  But, when compression is enabled, ra_serf currently sends
about 1.2x the data compared to ra_neon.  We can do better...

Recent optimizations in ra_serf (but *not* ra_neon) should attempt to
avoid a GET for a file that the local client already has in its pristine
database.  (Think about what happens on a checkout with
branches/tags/etc.; I've found this to be a pretty common occurrence
at least in my workflow!)  It would be much tougher (if not
impossible) for this optimization to occur in ra_neon as the
server-side does all of the logic and doesn't know what the client may
or may not have.  In actuality, due to the PROPFIND requirement,
ra_serf still issues a HEAD request - but we don't need the actual
content, which saves a huge amount of bandwidth in the update case if
the pristine already exists locally.  This is a huge win for ra_serf and I
don't think we'll be able to do much better - we *have* to get the
content somehow.  (As we eventually move to a global pristine store
with libsvn_wc, it'll get even better!)  And, these GET/HEADs are
easily cacheable, so distributing load on the server side should be
fairly straightforward by dropping in a dumb HTTP cache.  By skipping
the base64 overhead, spreading the work across multiple connections (to
take better advantage of beefier servers), and using pipelining, I
think overall we're doing about as well as we could here.
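
The pristine-store shortcut can be sketched in a few lines of Python.
This is only an illustration of the idea, not the ra_serf
implementation: the function name, the (path, checksum) pairs, and the
assumption that files are matched by SHA-1 digest are all hypothetical.

```python
import hashlib


def plan_fetches(report_entries, pristine_checksums):
    """Split files into those needing a GET and those already held locally.

    report_entries: iterable of (path, sha1_hex) pairs (hypothetical shape).
    pristine_checksums: set of SHA-1 hex digests in the local pristine store.
    """
    to_get, have_locally = [], []
    for path, sha1_hex in report_entries:
        if sha1_hex in pristine_checksums:
            have_locally.append(path)  # content can be reused, no body needed
        else:
            to_get.append(path)        # content must come over the wire
    return to_get, have_locally


# A tag that copies trunk shares its file content with trunk.
readme = hashlib.sha1(b"hello\n").hexdigest()
news = hashlib.sha1(b"what's new\n").hexdigest()
pristines = {readme}  # trunk/README was fetched earlier
entries = [("tags/1.0/README", readme), ("tags/1.0/NEWS", news)]

to_get, have_locally = plan_fetches(entries, pristines)
print(to_get, have_locally)
```

Only the client can make this decision, since only the client knows
what is in its own pristine store.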

Regarding the 2 HTTP requests, the key bit here is that there is no
way for us to get the properties and the content in one call.
However, there is a mechanism to optimize the PROPFIND that Greg has
suggested that does not require any server-side changes: have ra_serf
issue a PROPFIND call on each directory in the WC with a Depth: 1
header.  mod_dav_svn will then return, in one HTTP response, all of the
properties for all of the directory's children.  This reduces the number
of HTTP requests from roughly 2 per file down to 1 per file and 1 per
directory.
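
As a back-of-the-envelope sketch of that reduction, the Python below
counts requests under both schemes and describes what a Depth: 1
PROPFIND might look like. The function names and the file/directory
counts are invented for illustration, and using an allprop body is an
assumption; the properties ra_serf would actually ask for are not
specified here.

```python
def requests_per_file_propfind(num_files):
    """Current scheme: one GET plus one PROPFIND for every file."""
    return 2 * num_files


def requests_depth1_propfind(num_files, num_dirs):
    """Proposed scheme: one GET per file plus one PROPFIND per directory."""
    return num_files + num_dirs


def build_depth1_propfind(dir_url_path):
    """Describe a PROPFIND covering a directory and its immediate children."""
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>'
    )
    headers = {"Depth": "1", "Content-Type": "text/xml; charset=utf-8"}
    return "PROPFIND", dir_url_path, headers, body


# Illustrative numbers only: 10,000 files spread over 500 directories.
current = requests_per_file_propfind(10_000)
proposed = requests_depth1_propfind(10_000, 500)
print(current, proposed)

method, path, headers, body = build_depth1_propfind("/repos/trunk/lib/")
print(method, path, headers["Depth"])
```

The saving grows with the number of files per directory; a tree with
one file per directory would see no reduction at all.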

The catch on the client side...and the reason I didn't implement this
way back in 2006...is that the PROPFIND/Depth: 1 introduces some
complexity on the client due to the editor API.  Given the way that our
editor drives work with libsvn_wc (including the Ev2 rewrite), we will
almost certainly need to keep the properties per-directory around until
we are ready to process the file contents.  In the worst case, I think it
is possible that we could have *all* of the properties for *all* of
the files before we even start to process the files themselves.  We
might be able to play tricks by spooling properties to disk or
delaying the PROPFIND until we start fetching the files (if even
needed)...but, ugh.  I also wonder whether we should only do the
Depth: 1 when we add a directory - or whether we could aggregate it when
multiple files are updated in the tree (but not all of them!).
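
One way to picture the bookkeeping this implies is a small cache that
holds the properties from a Depth: 1 response until each file is
actually processed. The Python below is a minimal sketch under stated
assumptions: DirPropCache and its methods are hypothetical, the sample
response is hand-written, and the urn:example:props namespace is a
placeholder rather than what mod_dav_svn sends.

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"


def parse_multistatus(xml_text):
    """Return {href: {property_tag: value}} from a 207 Multi-Status body."""
    props_by_href = {}
    for response in ET.fromstring(xml_text).findall(DAV + "response"):
        href = response.findtext(DAV + "href")
        props = props_by_href.setdefault(href, {})
        for propstat in response.findall(DAV + "propstat"):
            status = propstat.findtext(DAV + "status", default="")
            prop = propstat.find(DAV + "prop")
            if " 200 " not in status or prop is None:
                continue  # skip properties the server could not return
            for child in prop:
                props[child.tag] = child.text or ""
    return props_by_href


class DirPropCache:
    """Hold per-directory PROPFIND results until each file is processed."""

    def __init__(self):
        self._pending = {}

    def store(self, multistatus_xml):
        self._pending.update(parse_multistatus(multistatus_xml))

    def take(self, href):
        # Pop the entry so memory is released once the file is handled.
        return self._pending.pop(href, {})

    def __len__(self):
        return len(self._pending)


SAMPLE = """<D:multistatus xmlns:D="DAV:" xmlns:X="urn:example:props">
  <D:response>
    <D:href>/repos/trunk/a.txt</D:href>
    <D:propstat>
      <D:prop><X:eol-style>native</X:eol-style></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>/repos/trunk/b.txt</D:href>
    <D:propstat>
      <D:prop><X:mime-type>text/plain</X:mime-type></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

cache = DirPropCache()
cache.store(SAMPLE)
held_before = len(cache)
a_props = cache.take("/repos/trunk/a.txt")
held_after = len(cache)
print(held_before, a_props, held_after)
```

The cache only shrinks as files are processed, so if every PROPFIND
response arrives before the first file is handled, it holds the
properties for the whole tree at once. That is the worst case described
above, and spooling to disk would replace the in-memory dictionary.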

Anyway, that's where my head is at.  I just don't have a clear picture
of what the ra_serf side will look like with PROPFIND/Depth: 1 yet.

I hope this helps frame the conversation a bit.  For those of you who
will be in Berlin, see you soon - and for those of you not in Berlin,
we'll be sure to write up whatever we discuss and throw it on the list.

-- justin

Received on 2012-06-08 10:29:09 CEST
