Re: svn:// protocol efficiency and Cascade

From: Matt Craighead <matt.craighead_at_conifersystems.com>
Date: Wed, 12 Nov 2008 17:03:12 -0600

A quick followup.

There is a problem with attempting to pipeline the get-file requests -- I
would have to assume that I will get a trivial_auth_request and not a real
auth request after each get_file. Otherwise the server will confuse my next
get-file request with an auth-response. This isn't the end of the world,
and I would assume that in most cases no further auth would be needed after
a successful get_dir (especially as we're only talking about read-only
operations), but you can never know for sure when the server might decide to
ask for authentication. I'd probably have to detect this and fall back to a
slow (non-pipelined) mode for any directory where this happened.
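
A minimal sketch of that detect-and-fall-back logic, assuming the replies have already been parsed into (mechanism list, payload) pairs and that an empty mechanism list is what marks a trivial auth request. The names Reply, RealAuthRequested, check_pipelined_replies, fetch_directory, pipelined_fetch and serial_fetch are hypothetical, not part of any Subversion library:

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical parsed reply to one get-file request: the mechanism list of
# the auth request that preceded it, plus the file payload.
Reply = Tuple[Sequence[str], bytes]
Fetch = Callable[[Sequence[str]], list]


class RealAuthRequested(Exception):
    """A pipelined reply carried a non-trivial auth request."""


def check_pipelined_replies(replies: Sequence[Reply]) -> List[bytes]:
    """Return the payloads, or raise at the first real auth request.

    Everything after a real auth request is untrustworthy, because the
    server may have read the following requests as auth responses.
    """
    payloads = []
    for index, (mechs, payload) in enumerate(replies):
        if mechs:  # assumption: non-empty mechanism list = real auth request
            raise RealAuthRequested(f"request {index} needs authentication")
        payloads.append(payload)
    return payloads


def fetch_directory(paths: Sequence[str], pipelined_fetch: Fetch,
                    serial_fetch: Fetch):
    """Try the pipelined path first; redo the whole directory serially
    if the server asked for real authentication anywhere in the batch."""
    try:
        return check_pipelined_replies(pipelined_fetch(paths)), "pipelined"
    except RealAuthRequested:
        return serial_fetch(paths), "serial"
```

In a real client the serial retry would probably also need a fresh connection, since the stream is out of sync once the server has consumed a request as an auth response.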

The other big problem with pipelining in general is deadlock avoidance --
the two sides can't both end up blocked in a send() call (with all socket
buffers full) before they've reached their next recv() call -- but that's
solvable.
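
One common way to solve it, sketched here under the assumption of a single non-blocking socket driven by select(): only send when the socket is writable, and drain incoming data whenever it is readable, so the client never sits in send() while replies pile up. The function name pump and its parameters are made up for illustration:

```python
import select
import socket


def pump(sock: socket.socket, outgoing: bytes, expected: int) -> bytes:
    """Send `outgoing` while concurrently reading replies.

    Returns once everything has been sent and at least `expected`
    bytes have been received.
    """
    sock.setblocking(False)
    view = memoryview(outgoing)
    sent = 0
    received = bytearray()
    while sent < len(view) or len(received) < expected:
        # Only ask about writability while there is something left to send.
        want_write = [sock] if sent < len(view) else []
        readable, writable, _ = select.select([sock], want_write, [])
        if readable:
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("peer closed the connection early")
            received += chunk
        if writable:
            # send() may accept only part of the data; track the offset.
            sent += sock.send(view[sent:])
    return bytes(received)
```

A client that instead called sendall() on the whole batch before its first recv() could hang as soon as the batch plus the replies exceeded the socket buffers on both ends.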

I also saw that according to
http://svn.collab.net/repos/svn/trunk/subversion/libsvn_ra_svn/protocol the
"checksum" field in the get-file reply is optional, not mandatory. It would
be unfortunate if a server ever didn't return the MD5 here -- having to
query the file contents just to determine whether I already have the file
cached is pretty slow. I would hope that all SVN servers always return the
checksum? The svnserve source code seems to suggest that I will always get
a checksum in practice.
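
A defensive client could treat the checksum as optional and only pay for the download when it is missing. This is a sketch with hypothetical names (effective_md5, is_cached, fetch_contents), not code from Cascade or Subversion:

```python
import hashlib
from typing import Callable, Optional, Set


def effective_md5(reported: Optional[str],
                  fetch_contents: Callable[[], bytes]) -> str:
    """Use the server-reported MD5 if present; otherwise download the
    file and hash it locally (the slow path described above)."""
    if reported:
        return reported.lower()
    return hashlib.md5(fetch_contents()).hexdigest()


def is_cached(cached_md5s: Set[str], reported: Optional[str],
              fetch_contents: Callable[[], bytes]) -> bool:
    """True if a file with this content hash is already in the cache."""
    return effective_md5(reported, fetch_contents) in cached_md5s
```

fetch_contents is only called when the server omitted the checksum, so the common case stays cheap.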

I also saw a recent email thread on a related topic:
http://svn.haxx.se/dev/archive-2008-10/0111.shtml
Having been forced to write my own WebDAV client for Subversion for Cascade
to replace libsvn_ra_dav, I absolutely agree that the current protocol is
far from optimal -- I have no particular love for WebDAV. At the same time, a simple
marshalling of the svn_ra_* APIs over HTTP would be less efficient for
Cascade than the current WebDAV-based protocol. I would hope that any new
protocol that is designed can return more information in get_dir than the
existing svn_ra_get_dir* APIs presently return. (My understanding is that
HTTP pipelining is far from robust/usable in practice.)

On Wed, Nov 12, 2008 at 1:58 PM, Matt Craighead <
matt.craighead_at_conifersystems.com> wrote:

> I was just looking into what it would take to add support for the svn://
> protocol to Cascade. Currently Cascade File System and Cascade Proxy only
> support http:// and https:// repository access.
>
>
> Background: Cascade does not make use of libsvn_ra_*. I wrote my own
> DAV client because I was unable to get acceptable performance out of
> libsvn_ra_dav -- too many round trips to the repository server. (A
> secondary problem was how to ship a binary distribution of an application
> using libsvn on Linux -- it did not appear to be possible to ship a single
> binary that would run on all Linux distros, even assuming that libsvn* were
> installed. For example, I had to know at link time whether the target
> system was using apr0 or apr1.)
>
> There were several reasons why libsvn_ra_dav generated too many network
> round trips, but the biggest one by far was that it didn't allow me to get
> the MD5 and other properties of each directory entry as part of
> svn_ra_get_dir(2). The MD5s are actually sent across the wire as part of
> the PROPFIND request, but the client library would throw them away rather
> than allowing me to see them. The same goes for properties such as
> svn:executable. The only way to get these properties through the API was to
> do a get_file request for each file in the directory. Worse still, to get
> the MD5 (even if I didn't want the file contents -- just to check if I
> already had the file cached), I had to obtain the entire file over the wire
> and MD5 it myself.
>
> Looking at the svn:// protocol, it appears that I may have the same
> problems all over again. Ignoring any limitations of libsvn_ra_svn
> and looking straight at the protocol -- I can send a get-dir request, but
> this provides very limited information about each directory entry. It
> appears that in order to implement my caching subsystem's internal
> equivalent of svn_ra_get_dir(), I may have to send quite a few queries for
> each directory entry. For example, the only way to get the MD5 appears to
> be to send a get-file request.
>
> I suppose I can attempt to pipeline my requests, so that I'm not paying the
> cost of one or more network round trips per directory entry, but has there
> been any consideration given to adding more information to the get-dir query
> in the svn:// protocol?
>
>
> Ideally, Cascade would like to be able to obtain all of the following
> information for a directory (at a particular revision number) in a single
> query that is always just a single network round trip:
> - list of directory entries
> - range of revisions (start and end revision) where these directory entries
>   are valid
> - for each directory entry:
>   - mode flags: is it a directory? is it a symlink? is it executable? is
>     it text (based on eol-style)?
>   - timestamp of last commit for last modify time
>   - if it's not a directory: size of file in bytes
>   - if it's not a directory: MD5 or SHA1 of file contents
>   - range of revisions (start and end revision) where the above information
>     for this directory entry is valid
>
> Note that even in the DAV protocol I am not able to do all of the above as
> well as I'd like. For example, I can get "start revisions" where the data
> obtained is valid via the DAV:version-name property, but I cannot get "end
> revisions", the revision of the next commit to that file minus one. Also,
> for the revision range where the list of directory entries is valid, I have
> to do a separate REPORT query to get the last commit to that tree... not
> only an extra query, but this results in a very conservative revision range
> because the probability that that commit modified the list of directory
> entries (by adding or deleting a file or directory *in that directory*, not
> just in a subdirectory) is pretty low.
>
>
> Any thoughts on enhancing the svn:// protocol to return more information in
> get-dir? Any thoughts on how to efficiently implement the query I just
> described using the existing svn:// protocol? Also, any thoughts on
> getting better revision ranges for the DAV protocol? The larger the
> revision ranges I can get, the more effective my caching will be -- the
> fewer unnecessary queries to the SVN server.
>
> --
> Matt Craighead
> Founder/CEO, Conifer Systems LLC
> http://www.conifersystems.com
> 512-772-1834
>
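
The per-directory query described in the quoted message could be modelled roughly as below. This is only an illustrative sketch: the type and field names (RevisionRange, DirEntry, DirListing, ListingCache) are invented here and do not correspond to any existing svn:// or DAV structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RevisionRange:
    start: int  # first revision at which the data is valid
    end: int    # last revision at which the data is known to be valid

    def covers(self, revision: int) -> bool:
        return self.start <= revision <= self.end


@dataclass(frozen=True)
class DirEntry:
    name: str
    is_dir: bool
    is_symlink: bool
    is_executable: bool
    is_text: bool             # derived from eol-style
    mtime: str                # timestamp of the last commit
    size: Optional[int]       # None for directories
    checksum: Optional[str]   # MD5 or SHA1 hex digest, None for directories
    valid: RevisionRange      # range over which this entry's info holds


@dataclass(frozen=True)
class DirListing:
    entries: List[DirEntry]
    valid: RevisionRange      # range over which the entry list holds


class ListingCache:
    """Serve a listing for any revision inside a cached validity range."""

    def __init__(self) -> None:
        self._by_path: Dict[str, List[DirListing]] = {}

    def put(self, path: str, listing: DirListing) -> None:
        self._by_path.setdefault(path, []).append(listing)

    def get(self, path: str, revision: int) -> Optional[DirListing]:
        for listing in self._by_path.get(path, []):
            if listing.valid.covers(revision):
                return listing
        return None  # cache miss: the server has to be queried
```

The wider the validity ranges the server can report, the more lookups get() can answer without a round trip, which is the caching benefit the message argues for.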

-- 
Matt Craighead
Founder/CEO, Conifer Systems LLC
http://www.conifersystems.com
512-772-1834

Received on 2008-11-13 01:10:39 CET

In reply to: Matt Craighead: "svn:// protocol efficiency and Cascade"
