[PROPOSAL] Addition of rsync algorithm for saving bandwidth in 'svn takeover'

From: Jonathan Gilbert <o2w9gs702_at_sneakemail.com>
Date: 2005-07-11 23:27:24 CEST

I recently posted to the list a patch that adds a new feature to the 'svn'
command-line client. This feature, called 'svn takeover', permits an
existing working copy, possibly with changes, to be made into an SVN
working copy for a given revision from the server. The idea is to only add
files to the WC that are not already there.

In its implementation, it was nothing more than a hack ontop of 'svn
checkout'. The 'checkout' command does all the work of building the '.svn'
admin subdirs for a working copy, and then places the files there through
an 'update'. With my patch, 'svn takeover' performs a checkout that simply
ignores files that already exist. It re-downloads a pristine 'text-base'
from the server, using the same "difference from revision 0" mechanism that
'svn checkout' uses. In fact, the patch actually uses all of the existing
code for 'svn checkout'; it merely sets a boolean indicating whether a
takeover is in progress (as opposed to a checkout), and if it is, it does
not emit the "copy from .tmp/text-base into WC" log command inside
update-editor.c:install_file().

This patch works properly, and does everything it claims to do: It turns a
source tree into a properly-formed SVN working copy. However, after
bringing my patch to the table, some people mentioned that a feature like
this had been requested in the past, but in the context of reducing wasted
network bandwidth. As my patch does not prevent the transfer of existing
files (indeed, it does not even check whether the files are the same as the
repository's copies), it completely fails at this aspect. I did some
research, and it seemed at first as though it might be possible to handle
this as a new kind of 'update' command in the client, but it turned out in
the end that the current 'libsvn_ra' interface (and thus SVN protocol) is
not actually capable of retrieving the checksums for files without
transferring them (or, more accurately, transferring the last diff to the
file in the repository -- something which also cannot be determined
remotely, since in general a file will *not* have changed between the
previous revision and head).

So, it seems a much more involved change is required. A new command needs
to be added to the libsvn_ra interface, and subsequently to all the
providers implementing the interface, to transmit enough information about
a file to the server to identify it with high probability, and, on the
server side of things, to avoid transmitting content that the client
already has.

I initially contemplated computing the checksum of each file in the working
copy; the new command would then send these checksums to the server for
comparison with the requested revision. However, if a large file has been
modified slightly, this completely fails to make use of existing data in
the file. In a post in my original thread about the takeover feature, Andre
Poenitz mentioned rsync. I read up on how rsync works, and the algorithm
seemed to be the perfect balance between complexity and functionality (that
is, it's not too complex to implement -- relatively easy, actually -- and
it could potentially make a big difference for people using 'svn takeover'
on very large subtrees).

In the course of implementing the rsync algorithm, however, there are
several questions that will need to be addressed:

1. The rsync algorithm works by treating files as sequences of
non-overlapping blocks, and taking the checksum of each block in order to
send to the server. The size of the block must be determined at some point.
It would be relatively easy to make it a configurable option, but then
limits are required.

The server must cache all checksums of all blocks in a file in order to
generate the edit script to compose the new file, so if a file is too large
and the block size too small, there will be too many blocks and the server
will end up allocating a very large amount of memory. As an example, with a
2 KB block size against a 20 GB file, the server will be using on the order
of 200 MB of RAM during the processing of that file! Some people would
think this is inappropriate, so I think some sort of limiting mechanism is
definitely in order. Perhaps the block size could be limited based on the
size of the file. For example, if a 20 GB file could not use blocks smaller
than, say, 64 KB, the memory usage on the server would be less than 10 MB
to handle that file.

However, since 'svn_stream_t's are not seekable, both the client and the
server effectively need to store one entire block in memory, in order to
compute the rolling checksum; if a well-endowed user specified a 1 GB block
size, it could constitute a denial of service attack on the server! So, an
upper bound is also a necessity.

Files larger than 20 GB do not come around often, and even at 20 GB, with a
block size of 64 KB, the amount of memory needed to cache all of the
checksums is on the order of 8 to 9 megabytes. As such, I think 64 KB is a
decent upper limit on the block size; few people would complain about an
apr_pcalloc(65536) :-)

Anyone have any other ideas in this area?

2. The 'rsync' implementation of the rsync algorithm operates on a per-file
basis. Thus, it is not able to use blocks from one file to construct
another file. This is in part related to the memory usage problem in point
#1, but for medium-size repositories, if network bandwidth is really at a
premium (which is what has been expressed to me in the 'svn takeover'
thread), if there is any chance that a file could make use a block from
another file in the WC, it should be taken. Basically, what I am inquiring
about here is whether the implementation should support having the server
consider all blocks from all files in the WC simultaneously, or whether it
should work strictly on a per-file basis.

There is another aspect to this as well: In order to put the 'text-base'
into a consistent state, files in the repository that do not yet exist in
the source tree being converted into a WC must also be transferred. If all
blocks across all client files are being considered, there is some chance
that content from other files could be used when building the edit script
to produce such non-existent files. Even something as innocuous as a
copyright notice at the top of each source code file could take advantage
of this, if it could fill an entire block, and this becomes even more
useful if files are being renamed in the repository.

So, there are definitely pros and cons to this. The cons can be summed up
basically as "20 GB repository", as opposed to the 20 GB single file
considered in point #1. Probably, using 64 KB blocks (as suggested in point
#1 for extremely large files) will prevent any efficiency for source code
files that have been changed, because changes to source code files are
typically on a much smaller scale than 64 KB.

Perhaps this should be made into a command-line option which bails out if
it would require the server to use more than a certain threshold of memory?
The server would, of course, also need some form of flood control,
otherwise people could simply hack their clients to create denial of
service tools... I don't think such flood control would be too difficult,
though; the server could simply count the checksums as they came in and
cancel the operation if the number exceeded a threshold, configurable in
the repository's 'conf' subdir.

3. To perform well, 'rsync' generally likes to operate in a pipelined
manner. That is, one thread in the client computes and transfers checksums
continuously to the server, and another thread continuously reads & applies
edit scripts from the server. The server then reads checksums from the
first client thread until it has enough for one file, and then stops to
process that file and send the resulting edit script to the second client
thread. (This of course only applies if the algorithm is being used in
per-file rather than entire-WC mode.)

This is a concern when the WebDAV RA provider is being used. As I
understand it, WebDAV, being an HTTP-based protocol, is not capable of
arbitrary bidirectional communication. The server cannot send to the client
until the client has requested something from the server, and the client
cannot send data to the server while the server is still in the process of
sending data back to the client. Does this apply to the way SVN uses WebDAV
through Apache?

If not, then there is no problem, but if it does, then some means of
permitting the server to send data back to the client as it is receiving
the checksums must be devised if pipelining is to be maintained. Perhaps
the server could rendezvous a second connection from the client (and thus a
second RA session) to be used for server->client communication?

I'm sure there are other things that people will want to discuss. One of
them might be: "Is it worthwhile to do this at all?" I would like to assert
that it is worthwhile because while many people won't care about it, it
will be important for the people who wanted the 'svn takeover' command in
its bandwidth-saving form in the first place. If the command is going to be
added, it will be done for them, so whether this functionality is important
enough to include should be determined based on the behaviour those people
want, and not whether it will be important for those people who will never
use the feature. :-)

Comments, please.

Jonathan Gilbert

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Mon Jul 11 23:27:23 2005

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]