
[RFC] Using LZ4 compression by default

From: Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
Date: Wed, 2 Aug 2017 21:59:16 +0300

Hi all,

With the recently added support for LZ4 compression (r1801940 et al),
we now have the option of using it by default both for the on-disk data
and over the wire.

For those who haven't been following this topic, here's a quick recap:

 - Currently, our default compression algorithm is zlib.

 - LZ4 compresses and decompresses much faster than zlib and includes a
   heuristic to skip incompressible data.

 - LZ4 has a worse compression ratio than zlib-5 (our current default).

   Its ratio is roughly comparable to that of zlib-1, although zlib-1 is
   still slightly better. See https://quixdb.github.io/squash-benchmark/
   for additional information on this (the codecs to compare are "lz4 - 7"
   and "zlib - 1").

 - Only the new filesystem format 8 allows using LZ4 for the on-disk data.

 - Using LZ4 over the wire requires both endpoints to advertise that they
   know how to deal with the new svndiff2 format that allows LZ4 compression.
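
As a rough way to see the ratio/speed trade-off between zlib levels on
your own data, here is a minimal sketch using Python's standard zlib
module. It is not part of the patch, the sample input is hypothetical,
and it does not cover LZ4, which is not in the Python standard library:

```python
import time
import zlib


def measure(data, level):
    """Return (compression ratio, seconds spent) for zlib at the given level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Sanity check: compression must be lossless.
    assert zlib.decompress(packed) == data
    return len(data) / len(packed), elapsed


# Hypothetical sample input: repetitive text, so it compresses well.
# Replace it with real file contents to get meaningful numbers.
sample = b"Subversion stores file contents as svndiff deltas.\n" * 20000

for level in (1, 5):
    ratio, seconds = measure(sample, level)
    print(f"zlib-{level}: ratio {ratio:.1f}x, {seconds * 1000:.2f} ms")
```

The absolute numbers depend heavily on the input and the machine, so
this is only useful for comparing levels against each other.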

There are two questions to consider:

 (1) Do we want to start using LZ4 compression over the wire by default?
     If yes, should this default apply to all installations, or only to
     those installations where it makes sense?

 (2) Do we want to switch to LZ4 compression for the on-disk data
     by default?

I propose the following approach. Please note that for the wire format
part, it only considers the http:// protocol, but we can optionally adjust
svn:// later:

 (A) For the HTTP wire format, we start using LZ4 compression by default,
     but only over local networks.

     The reasoning behind this is that we probably wouldn't want to use
     LZ4 compression unconditionally, as that would be a regression over
     WAN, where a better compression ratio usually matters more than
     compression speed. Another point is that even for local networks we
     cannot disable compression altogether: on slow 10 or even 100 Mbps
     LANs, where throughput is limited by the network, fast compression
     can still be better than no compression. This is where LZ4 comes to
     the rescue by offering a reasonable compression ratio together with
     fast compression.

     This approach is currently implemented with the http-compression=auto
     client-side configuration option (r1803899), which is the new default.
     While the HTTP client is generally in charge of choosing the
     compression algorithm, there's also a way to override its preference
     on the server. If mod_dav_svn's SVNCompressionLevel directive is set
     to 1, the server overrides the client's preference and still sends
     svndiff2 / LZ4 data if the client can accept it.

 (B) For the on-disk data, we start using LZ4 compression by default
     (in format 8 repositories).

     The reasoning behind this is that currently, zlib compression is a
     hotspot that can limit the performance of both read and write
     operations on the repository. It also affects how well Subversion
     works when dealing with large and, possibly, incompressible files
     (and I tend to think that it's a fairly important use case).

     Switching to a faster compression algorithm, one that is also used by
     various other file system implementations, should visibly improve the
     performance of such operations. Please note that this change is a
     trade-off between compression ratio and speed. Repositories using
     LZ4 compression would require a bit more disk space. The additional
     space is proportional to the difference between the compression
     ratios of LZ4 and zlib-5, which can be roughly estimated at around
     30-35% for compressible binary and text files, although that may
     vary depending on the actual data.
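
The LAN-versus-WAN reasoning in (A) can be sketched with a very simple
model. This is an illustration only: the codec ratios and speeds below
are hypothetical placeholders rather than measurements, the function
names are made up for this example, and the model ignores decompression,
latency and server-side costs. It assumes compression and sending
overlap, so the slower of the two stages determines the total time:

```python
def transfer_seconds(size_mb, link_mbps, ratio=1.0, compress_mb_per_s=None):
    """Estimate seconds needed to send size_mb megabytes over a link.

    ratio is uncompressed size / compressed size. A compress_mb_per_s of
    None means the data is sent uncompressed.
    """
    wire_seconds = (size_mb / ratio) * 8 / link_mbps
    if compress_mb_per_s is None:
        return wire_seconds
    cpu_seconds = size_mb / compress_mb_per_s
    # Pipelined model: the slower stage is the bottleneck.
    return max(wire_seconds, cpu_seconds)


# Hypothetical codec figures, chosen only to show the shape of the trade-off.
CODECS = {
    "none": dict(ratio=1.0, compress_mb_per_s=None),
    "fast (LZ4-like)": dict(ratio=2.0, compress_mb_per_s=400.0),
    "strong (zlib-5-like)": dict(ratio=2.7, compress_mb_per_s=20.0),
}


def best_codec(size_mb, link_mbps):
    """Return the name of the codec with the lowest estimated time."""
    return min(
        CODECS,
        key=lambda name: transfer_seconds(size_mb, link_mbps, **CODECS[name]),
    )


for link_mbps in (10, 100, 1000):
    print(f"{link_mbps} Mbps -> {best_codec(1000, link_mbps)}")
```

With these made-up figures the stronger codec wins on the slowest link
and the faster codec wins once the network stops being the bottleneck,
which is the behavior that http-compression=auto tries to approximate.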

To illustrate how these changes will affect the speed of some of the
operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
environment takes 18 seconds instead of 63 seconds.
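
For reference, the arithmetic behind those two timings works out as
follows (a small sketch, assuming "2 GB" means 2 GiB; the variable names
are only for illustration):

```python
size_mib = 2 * 1024          # assumption: "2 GB" is taken as 2048 MiB
before_s, after_s = 63, 18   # timings quoted above

speedup = before_s / after_s
print(f"speedup: {speedup:.1f}x")
print(f"throughput before: {size_mib / before_s:.0f} MiB/s")
print(f"throughput after:  {size_mib / after_s:.0f} MiB/s")
```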

How does this sound? Are there any objections or suggestions to the
proposed approach?

(Please note that most of the implementation is already in place, and to
 get the described behavior we would just have to change a couple of default
 settings.)

Thanks,
Evgeny Kotkov
Received on 2017-08-02 20:59:41 CEST

This is an archived mail posted to the Subversion Dev mailing list.