
Re: [RFC] Using LZ4 compression by default

From: Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
Date: Fri, 18 Aug 2017 22:58:13 +0300

Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com> writes:

> (B) For the on-disk data, we start using LZ4 compression by default
> (in format 8 repositories).
> The reasoning behind this is that currently, zlib compression is a
> hotspot that can limit the performance of both read and write
> operations on the repository. It also affects how well Subversion
> works when dealing with large and, possibly, incompressible files
> (and I tend to think that it's a fairly important use case).
> Switching to a faster compression algorithm that is also used by
> various file system implementations should improve the performance of
> such operations in a visible way. Please note that this change is a
> trade-off between the compression ratio and speed of the operations.
> The repositories using LZ4 compression would require a bit more disk
> space. The amount of the required additional space is proportional
> to the difference between the compression ratio of LZ4 and zlib-5,
> which can be roughly estimated as around 30-35% for compressible
> binary and text files, although that may vary depending on the
> actual data.
> To illustrate how these changes will affect the speed of some of the
> operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> environment takes 18 seconds instead of 63 seconds.
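
A rough way to see why zlib becomes a hotspot on incompressible data is to
time it on random bytes. The sketch below is only an illustration: it uses
Python's standard zlib module at level 5 rather than Subversion's own code
path, and the helper name `measure` and the buffer sizes are made up for the
example. Timings depend on the machine and are not comparable to the
figures in this mail.

```python
import os
import time
import zlib


def measure(data, level=5):
    """Compress `data` with zlib and return (compressed/original ratio, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


if __name__ == "__main__":
    size = 8 * 1024 * 1024  # 8 MiB per sample; an arbitrary choice
    line = b"Subversion stores file contents in the repository.\n"
    samples = {
        "compressible": line * (size // len(line)),
        "incompressible": os.urandom(size),  # random bytes do not compress
    }
    for name, data in samples.items():
        ratio, seconds = measure(data)
        print(f"{name}: ratio {ratio:.1%}, {seconds:.3f} s")
```

On random input the ratio stays at about 100 %, so the time zlib spends on
it buys no disk space at all.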

Here are some additional zlib-5 vs. LZ4 benchmarks to consider:

  (All tests were performed on an SSD using the file:// protocol.
   The results should be interpreted as "before is zlib-5, after is LZ4".
   Also, the results over http:// are somewhat similar in terms of the
   improvement factor and are omitted for brevity. "Import time" is
   for "svn import", "Export time" is for "svnbench null-export".)

 - One compressible file, 1.17 GB:

   Import time: 40.79 s → 11.97 s (3.4 x faster)
   Export time: 6.30 s → 3.13 s (2.0 x faster)
   Compression ratio: 31.8 % → 43.8 % (384 MB → 529 MB on disk)

 - One incompressible file, 833 MB:

   Import time: 32.16 s → 8.22 s (3.9 x faster)
   Export time: 2.71 s → 2.06 s (1.3 x faster)
   Compression ratio: 91.9 % → 93.3 % (766 MB → 778 MB on disk)

 - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:

   Import time: 17.83 s → 10.36 s (1.7 x faster)
   Export time: 1.62 s → 1.15 s (1.4 x faster)
   Compression ratio: 35.2 % → 48.8 % (75 MB → 104 MB on disk)

 - Multiple binary files, 1.68 GB, 25 files:

   Import time: 55.10 s → 15.84 s (3.5 x faster)
   Export time: 8.56 s → 4.34 s (2.0 x faster)
   Compression ratio: 38.4 % → 46.9 % (662 MB → 807 MB on disk)
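
The speedup factors and the extra disk usage can be recomputed from the
figures above. The following is a small sketch of that arithmetic; the
numbers are copied from the results, while the names `RESULTS` and
`summarize` are made up for the example.

```python
# Each row: label, import time before/after (s), export time before/after (s),
# on-disk size before/after (MB). "Before" is zlib-5, "after" is LZ4.
RESULTS = [
    ("one compressible file", 40.79, 11.97, 6.30, 3.13, 384, 529),
    ("one incompressible file", 32.16, 8.22, 2.71, 2.06, 766, 778),
    ("multiple source code files", 17.83, 10.36, 1.62, 1.15, 75, 104),
    ("multiple binary files", 55.10, 15.84, 8.56, 4.34, 662, 807),
]


def summarize(row):
    """Return speedup factors and the relative growth of the on-disk size."""
    label, imp_old, imp_new, exp_old, exp_new, mb_old, mb_new = row
    return {
        "label": label,
        "import_speedup": round(imp_old / imp_new, 1),
        "export_speedup": round(exp_old / exp_new, 1),
        "extra_disk_pct": round(100.0 * (mb_new - mb_old) / mb_old, 1),
    }


if __name__ == "__main__":
    for row in RESULTS:
        s = summarize(row)
        print(
            f"{s['label']}: import {s['import_speedup']}x, "
            f"export {s['export_speedup']}x, "
            f"+{s['extra_disk_pct']} % on disk"
        )
```

For these four data sets the on-disk size grows by roughly 2 % to 39 %,
with the incompressible file at the low end.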

Revisiting the whole topic of the default compression algorithm for
the repositories, I think that we have the following options to choose from:

 (1) Make LZ4 compression optional in format 8 repositories, and still use
     zlib-5 compression by default.

    With this approach, users would have to have "compression=lz4" in
    fsfs.conf to use it. Personally, I would expect the number of such users
    to be quite low, because they would have to both upgrade the repository
    to fsfs format 8 and use non-default fsfs.conf settings.

    This option means that we'd keep our existing performance characteristics
    with read and write operations being limited by the compression speed
    of zlib-5 (which isn't exactly fast) for most of the users. It also means
    that the expected size and the compression ratio of the repository data
    would remain unchanged.

 (2) Compress with LZ4 by default in all (new and upgraded) format 8
     repositories.

    This approach means that a much larger share of our users will have
    their data compressed with LZ4, and will get the visible read and write
    performance improvement. It also means that the on-disk data will
    compress less well than with zlib-5, and the projected size of the
    repositories will increase accordingly.

    One additional point to consider here is that such a change may be
    going a bit against the policy of adding a new optional feature and
    switching the default in the next minor release.

 (3) Compress with LZ4 by default, but only in new format 8 repositories.

    This option is similar to (2), but with a more limited scope where
    LZ4 compression is only used for the new repositories created with
    Subversion 1.10 binaries.
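
The three options differ only in which compression a repository gets when
nothing is configured explicitly. The sketch below spells that out. It is
not Subversion code: the function name and parameters are hypothetical, and
it assumes that an explicit fsfs.conf setting always wins and that
repositories older than format 8 stay on zlib-5.

```python
def default_compression(option, repo_format, newly_created, configured=None):
    """Return the compression a repository would end up using.

    option        -- 1, 2 or 3, matching the numbered options above
    repo_format   -- fsfs format number of the repository
    newly_created -- True for a new repository, False for an upgraded one
    configured    -- explicit fsfs.conf value ("lz4", "zlib-5"), or None
    """
    if repo_format < 8:
        # Assumption: LZ4 is only available starting with format 8.
        return "zlib-5"
    if configured is not None:
        # Assumption: an explicit setting overrides any default.
        return configured
    if option == 1:
        return "zlib-5"  # LZ4 stays opt-in
    if option == 2:
        return "lz4"  # new and upgraded repositories alike
    if option == 3:
        return "lz4" if newly_created else "zlib-5"
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    for option in (1, 2, 3):
        for newly_created in (True, False):
            kind = "new" if newly_created else "upgraded"
            result = default_compression(option, 8, newly_created)
            print(f"option ({option}), {kind} format 8 repository: {result}")
```

Only the upgraded format 8 repository without an explicit setting
distinguishes (2) from (3).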

Personally, I find the significant speed improvement for both read and write
operations from LZ4 compression quite important, and I think that the actual
reduction in the compression ratio is acceptable, considering the gained
benefits. I also think that the risks associated with switching the default
on-disk format are low in this particular case, considering that the LZ4
library is stable. (It has been available for a long time and is used by
projects like the Linux kernel and ZFS.)

In other words, I think that we would benefit from using LZ4 compression
by default.

Of options (2) and (3), which both make LZ4 the new default compression
algorithm, I think that option (2) is better. The reasoning here is that
using LZ4 compression would improve the performance even for existing
repositories by making new commits faster and by speeding up read
operations for newly committed files. Apart from this, option (3)
still needs to be implemented and is probably going to have a couple of
related challenges, which can otherwise be avoided.

With all that in mind, I propose that we do (2). Any objections?

Evgeny Kotkov
Received on 2017-08-18 21:58:49 CEST

This is an archived mail posted to the Subversion Dev mailing list.