Re: [RFC] Using LZ4 compression by default

From: Jacek Materna <jacek_at_assembla.com>
Date: Fri, 18 Aug 2017 17:06:48 -0500

(2) is the best path for USERS of Subversion. More toggles add risk and
complexity. Improvements should "just work" out of the box, unless
there is some technical hurdle. A 25% increase in disk usage is nothing
today for even a fraction more speed on operations happening thousands of
times a day on a typical team. However, this is more than a fraction!

Great quantitative metrics, Evgeny.

On Fri, Aug 18, 2017 at 2:58 PM, Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
wrote:

> Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com> writes:
>
> > (B) For the on-disk data, we start using LZ4 compression by default
> > (in format 8 repositories).
> >
> > The reasoning behind this is that currently, zlib compression is a
> > hotspot that can limit the performance of both read and write
> > operations on the repository. It also affects how well Subversion
> > works when dealing with large and, possibly, incompressible files
> > (and I tend to think that it's a fairly important use case).
> >
> > Switching to a faster compression algorithm that is also used by
> > various other file system implementations should improve the
> > performance of such operations in a visible way. Please note that
> > this change is a trade-off between the compression ratio and the
> > speed of the operations.
> > The repositories using LZ4 compression would require a bit more disk
> > space. The amount of the required additional space is proportional
> > to the difference between the compression ratio of LZ4 and zlib-5,
> > which can be roughly estimated as around 30-35% for compressible
> > binary and text files, although that may vary depending on the
> > actual data.
> >
> > To illustrate how these changes will affect the speed of some of the
> > operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> > environment takes 18 seconds instead of 63 seconds.
>
> Here are some additional zlib-5 vs. LZ4 benchmarks to consider:
>
> (All tests were performed on the SSD drive using the file:// protocol.
> The results should be interpreted as "before is zlib-5, after is LZ4".
> Also, the results over http:// are somewhat similar in terms of the
> improvement factor and are omitted for brevity. "Import time" is
> for "svn import", "Export time" is for "svnbench null-export".)
>
> - One compressible file, 1.17 GB:
>
> Import time: 40.79 s → 11.97 s (3.4 x faster)
> Export time: 6.30 s → 3.13 s (2.0 x faster)
> Compression ratio: 31.8 % → 43.8 % (384 MB → 529 MB on disk)
>
> - One incompressible file, 833 MB:
>
> Import time: 32.16 s → 8.22 s (3.9 x faster)
> Export time: 2.71 s → 2.06 s (1.3 x faster)
> Compression ratio: 91.9 % → 93.3 % (766 MB → 778 MB on disk)
>
> - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:
>
> Import time: 17.83 s → 10.36 s (1.7 x faster)
> Export time: 1.62 s → 1.15 s (1.4 x faster)
> Compression ratio: 35.2 % → 48.8 % (75 MB → 104 MB on disk)
>
> - Multiple binary files, 1.68 GB, 25 files:
>
> Import time: 55.10 s → 15.84 s (3.5 x faster)
> Export time: 8.56 s → 4.34 s (2.0 x faster)
> Compression ratio: 38.4 % → 46.9 % (662 MB → 807 MB on disk)
>
>
> Reiterating over the whole topic of the default compression algorithm for
> the repositories, I think that we have the following options to choose
> from:
>
> (1) Make LZ4 compression optional in format 8 repositories, and still use
> zlib-5 compression by default.
>
> With this approach, users would have to have "compression=lz4" in
> fsfs.conf to use it. Personally, I would expect the number of such users
> to be quite low, because they would have to both upgrade the repository
> to fsfs format 8 and use non-default fsfs.conf settings.
>
> This option means that we'd keep our existing performance characteristics
> with read and write operations being limited by the compression speed
> of zlib-5 (which isn't exactly fast) for most of the users. It also means
> that the expected size and the compression ratio of the repository data
> would remain unchanged.
>
> (2) Compress with LZ4 by default in all (new and upgraded) format 8
> repositories.
>
> This approach means that a much larger share of our users will have
> their data compressed with LZ4, and will get the visible read and write
> performance improvement. It also means that the compression ratio of
> the on-disk data will be lower than with zlib-5, and the projected
> size of the repositories will increase accordingly.
>
> One additional point to consider here is that such a change may go
> a bit against the policy of adding a new optional feature and
> switching the default in the next minor release.
>
> (3) Compress with LZ4 by default, but only in new format 8 repositories.
>
> This option is similar to (2), but with a more limited scope where
> LZ4 compression is only used for the new repositories created with
> Subversion 1.10 binaries.
>
>
> Personally, I find the significant speed improvement for both read and
> write operations from LZ4 compression quite important, and I think that
> the actual reduction in the compression ratio is acceptable, considering
> the gained benefits. I also think that the risks associated with switching
> the default on-disk format are low in this particular case, considering
> that the LZ4 library is stable. (It has been available for a long time and
> is used by projects like the Linux kernel and ZFS.)
>
> In other words, I think that we would benefit from using LZ4 compression
> by default.
>
> Of options (2) and (3), which both make LZ4 the new default compression
> algorithm, I think that option (2) is better. The reasoning here is that
> using LZ4 compression would improve the performance even for existing
> repositories by making new commits faster and by speeding up read
> operations for newly committed files. Apart from this, option (3)
> needs implementation and is probably going to have a couple of related
> challenges, which can otherwise be avoided.
>
> With all that in mind, I propose that we do (2). Any objections?
>
>
> Thanks,
> Evgeny Kotkov
>
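
As a rough sanity check on the trade-off discussed above, here is a minimal
Python sketch that recomputes the import speedup and the on-disk growth from
the benchmark figures quoted in Evgeny's message. The numbers are copied from
that message; the function and variable names are illustrative only and are
not part of Subversion.

```python
# Benchmark figures copied from the quoted message (before = zlib-5, after = LZ4).
# Tuple layout: (label, import_before_s, import_after_s, disk_before_mb, disk_after_mb)
BENCHMARKS = [
    ("one compressible file", 40.79, 11.97, 384, 529),
    ("one incompressible file", 32.16, 8.22, 766, 778),
    ("source code files", 17.83, 10.36, 75, 104),
    ("multiple binary files", 55.10, 15.84, 662, 807),
]


def speedup(before_s, after_s):
    """Return how many times faster the operation became."""
    return before_s / after_s


def disk_growth_percent(before_mb, after_mb):
    """Return the relative increase in on-disk size, in percent."""
    return (after_mb - before_mb) / before_mb * 100.0


for label, imp_before, imp_after, disk_before, disk_after in BENCHMARKS:
    print(
        f"{label}: import {speedup(imp_before, imp_after):.1f}x faster, "
        f"disk usage +{disk_growth_percent(disk_before, disk_after):.1f}%"
    )
```

On these four data sets the disk growth ranges from about 2% (incompressible
data) to about 39% (source code), so the actual cost depends heavily on what
the repository stores.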

-- 
Jacek Materna
Chief Technology Officer
Assembla
+1 210 410 7661
Received on 2017-08-19 00:07:36 CEST

This is an archived mail posted to the Subversion Dev mailing list.