
Re: Compressed Pristines (Custom Format?)

From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Fri, 23 Mar 2012 02:22:44 -0700 (PDT)

----- Original Message -----

> From: Philip Martin <philip.martin_at_wandisco.com>
> To: Erik Huelsmann <ehuels_at_gmail.com>
> Cc: Ashod Nakashian <ashodnakashian_at_yahoo.com>; Daniel Shahaf <danielsh_at_elego.de>; "dev_at_subversion.apache.org" <dev_at_subversion.apache.org>; Ivan Zhakov <ivan_at_visualsvn.com>
> Sent: Thursday, March 22, 2012 4:00 PM
> Subject: Re: Compressed Pristines (Design Doc)
> Erik Huelsmann <ehuels_at_gmail.com> writes:
>> As the others, I'm surprised we seem to be going with a custom file
> format.
>> You claim source files are generally small in size and hence only small
>> benefits can be had from compressing them, if at all, due to the fact that
>> they would be of sub-block size already.
> I was surprised too, so I looked at GCC where a trunk checkout has
> 75,000 files of various types:

The main concern at this point seems to be that developing a new custom file format isn't warranted. This is a fair concern, and one we should reach consensus on before moving forward. I've tried to combine all the issues raised on the topic so far into this (admittedly long) mail.

The design's fundamental assumption is that source files are typically smaller than a filesystem block (after compression). Erik and Philip have run tests on SVN and GCC respectively, with different results. I don't have hard figures because it's nearly impossible to define "a typical project". However, let me point out the rationale behind this argument:

1. We care less about file sizes *before* compression than *after* it (with better compression, more files fall below the sub-block size, which depends on the FS configuration).
2. Compressed file-size is highly dependent on the compression algorithm (we should use the best compression that meets our reqs).
3. Combining files *before* compression in many cases yields better compression, especially if multiple tags/branches are involved.[1]
4. Projects with "small" files suffer more from wasted sub-block space, especially when multiple tags/branches are checked out (typical for active maintainers).
5. Reducing the number of files on disk can improve overall disk performance (for very large projects).[2]
6. Flexibility, extensibility and opaqueness.
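To make the sub-block waste in point #4 concrete, here is a minimal sketch that estimates the slack space lost to partially-filled final blocks when each pristine is stored as its own file. The 4096-byte block size and the sample file sizes are assumptions for illustration, not measurements:

```python
BLOCK_SIZE = 4096  # assumed filesystem allocation unit

def slack_bytes(file_sizes, block=BLOCK_SIZE):
    """Bytes lost to the partially-filled last block of each file
    when every file is stored individually on disk."""
    return sum((-size) % block for size in file_sizes if size > 0)

# Hypothetical pristine sizes (bytes) after per-file gz compression:
sizes = [900, 1800, 300, 5000, 4100]
print(slack_bytes(sizes))  # 16572 wasted bytes for ~12 KB of payload
```

With many small compressed files, the slack can rival or exceed the payload itself, which is exactly what packing several files into one container avoids.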

The gist of the above is that if we choose better-than-gz compression and combine the files both *before* and *after* compression, we'll see much more significant savings than we get now with gz on individual files. This can be seen with tar.bz2, for example, where the result is not unlike what we can achieve with the custom file format (although bz2 is probably too slow for our needs).
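The combine-before-compressing effect is easy to demonstrate. The sketch below builds two hypothetical near-identical pristines (think: the same file on two branches) whose content is not internally repetitive, then compares compressing them separately versus concatenated. zlib stands in for whatever algorithm would actually be chosen:

```python
import hashlib
import zlib

# Two near-identical "pristines": hex digest lines, so neither file is
# internally repetitive, but the two files share almost all their content.
chunks = [hashlib.sha256(bytes([i])).hexdigest().encode() for i in range(100)]
base = b"\n".join(chunks)                                  # "trunk" copy
variant = base.replace(chunks[50], b"one locally edited line")  # "branch" copy

individual = len(zlib.compress(base, 9)) + len(zlib.compress(variant, 9))
combined = len(zlib.compress(base + variant, 9))

# Compressing the concatenation lets the second file be encoded largely
# as back-references into the first, so it is markedly smaller.
print(individual, combined)
```

This is the same reason a tar.gz of a tree beats gzipping each file separately, and a larger-window algorithm (bz2, etc.) widens the gap further.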

Now, does this justify the cost of a new file format? That's a reasonable question. My take is that the proposed file format is simple enough, and the gains (especially on large projects with many branches/tags checked out) should justify the overhead.

I also like the fact that the pristine files are opaque and don't encourage the user to mess with them. Markus raised this point as "debuggability". I don't see debuggability as a user requirement (it is justifiably an SVN dev/maintainer requirement) and I see no reason to add it as one. On the contrary, there are many reasons to suspect a user is doing something gravely wrong when they mess with the pristine files.

Another point raised by Markus is to store "common pristine" files and reuse them to reduce network traffic. This is not part of this feature: we cannot determine what's "common", and we shouldn't optimize the repository protocol in the WC client. However, if the user checks out multiple branches/tags in the same WC tree, they will get savings, and if we combine the pristine files *before* compression, the savings should be significant, as most files change little between branches (if a file is unchanged, it has the same hash, and only one copy exists in the pristine store anyway). This latter advantage is lost with per-file compression (point #3 above).
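The "same hash, one copy" behaviour mentioned above can be sketched in a few lines. This is a toy model of a content-addressed store, not the actual pristine-store code; SHA-1 is used purely for illustration:

```python
import hashlib

# Toy content-addressed pristine store: identical file content on any
# number of branches hashes to the same key, so only one copy is stored.
store = {}

def add_pristine(content: bytes) -> str:
    key = hashlib.sha1(content).hexdigest()
    store.setdefault(key, content)  # a second identical copy costs nothing
    return key

trunk_key = add_pristine(b"unchanged file contents\n")
branch_key = add_pristine(b"unchanged file contents\n")
assert trunk_key == branch_key and len(store) == 1
```

Combining *before* compression extends this from byte-identical files to merely similar ones, which per-file compression cannot exploit.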

SQLite may be used, as Branko has suggested. I'm not opposed to this. It has its shortcomings (not exploiting inter-file similarities, as point #3 does, for one), but it can be considered a compromise between individual gz files and the custom pack file. The basic idea would be to store "small" files (after compression) in wc.db and keep links to compressed files on disk for "large" files. My main concern is that frequent updates to small files will leave the SQLite file with heavy internal fragmentation (holes within the file that are unused but still consume disk space). The solution is to "vacuum" wc.db, but vacuuming locks the database and its cost depends on the file's size, among other factors, so we can't do it routinely. The custom pack file would take care of this by splitting the files such that avoiding fragmentation is feasible and routine (see the numbers in the doc for full defragmentation of typical pack store files). Of course, we could use multiple SQLite database files instead of the custom format and achieve the same goal, but my suspicion is that using SQLite for small files will, in the long run, give results similar to individual gz files (due to overhead, fragmentation, etc.), so personally I feel it's probably not worth it.
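For clarity, here is a minimal sketch of the small-in-db / large-on-disk hybrid described above. The table layout, the 4096-byte threshold, and the `pristines` directory name are all assumptions for illustration, not the actual wc.db schema:

```python
import os
import sqlite3
import zlib

THRESHOLD = 4096  # assumed cutoff between "small" (in-db) and "large" (on-disk)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pristine (
    checksum TEXT PRIMARY KEY,
    data     BLOB,      -- compressed bytes, for small files
    path     TEXT)""")  # external file path, for large files

def store(checksum, raw, directory="pristines"):
    packed = zlib.compress(raw, 9)
    if len(packed) <= THRESHOLD:
        # Small after compression: keep the blob inside the database.
        conn.execute("INSERT OR IGNORE INTO pristine VALUES (?, ?, NULL)",
                     (checksum, packed))
    else:
        # Large: write a compressed file on disk and record its location.
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, checksum)
        with open(path, "wb") as f:
            f.write(packed)
        conn.execute("INSERT OR IGNORE INTO pristine VALUES (?, NULL, ?)",
                     (checksum, path))
```

The fragmentation concern comes from the small-blob rows: deleting and rewriting them leaves free pages inside the database file that only `VACUUM` reclaims.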

A question about concurrency in the custom format was raised (apologies for missing the name). Short answer: yes! Files will be grouped, and each group can be streamed to a compressor instance on a separate thread and written to disk in parallel (thanks to the split pack files). The index file or wc.db may be a bottleneck, but the slow part is compressing, not updating entries.
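The parallelism described above can be sketched as follows. This is a toy model, not the proposed implementation: bz2 stands in for whatever compressor is chosen, and the grouping is hard-coded:

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def compress_group(files):
    # One group of pristines is concatenated and compressed as a unit;
    # each group becomes an independent shard of the pack store.
    return bz2.compress(b"".join(files), 9)

groups = [
    [b"file-a contents\n", b"file-b contents\n"],
    [b"file-c contents\n", b"file-d contents\n"],
]

# Groups compress on separate threads and could be written to separate
# pack files in parallel; only the small index update needs serializing.
with ThreadPoolExecutor() as pool:
    shards = list(pool.map(compress_group, groups))
```

Because each shard is self-contained, the per-group work scales with available cores, and contention is limited to the cheap index writes.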

So, again, the justification for the custom format is the set of points above. With the right compression algorithm, the custom format should give us a lot of flexibility and will yield disk savings that are significant to small and large WCs alike.

If I missed some point, please bring it up again; I don't mean to ignore anything. Thanks for reading this far.

[1] Compare the total disk space of individual gz files with a tar.gz. Try the same with tar.bz2, which has a much larger window and yields significantly better compression. Boost 1.49 shows a ~22% gain between tar.gz and tar.bz2 (http://sourceforge.net/projects/boost/files/boost/1.49.0/).
[2] My painful experience is with WebKit: a full checkout of 7 branches on NTFS (2+ million total files on the partition). Prior to SVN 1.7, the number of files/folders was even worse.

Received on 2012-03-23 10:23:27 CET

This is an archived mail posted to the Subversion Dev mailing list.