Re: Compressed Pristines (Custom Format?)
From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Fri, 23 Mar 2012 02:22:44 -0700 (PDT)
----- Original Message -----
> From: Philip Martin <philip.martin_at_wandisco.com>
The current major concern seems to be that developing a new custom file-format isn't warranted. This is a fair concern and one we should reach a consensus on before moving forward. I've tried to combine all issues raised on the topic so far in this (admittedly long) mail.
The design's fundamental assumption is that source files are typically smaller than a typical FS block (after compression). Eric and Philip have run tests on SVN and GCC respectively with different results. I do not have hard figures because it's nearly impossible to define "a typical project". However, let me point out the rationale behind this argument:
1. We don't care as much about file sizes before compression as we do *after* compression (with better compression, more files should end up smaller than an FS block, the size of which depends on the FS config).
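To make the sub-block argument concrete, here is a minimal sketch of the arithmetic. It assumes a 4096-byte block and a filesystem that rounds every file up to whole blocks (no tail packing or inline data); the function name and the sample sizes are made up for illustration.

```python
def on_disk_size(size_bytes, block_size=4096):
    """Bytes consumed on disk, assuming each file is rounded up to whole blocks.

    block_size=4096 is an assumption; the real value depends on the FS config.
    """
    if size_bytes == 0:
        return 0
    blocks = -(-size_bytes // block_size)  # ceiling division on integers
    return blocks * block_size


# Hypothetical compressed sizes of three small pristine files.
sizes = [1200, 900, 3000]

stored_individually = sum(on_disk_size(s) for s in sizes)
stored_packed = on_disk_size(sum(sizes))
```

With these made-up sizes, three separate files take three blocks, while the same bytes packed into one file take two. The smaller the files get after compression, the larger the share of disk space that is lost to this rounding.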
The gist of the above is that if we choose a better-than-gz compression and combine the files both *before* and *after* compression, we'll have much more significant results than what we have now just with gz on individual files. This can be noticed using tar.bz2, for example, where the result is not unlike what we can achieve with the custom file format (although bz2 is probably too slow for our needs).
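A rough way to see the effect of combining files *before* compression is sketched below. This is not the proposed pack format, only a comparison of per-file zlib against one compressed stream over the concatenated contents; the helper names are hypothetical, and zlib/bz2 stand in for whatever algorithm is eventually chosen.

```python
import bz2
import zlib


def per_file_total(blobs):
    """Total size when every file is compressed on its own (gz-style)."""
    return sum(len(zlib.compress(b, 9)) for b in blobs)


def combined_total(blobs, compress):
    """Size when the files are concatenated first and compressed as one stream."""
    return len(compress(b"".join(blobs)))


def compare(blobs):
    """Return the three totals so they can be compared side by side."""
    return {
        "per-file zlib": per_file_total(blobs),
        "combined zlib": combined_total(blobs, lambda d: zlib.compress(d, 9)),
        "combined bz2": combined_total(blobs, bz2.compress),
    }
```

Feeding `compare()` the contents of a real working copy (for instance, every file under a checkout read as bytes) gives the per-project numbers that are otherwise hard to state in general. How much the combined stream wins depends on how similar the files are and on the compressor's window size.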
Now, does this justify the cost of a new file-format? That's reasonable to argue. My take is that the proposed file-format is simple enough and the gains (especially on large projects with many branches/tags checked out) should justify this overhead.
I also like the fact that the pristine files are opaque and don't encourage the user to mess with them. Markus raised this point as "debuggability". I don't see "debuggability" as a user requirement (it is justifiably an SVN dev/maintainer requirement) and I see no reason to add it as one. On the contrary, there are many reasons to suspect the user is doing something gravely wrong when they mess with the pristine files.
Another point raised by Markus is to store "common pristine" files and reuse them to reduce network traffic. This is not part of this feature, we have no way to determine what's "common", and we shouldn't optimize the repository protocol in the WC client. However, if the user checks out multiple branches/tags in the same WC tree, they will get savings, and if we combine the pristine files *before* compression, the savings should be significant, as most files change little between branches (if nothing is changed, they'll have the same hash and only one copy will exist in the pristine store anyway). This latter advantage will be lost on per-file compression (point #3 above).
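The "same hash, one copy" behaviour can be sketched as a content-addressed store. This is only an in-memory illustration: the class name is hypothetical and SHA-1 is picked here as an example hash, not as a statement about what the design doc specifies.

```python
import hashlib


class PristineStore:
    """Toy content-addressed store: identical content is kept exactly once."""

    def __init__(self):
        self._blobs = {}  # hex digest -> file content

    def add(self, content: bytes) -> str:
        """Store content under its hash and return the key."""
        key = hashlib.sha1(content).hexdigest()
        # setdefault leaves an existing entry alone, so a duplicate costs nothing.
        self._blobs.setdefault(key, content)
        return key

    def __len__(self):
        return len(self._blobs)
```

Adding the same file from trunk and from an unchanged branch yields one entry. A file that differs by a single line gets a new entry of its own, and that near-duplicate is exactly where compressing the combined files should help and per-file compression cannot.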
Sqlite may be used as Branko has suggested. I'm not opposed to this. It has its shortcomings (not exploiting inter-file similarities, which point #3 makes, for one) but it can be considered a compromise between individual gz files and the custom pack file. The basic idea would be to store "small" files (after compression) in wc.db and have a "link" to compressed files on disk for "large" files. My main concern is that frequent updates to small files will leave the sqlite file with heavy external fragmentation (unused holes within the file that still consume disk space). The solution is to "vacuum" wc.db, but the cost of that depends on the file's size, and vacuuming locks the database while it runs, among other factors, so we can't do it routinely. The custom pack file would take care of this by splitting the files such that avoiding external fragmentation would be feasible and routine (see the numbers in the doc for full defragmentation on typical pack store files). Of course we can use multiple
A point regarding concurrency with the custom format was raised (apologies for missing the name). Short answer: Yes! Files will be grouped and each group may be streamed to a compressor instance on a separate thread and written to disk in parallel (thanks to splitting the pack files). The index file or wc.db may be a bottleneck, but the slow part is compressing, not updating entries.
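The threading idea can be sketched as below, with one compressor instance per group and the groups processed in parallel. This is an illustration only: the function names and worker count are assumptions, zlib stands in for the eventual algorithm, and writing the packs to disk and updating the index are left out.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_group(files):
    """Stream every file of one group through a single compressor instance."""
    compressor = zlib.compressobj(6)
    chunks = [compressor.compress(data) for data in files]
    chunks.append(compressor.flush())
    return b"".join(chunks)


def compress_groups(groups, max_workers=4):
    """Compress each group on its own worker thread; results keep group order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress_group, groups))
```

Each returned pack is an independent stream, so decompressing it gives back that group's files concatenated. Because the groups share no compressor state, nothing needs locking until the results are recorded in the index or wc.db.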
So, again, the justification for the custom format is the set of points mentioned above. With the right compression algorithm, the custom format should give us a lot of flexibility and will result in disk savings that are significant to small and large WCs alike.
If I missed any points, please bring them up again; I don't mean to ignore them. Thanks for reading thus far.
 Compare total disk space of individual gz files and a tar.gz. Try the same with tar.bz2 which has a much larger window and yields significantly better compression. Boost 1.49 shows ~22% gain between tar.gz and tar.bz2 (http://sourceforge.net/projects/boost/files/boost/1.49.0/).
This is an archived mail posted to the Subversion Dev mailing list.