
Re: Compressed Pristines (Custom Format?)

From: Ben Smith-Mannschott <bsmith.occs_at_gmail.com>
Date: Fri, 23 Mar 2012 10:59:32 +0100

On Fri, Mar 23, 2012 at 10:22, Ashod Nakashian <ashodnakashian_at_yahoo.com> wrote:
> The design's fundamental assumption is that source files are typically smaller than a typical FS block (after compression). Eric and Philip have run tests on SVN and GCC respectively, with different results. I do not have hard figures because it's nearly impossible to define "a typical project". However, let me point out the rationale behind this argument:
> 1. We don't care as much about file sizes before compression as we do *after* compression (with better compression, more files should fall into the sub-block size, which depends on the FS config, after compression).
> 2. Compressed file-size is highly dependent on the compression algorithm (we should use the best compression that meets our reqs).
> 3. Combining files *before* compression in many cases yields better compression, especially if multiple tags/branches are involved.[1]
> 4. Projects that have "small" files will suffer more by the wasted
> sub-block space especially when multiple tags/branches are checked out
> (typical for active maintainers).
> 5. Reducing the number of files on disk can improve overall disk performance (for very large projects).[2]
> 6. Flexibility, extensibility and opaqueness.
> The gist of the above is that if we choose a better-than-gz compression and combine the files both *before* and *after* compression, we'll have much more significant results than what we have now just with gz on individual files. This can be noticed using tar.bz2, for example, where the result is not unlike what we can achieve with the custom file format (although bz2 is probably too slow for our needs).
> Now, does this justify the cost of a new file format? That's reasonable to argue. My take is that the proposed file format is simple enough and the gains (especially on large projects with many branches/tags checked out) should justify this overhead.

It seems to me that these are the same issues that drove the design of
git's repository format, where individual items are first stored
individually gzipped (loose) and then periodically combined into
efficient pack files to save additional space and file system

Why invent a new format, when one exists that could serve?
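
To make the loose-versus-packed difference concrete, here is a rough
sketch in Python. It is not git's or Subversion's actual storage code:
the 4096-byte block size, the use of zlib, and the function names are
all assumptions chosen for illustration. It compares the on-disk
footprint of files compressed one by one (each rounded up to whole
blocks) against the same files concatenated and compressed once.

```python
import zlib

BLOCK_SIZE = 4096  # assumed file-system block size in bytes


def blocks_used(size, block_size=BLOCK_SIZE):
    """Bytes occupied on disk once `size` is rounded up to whole blocks."""
    return -(-size // block_size) * block_size


def loose_footprint(files, block_size=BLOCK_SIZE):
    """Compress every file on its own; each result occupies whole blocks."""
    return sum(blocks_used(len(zlib.compress(data)), block_size)
               for data in files)


def packed_footprint(files, block_size=BLOCK_SIZE):
    """Concatenate all files, compress once, and store a single file."""
    return blocks_used(len(zlib.compress(b"".join(files))), block_size)
```

With many small, similar files, the loose layout pays one full block
per file no matter how well each file compresses, while the packed
layout pays the rounding cost only once and lets the compressor see
redundancy across files.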

Engineering problems are all about trade-offs, so I'd be remiss in not
mentioning the obvious down-side of using git's approach: you get
either (1) unpredictable runtimes of individual svn commands, because
they may occasionally choose to initiate a repacking, or (2) a manual
'repack' command that the user has to run when convenient.
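
The two options can be sketched as a single toy store with an optional
threshold. Everything here is hypothetical (the class name
PristineStore, the auto_repack_threshold parameter, zlib as the pack
compressor); it only illustrates where the repacking cost lands in
each case.

```python
import zlib


class PristineStore:
    """Hypothetical store: loose entries plus periodically built packs."""

    def __init__(self, auto_repack_threshold=None):
        self.loose = []   # entries not yet packed
        self.packs = []   # compressed packs built so far
        # None means option (2): packing happens only on explicit repack().
        self.auto_repack_threshold = auto_repack_threshold

    def add(self, data):
        """Store one entry; return True if this call triggered a repack."""
        self.loose.append(data)
        if (self.auto_repack_threshold is not None
                and len(self.loose) >= self.auto_repack_threshold):
            self.repack()  # option (1): this particular call is slow
            return True
        return False

    def repack(self):
        """Combine all loose entries into one compressed pack."""
        if self.loose:
            self.packs.append(zlib.compress(b"".join(self.loose)))
            self.loose = []
```

With a threshold, most add() calls are cheap and an occasional one
absorbs the whole repacking cost. Without one, every add() is cheap
but space is reclaimed only when someone remembers to call repack().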

// Ben
Received on 2012-03-23 11:00:08 CET

This is an archived mail posted to the Subversion Dev mailing list.