
Re: Compressed Pristines (Summary)

From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Wed, 4 Apr 2012 02:39:05 -0700 (PDT)

Combined response inline...

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>First, thanks for your great summary. I'll throw in just my 2 cents below.

The pleasure is mine.

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>Was any of those tests actually executed on a file system supporting something like "block suballocation", "tail merging" or "tail packing"?

No, not to my knowledge. Mine was on standard installations of Ubuntu 11.10. And I was trying to calculate the waste on a system that *didn't* have them enabled.
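For anyone who wants to repeat the measurement, here is a minimal sketch of that kind of calculation. It is not the script I used. It assumes every file occupies a whole number of blocks, that the block size is 4096 bytes (an assumption, adjust it for your file system), and it ignores directory entries and other metadata. The function names are made up for illustration.

```python
import os

def block_rounding_waste(root, block_size=4096):
    """Return (logical_bytes, allocated_bytes) for regular files under root.

    Assumes each file is stored in whole blocks of block_size bytes,
    i.e. no tail packing or block suballocation.
    """
    logical = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            logical += size
            # Round the size up to the next multiple of block_size.
            allocated += -(-size // block_size) * block_size
    return logical, allocated

def overhead_percent(logical, allocated):
    """Waste relative to the logical size, as a percentage."""
    if logical == 0:
        return 0.0
    return 100.0 * (allocated - logical) / logical
```

Pointing block_rounding_waste() at a working copy's pristine directory gives the two totals, and overhead_percent() turns them into a single figure that can be compared across file systems.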

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>Today, I was rather surprised that my pristine subdir of one of our main projects which contains 726 MB of data has an actual disk size of 759 MB, which leads to an overhead of less than 4% due to block-size rounding. (According to the Explorer "Properties" dialog of Win 7 on a NTFS file system.)

Did you have NTFS compression enabled?

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>AFAICS, "modern" file systems increasingly support that kind of feature[1], so we should at least think about how much effort we want to throw at the "packing" part of the problem if it's likely to vanish (or, at least, being drastically reduced) in the future.

[snip]

> From: Mark Therieau <mtherieau_at_gmail.com>
>
>Another thought would be to pursue a FUSE-like approach similar to scord [1][2]
[snip]

> From: Julian Foad <julianfoad_at_btopenworld.com>
>
>1.  Filesystem compression.
>
>Would you like to assess the feasibility of compressing the pristine store by re-mounting the "pristines" subdirectory as a compressed subtree in the operating system's file system?

No :-)

There are two ways to answer this interesting proposition of compressed file systems. The obvious one is that it isn't something SVN can or should control. The file system, and certainly the system drivers, are up to the user, and any requirement or suggestion to tamper with them is decidedly unwarranted and unexpected from a VCS.

The second is more relevant, however. The user may *still* enable/use these schemes with or without compressed pristine support. After all, we are only concerned with the pristine store and *not* the working copy. So there is still room for these technologies, if/when the user feels so inclined to utilize them.

So I'd say there is nothing preventing users from using them, at their own responsibility, to get further disk savings. At the same time, they are markedly out of scope for the compressed pristines feature, if not for SVN as a system.

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>Additionally, the simple and efficient way of storing the pristines in a SQLite database (one blob per file) also prevents us from exploiting inter-file redundancies during compression, while adding a packing layer on top of sqlite leads to both high complexity and a large average blob size, and large blobs are probably more efficiently handled by the FS directly.

Yes. That's what the proposal I drafted is claiming.

> From: Markus Schaber <m.schaber_at_3s-software.com>
>
>To cut it short: I'll "take" whatever solution emerges, but my gut feeling tells me that we should use plain files as containers, instead of using sqlite.
>
>The other aspects (grouping similar files into the same container before compression, applying a size limit for containers, and storing uncompressible files in uncompressed containers) are fine as discussed.
>
>I'll try to run some statistics using publicly available projects on an NTFS file system, just for comparison.
>

That would be great. Please share your findings.

> From: Mark Therieau <mtherieau_at_gmail.com>
>
>If the full goal is to reduce pressure on the underlying file system in the presence
>of many large working copies (e.g. one per branch) then duplicate pristine contents,
>even with super-awesome compression would not match the space savings of a
>de-duplicated, pristine-aware, copy-on-write file system.

That's assuming there are many duplicates. This is certainly possible, especially with many branches/tags checked out from the same source. But I suspect the more common scenario is a single branch checked out from each of several different repositories. In other words, unless we have solid numbers showing that de-duplication saves more, the working assumption is that improving a single branch through compression will be more useful to more users. Plus, your suggestion probably belongs to the unified pristine store (aka ~/.svn), which is out of scope for compressed pristines.

> From: Julian Foad <julianfoad_at_btopenworld.com>
>
>The pristine store implementation also needs to provide *uncompressed* copies of the files.  Some of the API consumers can and should read the data through svn_stream_t; this is the easy part.  Other API consumers -- primarily those that invoke an external 'diff' tool -- need to be given access to a complete uncompressed file on disk.

This is certainly a -minor- complication we'll have to deal with. It's just a technicality, not a show stopper or a problem per se. The pristine/tmp folder could be cleaned up via svn cleanup, for example, or at various checkpoints. The worst-case scenarios are either to clutter the disk with too many temporary uncompressed pristines or to delete them prematurely and force the user to re-run their last command. Neither is fatal, and it's easy to find a middle ground that handles both.
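To illustrate the middle ground, here is a minimal sketch of the two halves: writing an uncompressed copy into a temp folder for an external tool, and removing copies by age. Everything in it is an assumption made for illustration: the function names, the use of zlib on a whole-file compressed pristine, and the age-based policy. It is not how SVN does or will do it, and it reads the whole file into memory, which a real implementation would avoid by streaming.

```python
import os
import tempfile
import time
import zlib

def materialize_pristine(compressed_path, tmp_dir):
    """Decompress a zlib-compressed pristine into tmp_dir; return the new path."""
    os.makedirs(tmp_dir, exist_ok=True)
    with open(compressed_path, "rb") as src:
        data = zlib.decompress(src.read())
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".pristine")
    with os.fdopen(fd, "wb") as dst:
        dst.write(data)
    return path

def cleanup_tmp(tmp_dir, max_age_seconds, now=None):
    """Delete temp pristines older than max_age_seconds; return how many were removed."""
    if now is None:
        now = time.time()
    removed = 0
    for name in os.listdir(tmp_dir):
        path = os.path.join(tmp_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed += 1
    return removed
```

A generous age limit leans towards clutter and a short one towards premature deletion, so the limit is the knob for the trade-off described above.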

-Ash
Received on 2012-04-04 11:39:49 CEST

This is an archived mail posted to the Subversion Dev mailing list.
