
Re: Compressed Pristines (Summary)

From: Markus Schaber <m.schaber_at_3s-software.com>
Date: Mon, 2 Apr 2012 09:30:38 +0000

Hi, Ashod,

First, thanks for your great summary. I'll throw in just my 2 cents below.

> From: Ashod Nakashian [mailto:ashodnakashian_at_yahoo.com]
 
> Pristine files currently incur almost 100%[2] overhead both in terms of
> disk footprint and file count in a given WC. Since pristine files is a
> design element of SVN, reducing their inherent overhead should be a
> welcome improvement to SVN from a user's perspective. Due to the nature of
> source files that tend to be small, the footprint of a pristine store (PS)
> is larger on disk than the actual total bytes because of internal
> fragmentation (file-system block-size rounding waste) - see references for
> numbers.
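(For anyone who wants to reproduce the block-size rounding numbers quoted above on their own working copy, a short script along these lines should do; the 4 KiB default block size is an assumption and should be adjusted to the file system under test.)

```python
import os

def fragmentation_overhead(root, block_size=4096):
    """Estimate internal fragmentation under root: bytes wasted by
    rounding each file up to the next file-system block boundary.
    block_size is an assumption; real allocation may differ (e.g. NTFS
    stores very small files resident in the MFT)."""
    logical = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            logical += size
            # round up to a whole number of blocks
            allocated += -(-size // block_size) * block_size
    waste = allocated - logical
    return logical, allocated, (waste / logical if logical else 0.0)
```

This only models the naive "one block granularity" case, which is exactly what tail packing avoids, so comparing its output against the actual on-disk size is a quick way to see whether the file system packs tails.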

Were any of those tests actually executed on a file system that supports something like "block suballocation", "tail merging", or "tail packing"?

Today, I was rather surprised that the pristine subdir of one of our main projects, which contains 726 MB of data, has an actual disk size of 759 MB; that is an overhead of only about 4.5% (33 MB / 726 MB) due to block-size rounding. (According to the Explorer "Properties" dialog of Win 7 on an NTFS file system.)

AFAICS, "modern" file systems increasingly support that kind of feature[1], so we should at least think about how much effort we want to spend on the "packing" part of the problem if it is likely to vanish (or at least be drastically reduced) in the future. My concern is that storing small pristines in their own SQLite database will also bring some overhead of a similar magnitude, due to SQLite metadata, the necessary primary key column, and indexing.

Additionally, the simple and efficient way of storing the pristines in a SQLite database (one blob per file) prevents us from exploiting inter-file redundancies during compression. Adding a packing layer on top of SQLite, on the other hand, leads to both high complexity and a large average blob size, and large blobs are probably handled more efficiently by the FS directly.
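(To make the "one blob per file" variant concrete: a minimal sketch could look like the following. The schema, table and column names are my own illustration, not anything proposed on the list.)

```python
import sqlite3

def make_pristine_db(path):
    """Open (or create) a hypothetical pristine store: one row per
    pristine, keyed by content checksum, content held as a blob."""
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS pristine (
                       checksum TEXT PRIMARY KEY,   -- e.g. SHA-1 of content
                       size     INTEGER NOT NULL,
                       content  BLOB NOT NULL)""")
    return con

def store(con, checksum, data):
    con.execute("INSERT OR REPLACE INTO pristine VALUES (?, ?, ?)",
                (checksum, len(data), sqlite3.Binary(data)))
    con.commit()

def fetch(con, checksum):
    row = con.execute("SELECT content FROM pristine WHERE checksum = ?",
                      (checksum,)).fetchone()
    return bytes(row[0]) if row else None
```

Note that each blob is stored (and would be compressed) in isolation, so redundancy between two similar pristines is invisible to the compressor; that is the limitation described above.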

To cut it short: I'll "take" whatever solution emerges, but my gut feeling tells me that we should use plain files as containers, instead of using sqlite.

The other aspects (grouping similar files into the same container before compression, applying a size limit for containers, and storing uncompressible files in uncompressed containers) are fine as discussed.
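(The grouping-plus-size-limit part could be sketched roughly as below; grouping by file extension is just one stand-in for "similar files", and the container size cap is an arbitrary assumption. Actual compression of each container is omitted.)

```python
import os
from collections import defaultdict

MAX_CONTAINER_BYTES = 4 * 1024 * 1024  # assumed size limit, not from the list

def plan_containers(paths, max_bytes=MAX_CONTAINER_BYTES):
    """Group files by a crude similarity key (file extension), then cut
    each group into containers no larger than max_bytes, so each
    container can later be compressed as one stream."""
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.splitext(p)[1]].append(p)
    containers = []
    for ext in sorted(groups):
        current, used = [], 0
        for p in sorted(groups[ext]):
            size = os.path.getsize(p)
            if current and used + size > max_bytes:
                containers.append(current)
                current, used = [], 0
            current.append(p)
            used += size
        if current:
            containers.append(current)
    return containers
```

Files the compressor cannot shrink would then simply go into containers that are stored uncompressed, as discussed.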

I'll try to run some statistics using publicly available projects on an NTFS file system, just for comparison.

Best regards

Markus Schaber

[1]: http://msdn.microsoft.com/en-us/library/windows/desktop/ee681827%28v=vs.85%29.aspx claims tail packing support for NTFS. http://en.wikipedia.org/wiki/Block_suballocation claims support for BtrFS, ReiserFS, Reiser4, FreeBSD UFS2. And AFAIR, XFS has a similar feature. Sadly, Ext[2,3,4] are not on that list yet, but rumors claim that Ext4 is to be replaced by BtrFS in the long run.

-- 
___________________________
We software Automation.
3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax +49-831-54031-50
Email: m.schaber@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: http://www.3s-software.com/index.shtml?sample_projects
Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915 
Received on 2012-04-02 11:31:18 CEST

This is an archived mail posted to the Subversion Dev mailing list.
