Julian Foad wrote:
> On Thu, 2008-11-27 at 03:17 -0800, Blair Zajac wrote:
>> Hyrum K. Wright wrote:
>>> Hi all.
>>> As of r34446, the implementation of packing on fsfs is functionally complete on
>>> the fsfs-pack branch. For those that don't know, packing consists of mushing
>>> all the individual rev files in a completed shard into one file, thus
>>> eliminating the inode penalty for that entire shard. Packing a trunk-generated
>>> copy of the ASF repository saved about 1 GB on a 24 GB repo. There may be
>>> additional performance benefits in dealing with a much smaller set of rev files
>>> (OS caching, etc.), but I haven't yet investigated that.
>>> This comes at a cost: the offsets of revisions in the pack file are stored
>>> separately, and thus require an additional open/seek to get that information.
>>> Also, determining whether a revision is stored in a pack file or not
>>> requires additional I/O. I think that most of this can be eliminated with
>>> caching and heuristics, but those haven't yet been implemented.
>>> I'm not currently planning on including this functionality in 1.6, as it's a
>>> kinda biggish feature, the optimizations aren't yet in place, and I feel like merging
>>> right before we branch 1.6.x could be a bit destabilizing. However, I could
>>> easily be talked into it. :)
>>> Anyway, I'm soliciting feedback on the implementation and usage of this feature.
>>> Comments welcome.
>> This is something I'll definitely need. We're going to be having multiple
>> repositories each with a million revisions, so having fsfs packing will make the
>> repository much easier to work with. Also, any fsck's will be much faster :)
>> So +1 for merging into trunk from me.
> Do we have any cost/benefit numbers to demonstrate the "much easier to
> work with", or a qualitative description that you could point me to? I
> couldn't find anything by searching for "pack" or "inode" in the mail
The basic benefit is the elimination of what I term "the inode problem". Any
file on disk is physically stored on a set of sectors (strictly speaking, file
system blocks), which are of fixed size. It depends on the file system, but in
most cases a sector cannot store parts of multiple files. *Every* file requires
at least one sector, so if a file only contains 1 byte of actual data, it still
reserves an entire sector on disk (though again this varies by file system). The
size of a sector may range from 512 bytes to 32 kB or more. IIRC, ext3 uses
4 kB sectors.
Because we store lots of little files with FSFS, we take a space hit. Using an
example sector size of 4 kB, and assuming the size of the files modulo the sector
size is uniformly distributed, we can expect about 2 kB of wasted space *per
revision* on average. For our repository, that would mean 34.5 k * 2 kB = ~70 MB
of wasted space in a ~450 MB repository, which is by no means trivial. By
smashing the little files together, we reduce this internal fragmentation in the
file system and increase space efficiency. Additionally, larger files are
easier for the file system and operating system to handle in many cases, and may
help with our own internal caching.
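
To make the arithmetic above concrete, here is a rough Python sketch of the
estimate. It is only an illustration, not anything from the fsfs-pack branch:
the rev file sizes are made up (drawn uniformly at random), the 4 kB allocation
unit is the example figure from above, the shard size of 1000 is an assumption,
and the small extra file that records offsets within each pack is ignored.

```python
import random

BLOCK = 4096        # assumed allocation unit in bytes (the 4 kB example)
SHARD_SIZE = 1000   # assumed number of rev files per shard

def allocated(size, block=BLOCK):
    """Bytes reserved on disk for a file of `size` >= 1 bytes."""
    return -(-size // block) * block  # round up to a whole number of blocks

def slack(sizes, block=BLOCK):
    """Total bytes reserved but unused when each size is a separate file."""
    return sum(allocated(s, block) - s for s in sizes)

# Hypothetical data: 34,500 rev files with uniformly distributed sizes.
rng = random.Random(0)
sizes = [rng.randint(1, 64 * 1024) for _ in range(34_500)]

# Unpacked: every revision is its own file and pays its own slack.
unpacked_waste = slack(sizes)

# Packed: each shard's rev files are concatenated into one pack file,
# so only the tail of each pack file wastes space.
shards = [sizes[i:i + SHARD_SIZE] for i in range(0, len(sizes), SHARD_SIZE)]
packed_waste = slack([sum(shard) for shard in shards])

print(f"unpacked: {unpacked_waste / 1e6:.1f} MB wasted")
print(f"packed:   {packed_waste / 1e3:.1f} kB wasted")
```

With sizes like these the unpacked waste should come out at roughly 2 kB per
file, while the packed waste is bounded by one sector per shard.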
I haven't yet run a comprehensive performance analysis on a packed repository
vs. a non-packed one, so I don't have any quantitative numbers yet (other than
I hope this helps!
This number is actually much larger due to revprops. Revprop file sizes are
*not* uniformly distributed; they are concentrated at the small end, meaning
that the average revprop file is wasting even more space. However, due to the
mutability of revprops, we can't (yet) pack them into immutable pack files. :(
Received on 2008-11-28 08:20:43 CET