Hyrum K. Wright wrote:
> The basic benefit is the elimination of what I term "the inode problem". Any
> file on disk is physically stored on a set of sectors, which are of fixed size.
> It depends on the file system, but most sectors cannot store parts of multiple
> files. *Every* file requires at least one sector, so if a file only contains 1
> byte of actual data, it still reserves an entire sector on disk (though again
> this varies by file system). The size of a sector may range from 512 bytes to
> 32 kB or more. IIRC, ext3 uses 4 kB sectors.
>
The inode problem is a different problem: many file systems allocate
inode slots up front, and if they become exhausted, no new files can be
created even if the disk isn't fully in use. FSFS packing will
definitely help address the inode problem, in that 1000 commits will
now take up 1 inode instead of 1000 inodes.
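
As a quick sanity check (a sketch only; "df -i" is standard, but the
repository path here is made up):

    df -i /srv/svn                               # IUsed/IFree show inode slot usage
    find /srv/svn/repo/db/revs -type f | wc -l   # roughly one inode per rev file

After packing, each full 1000-revision shard becomes a single pack
file, so the second number (and the inodes behind it) drops
accordingly.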
For the problem you are talking about, the usual term is blocks
(sectors are normally the unit a hard drive is accessed in, almost
always 512 bytes), and it is a matter of efficiency: the smaller the
block size, the more administrative data the file system has to
maintain and analyze for every file system operation. It's a compromise
between performance and space, and 4 kB usually works well for most
applications.

In a real-life example around here, summing the actual byte size of the
files in a large enough Subversion repository's revs directory gives us
1848217k, but du -ks gives us 1905916k. Somebody might say "3% savings
potential by packing the files? That sounds great!" That number is a
bit misleading, though: it reflects the amount used from the file
system's content area, but not the administrative data required to
manage all these files. For example, we could reformat the ext3
partition to use a block size of 2k instead of 4k, and effectively cut
in half the "tail bytes" that don't fill the last block of most files.
But what if the cost of cutting the tail bytes in half is to double the
administrative data required to manage the blocks on disk, and to slow
down deciding which block to read or which block to allocate? Can FSFS
packing use less administrative + content overhead than the file system
does? Probably, but we're still talking about the 3% range.
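
To put a number on the tail bytes directly, here is a rough sketch
(assumes GNU find/awk and a 4 kB block size; the path is hypothetical):

    # sum actual file sizes vs. sizes rounded up to whole 4 kB blocks
    find /srv/svn/repo/db/revs -type f -printf '%s\n' \
      | awk '{ sum += $1; alloc += int(($1 + 4095) / 4096) * 4096 }
             END { printf "%dk actual, %dk allocated\n", sum/1024, alloc/1024 }'

The difference between the two numbers is the block-tail waste that a
smaller block size, or packing, would recover.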
I'm more interested in Subversion achieving something on the order of
what git packs achieve. That is, since the content of commits to the
same repository over the same span of time probably shares a large
number of similarities, can those similarities be compressed in a way
that still allows random access with low extraction overhead? I would
like to see a 20X reduction in size on disk, rather than a 3%
reduction. Too optimistic? :-)
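
git itself makes it easy to get a feel for what delta packing buys on
one's own data; a sketch (results vary widely by repository):

    git count-objects -v   # "size" line: loose objects, in kB
    git repack -a -d       # delta-compress everything into packs
    git count-objects -v   # "size-pack" line: packed size, in kB

The gap between the before and after numbers is exactly the
cross-revision similarity that delta compression exploits.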
> [1] This number is actually much larger due to revprops. These do *not* have a
> uniform distribution, and are actually skewed to the left, meaning that the
> average revprop file is wasting even more space. However, due to the mutability
> of revprops, we can't (yet) pack them into immutable pack files. :(
>
Based on the same repository I mention above, the size of all the
revprops summed together is 3682k, but the size on disk is 100348k.
That is about 96.3% wasted space for revprops.

Revprops are therefore a much more significant target for shard packing
than revs: a 27X reduction in space based on the above numbers, even
before doing anything like gzip -1 on the content. :-)
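
For the gzip -1 part, a rough sketch of checking both numbers at once
(paths are hypothetical; assumes GNU find, run from the repository
root):

    du -ks db/revprops                         # size on disk: 100348k here
    find db/revprops -type f -printf '%s\n' \
      | awk '{ s += $1 } END { print int(s/1024) "k actual" }'   # 3682k here
    tar cf - db/revprops | gzip -1 | wc -c     # rough packed+compressed estimate

Even the fastest gzip level should squeeze the 3682k of property text
down further, since revprops are mostly small, similar text files.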
Cheers,
mark
--
Mark Mielke <mark_at_mielke.cc>