On Sun, Mar 25, 2012 at 7:17 PM, Ashod Nakashian
<ashodnakashian_at_yahoo.com> wrote:
[snip]
>> From: Hyrum K Wright <hyrum.wright_at_wandisco.com>
[snip]
>>In some respects, it looks like you're solving *two* problems:
>>compression and the internal fragmentation due to large FS block
>>sizes. How orthogonal are the problems? Could they be solved
>>independently of each other in some way? I know that compression
>>exposes the internal fragmentation issue, but used alone it certainly
>>doesn't make things *worse* does it?
>
> Compression exposes internal fragmentation and, yes, it makes it *worse* (see hard numbers below).
> Therefore, compression and internal fragmentation are orthogonal only if we care about absolute savings. (In other words, compressed files cause more internal fragmentation, but overall footprint is still reduced, however not as efficiently as ultimately possible.)
By "doesn't make things worse", maybe Hyrum meant that compression
doesn't magically cause more blocks to be used because of
fragmentation. I mean, sure there is more fragmentation relative to
the amount of data, but that's just because the amount of data
decreased, right? Anyway, it depends on how you look at it, not too
important.
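
To make the relative-vs-absolute point concrete, here's a toy
calculation (the 4 KiB block size and the 10 KB / 3 KB file sizes are
just made-up numbers, nothing measured):

[[[
# Toy illustration only: block size and file sizes are made-up assumptions.
BLOCK = 4096  # assume a 4 KiB filesystem block

def blocks_used(size):
    return -(-size // BLOCK)            # ceiling division

def waste(size):
    return blocks_used(size) * BLOCK - size

for label, size in (("uncompressed", 10_000), ("compressed", 3_000)):
    pct = 100.0 * waste(size) / (blocks_used(size) * BLOCK)
    print(label, blocks_used(size), "blocks,", waste(size),
          "bytes wasted (%.0f%% of the allocation)" % pct)
]]]

The compressed file occupies fewer blocks in absolute terms, but a
larger fraction of its allocation is internal waste.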
[snip]
>
> Since the debated techniques (individual gz files vs packing+gz) for implementing Compressed Pristines are within reach with existing tools, and indeed some tests were done to yield hard figures (see older messages in this thread), it's reasonable to run simulations that can show with hard numbers the extent to which the speculations and estimations done (mostly by yours truly) regarding the advantages of a custom pack file are justifiable.
>

I haven't read the design doc yet (it's a bit too big for me at the
moment, I'm just following the dev threads), so sorry if I'm talking
nonsense.

Wouldn't gz+packing (compressing each pristine individually, then
packing the compressed results into a pack file) be another
interesting compromise? It wouldn't exploit inter-file similarities,
but it would still yield compression and reduced fragmentation. Can
you test that as well? Maybe that gets us "80% of the gain for 20% of
the effort" ...

I'm certainly not an expert, but intuitively the packing+gz approach
seems difficult to me, if only because you'd need to uncompress a full
pack file just to read a single pristine (since offsets are relative
to the uncompressed stream). So the advantage of exploiting inter-file
similarities had better be worth it.

When going for gz+packing, there is no need to uncompress an entire
pack file just to start reading a single pristine. You can simply keep
offsets (in wc.db or wherever) to where each individually compressed
pristine starts inside the pack file.
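
Roughly like this, assuming a simple side index of (offset, size) per
checksum (all names here are made up; the index could just as well be
columns in wc.db):

[[[
import gzip, os

def append_pristine(pack_path, index, checksum, data):
    member = gzip.compress(data)                 # compress this pristine alone
    with open(pack_path, "ab") as pack:
        offset = pack.seek(0, os.SEEK_END)
        pack.write(member)
    index[checksum] = (offset, len(member))      # e.g. stored in wc.db

def read_pristine(pack_path, index, checksum):
    offset, size = index[checksum]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)                        # jump straight to the member
        return gzip.decompress(pack.read(size))  # only this member is inflated

# usage sketch
index = {}
append_pristine("pristines.pack", index, "fake-checksum", b"pristine text\n" * 500)
assert read_pristine("pristines.pack", index, "fake-checksum") == b"pristine text\n" * 500
]]]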

Why not simply compress the "shards" we already have in the pristine
store (sharded by the first two characters of the pristine checksum)?
Or do we run the risk that such compressed shards become too large
(e.g. larger than 2 GB), and is that something we want to avoid?
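
For what it's worth, that could be as simple as something like the
following (purely a sketch of the idea; the layout is the usual
.svn/pristine/<xx>/ sharding, everything else is assumed, and it
ignores how reads would then work):

[[[
import os, tarfile

def compress_shards(pristine_dir):
    # Roll each two-hex-character shard directory into one shard.tar.gz.
    for shard in sorted(os.listdir(pristine_dir)):
        shard_path = os.path.join(pristine_dir, shard)
        if len(shard) == 2 and os.path.isdir(shard_path):
            with tarfile.open(shard_path + ".tar.gz", "w:gz") as tar:
                tar.add(shard_path, arcname=shard)
]]]
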
--
Johan
Received on 2012-03-26 00:11:43 CEST