----- Original Message -----
> From: Johan Corveleyn <jcorvel_at_gmail.com>
> To: Ashod Nakashian <ashodnakashian_at_yahoo.com>
> Cc: "dev_at_subversion.apache.org" <dev_at_subversion.apache.org>
> Sent: Monday, March 26, 2012 3:10 AM
> Subject: Re: Compressed Pristines (Simulation)
> On Sun, Mar 25, 2012 at 7:17 PM, Ashod Nakashian
> <ashodnakashian_at_yahoo.com> wrote:
>>> From: Hyrum K Wright <hyrum.wright_at_wandisco.com>
>>> In some respects, it looks like you're solving *two* problems:
>>> compression and the internal fragmentation due to large FS block
> sizes. How orthogonal are the problems? Could they be solved
> independently of each other in some way? I know that compression
>>> exposes the internal fragmentation issue, but used alone it certainly
>>> doesn't make things *worse* does it?
>> Compression exposes internal fragmentation and, yes, it makes it *worse*
> (see hard numbers below).
>> Therefore, compression and internal fragmentation are orthogonal only if we
> care about absolute savings. (In other words, compressed files cause more
> internal fragmentation, but overall footprint is still reduced, however not as
> efficiently as ultimately possible.)
> By "doesn't make things worse", maybe Hyrum meant that compression
> doesn't magically cause more blocks to be used because of
> fragmentation. I mean, sure there is more fragmentation relative to
> the amount of data, but that's just because the amount of data
> decreased, right? Anyway, it depends on how you look at it, not too
Yes, I should've made myself clearer. That is indeed the case, but the opportunity for further reduction in disk space also increases (which is a prime interest of this feature).
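To put illustrative numbers on it (assuming a 4 KB filesystem block size; the figures are made up, not from the simulation): a 5 KB pristine occupies 8 KB on disk, i.e. 3 KB of internal fragmentation. Compressed to, say, 1.5 KB it still occupies a full 4 KB block, so the relative waste goes up (2.5 KB wasted for 1.5 KB of data) even though the absolute footprint drops from 8 KB to 4 KB. Packing is what removes that per-file rounding and reclaims the remainder.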
>> Since the debated techniques (individual gz files vs packing+gz) for
> implementing Compressed Pristines are within reach with existing tools, and
> indeed some tests were done to yield hard figures (see older messages in this
> thread), it's reasonable to run simulations that can show with hard numbers
> the extent to which the speculations and estimations done (mostly by yours
> truly) regarding the advantages of a custom pack file are justifiable.
> I didn't read the design doc yet (it's a bit too big for me at the
> moment, I'm just following the dev-threads), so sorry if I'm saying
You should read it :-) As you'll see, some of your points are exactly what's being suggested there.
> Wouldn't gz+packing be another interesting compromise? It wouldn't
> exploit inter-file similarities, but it would yield compression and
> reduced fragmentation. Can you test that as well? Maybe that gets us
> "80% gain for 20% of the effort" ...
Yes. And that's a "stage" in the implementation of this feature. We don't have to go for the optimal implementation right away; packing alone will already help. What's being debated is whether a custom file format (the pack file) is necessary.
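Just to make the "packing only" stage concrete, a minimal sketch (in Python, purely illustrative; the names, layout and index shape are made up, not the proposed format):

    import gzip, os

    def pack_pristines(pristine_paths, pack_path):
        """Gzip each pristine individually and append it to one pack file.
        Returns a hypothetical index: name -> (offset, compressed size)."""
        index = {}
        with open(pack_path, 'wb') as pack:
            for path in pristine_paths:
                with open(path, 'rb') as f:
                    blob = gzip.compress(f.read())
                offset = pack.tell()
                pack.write(blob)
                index[os.path.basename(path)] = (offset, len(blob))
        return index

No inter-file similarities are exploited, but the per-file block rounding is gone and any single pristine can still be read back on its own.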
> I'm certainly not an expert, but intuitively the packing+gz approach
> seems difficult to me, if only because you need to uncompress a full
> pack file to be able to read a single pristine (since offsets are
> relative to the uncompressed stream). So the advantage of exploiting
> inter-file similarities better be worth it.
To avoid that, we'll compress individual blocks (each containing at least 1 file, and potentially many to exploit inter-file similarities).
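Roughly like this (a sketch only; the block size and grouping policy below are made-up values, not what the design doc specifies):

    import zlib

    BLOCK_TARGET = 64 * 1024  # illustrative block size

    def build_blocks(pristines):
        """pristines: list of (name, data). Groups files into blocks,
        compresses each block as a unit, and returns (blocks, index)
        where index maps name -> (block number, offset, length)."""
        blocks, index = [], {}
        current, current_size = [], 0
        for name, data in pristines:
            index[name] = (len(blocks), current_size, len(data))
            current.append(data)
            current_size += len(data)
            if current_size >= BLOCK_TARGET:
                blocks.append(zlib.compress(b''.join(current)))
                current, current_size = [], 0
        if current:
            blocks.append(zlib.compress(b''.join(current)))
        return blocks, index

    def read_from_blocks(blocks, index, name):
        block_no, offset, length = index[name]
        raw = zlib.decompress(blocks[block_no])
        return raw[offset:offset + length]

Reading one pristine then costs inflating one block, not the whole pack, while small or similar files that share a block still compress together.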
> When going for gz+packing, there is no need to uncompress an entire
> pack file just to start reading a single pristine. You can just keep
> offsets (in wc.db or whatever) to where the individual compressed
> pristines exist inside the pack file.
Indeed, that's what's proposed, although there are two suggestions for where to keep those offsets: a custom index file that dumps structures which can be reloaded quickly, or wc.db.
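The read path would then be something like this (a sketch assuming the offsets live in wc.db; the pristine_pack table and its columns are hypothetical, not an actual wc.db schema):

    import gzip, sqlite3

    def read_pristine(wc_db_path, pack_path, checksum):
        """Look up a pristine's location in wc.db and read just that
        compressed member out of the pack file (hypothetical schema)."""
        with sqlite3.connect(wc_db_path) as db:
            offset, size = db.execute(
                "SELECT offset, compressed_size FROM pristine_pack"
                " WHERE checksum = ?", (checksum,)).fetchone()
        with open(pack_path, 'rb') as pack:
            pack.seek(offset)
            return gzip.decompress(pack.read(size))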
> Why not simply compress the "shards" we already have in the pristine
> store (sharded by the first two characters of the pristine checksum)?
> Or do we run the risk that such compressed shards are going to become
> too large (e.g. larger than 2 GB), and we want to avoid such a thing?
They may be too large (as you suspected); what unites the files within a shard is their hash prefix, not their contents; it requires the same infrastructure to locate a file and the same overhead of managing inserts/deletes... so it has no real advantage and all the same problems.
The proposal is a custom file format that is simple, supports all our requirements out of the box, and is file-name and file-type aware so it can exploit that information if necessary.
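For a feel of what such a format could carry per entry, here is a purely illustrative record header (field choice and sizes are made up; the real layout is whatever the design doc defines):

    import struct

    # Hypothetical per-record header: 20-byte SHA-1, 1-byte flags
    # (compression type, file-type hints, etc.), uncompressed size,
    # compressed size.
    RECORD_HEADER = struct.Struct('<20sBQQ')

    def write_record(pack, sha1, flags, data, compressed):
        """Append one record (hypothetical layout) to an open pack file."""
        pack.write(RECORD_HEADER.pack(sha1, flags, len(data), len(compressed)))
        pack.write(compressed)

A layout like this is trivial to scan and to rebuild an in-memory index from, which is the "simple, out of the box" part.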
Received on 2012-03-26 05:40:28 CEST