Re: Compressed Pristines (Summary)
From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Tue, 10 Apr 2012 00:57:55 -0700 (PDT)
Hi Justin,
Sorry for the late reply, and thanks for your notes. I should say right off the bat that the design doc is outdated in terms of what we plan to do as a first implementation. As a proposal for a packed file format, though, I think it's still mostly valid, except for a few notes and improvements (such as 64-bit file sizes) that are missing or invalid, regardless of whether we'll ultimately implement it.
See my notes and comments inline please.
>________________________________
The straightforward design is to have a single large pack file, but in practice this is very problematic. You can already find filesystems that may barf on multi-GB files, but that aside, consider the overhead of removing a pristine file and shifting the bytes that follow it. The overhead is extreme. To avoid it, we need to track holes in the files (and incur the wasted space on disk) and, even worse, do heavy lifting to fit new/modified pristines into holes where they might not fit! In other words, we'd have to write a complex filesystem inside a single file, keep its size on disk small (to justify this feature!), and do the housekeeping as fast as possible (shifting GBs on disk because we have a largish hole at the beginning of the file has a cost).
My solution is to split the pack files such that each file is small enough to fit in memory and be written to disk in sub-second time. This way, 1) holes in these files can be avoided completely and cheaply, and 2) even if we keep holes, they can't grow too large.
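To make the "small pack" idea concrete, here is a minimal sketch in Python of how removal works when a pack fits in memory: the whole pack is read, the entry dropped, and the pack rewritten, so there is no hole tracking or byte shifting. The record layout (length-prefixed name and data) is made up for illustration and is not the format from the proposal doc.

```python
import io
import struct

def write_pack(entries):
    """Serialize {name: bytes} into a toy pack: [name_len][name][data_len][data]..."""
    buf = io.BytesIO()
    for name, data in entries.items():
        raw = name.encode("utf-8")
        buf.write(struct.pack(">I", len(raw)))
        buf.write(raw)
        buf.write(struct.pack(">Q", len(data)))  # 64-bit size, per the doc fix
        buf.write(data)
    return buf.getvalue()

def read_pack(blob):
    """Parse the toy pack back into {name: bytes}."""
    entries, pos = {}, 0
    while pos < len(blob):
        (nlen,) = struct.unpack_from(">I", blob, pos); pos += 4
        name = blob[pos:pos + nlen].decode("utf-8"); pos += nlen
        (dlen,) = struct.unpack_from(">Q", blob, pos); pos += 8
        entries[name] = blob[pos:pos + dlen]; pos += dlen
    return entries

def remove_pristine(blob, name):
    """The pack is small, so drop an entry by rewriting the whole pack in
    memory: no holes, no shifting of GBs on disk."""
    entries = read_pack(blob)
    entries.pop(name, None)
    return write_pack(entries)
```

The same rewrite-it-all approach covers modification (remove + add) for free, which is why keeping packs small makes the housekeeping trivial.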
> The whole point of stashing the small files directly into
I'm a bit confused here. You're assuming, I take it, that we'll use SQLite for small files and the filesystem for larger ones. However, that's not in the proposal; it's what we've agreed on on this list. We aren't going to implement both, at least not for now. What we're going to do is simply push small pristines into pristine.db and in-place compress the larger ones on disk (as a first implementation we'll probably even leave the names the same and change nothing beyond passing the disk I/O through compressed streams). Beyond that, we will probably experiment with packing, but it's a bit soon to worry about that. Any research or help is more than welcome, though!
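A rough sketch of that first implementation, in Python for illustration: route small pristines into the database, and pass larger ones through a compressed stream under the same on-disk name. The table schema, the size cutoff, and the use of zlib here are all assumptions, not decisions from this thread.

```python
import sqlite3
import zlib

SMALL_LIMIT = 16 * 1024  # hypothetical cutoff; the real threshold is TBD

def store_pristine(db, store_dir, checksum, data):
    """Small pristines go into pristine.db; larger ones are written
    in-place compressed on disk, keeping the checksum-based name."""
    if len(data) < SMALL_LIMIT:
        db.execute(
            "INSERT OR REPLACE INTO pristine (checksum, data) VALUES (?, ?)",
            (checksum, data))
    else:
        with open(f"{store_dir}/{checksum}", "wb") as f:
            f.write(zlib.compress(data))

def load_pristine(db, store_dir, checksum):
    """Look in the database first, then fall back to the compressed file."""
    row = db.execute("SELECT data FROM pristine WHERE checksum = ?",
                     (checksum,)).fetchone()
    if row is not None:
        return row[0]
    with open(f"{store_dir}/{checksum}", "rb") as f:
        return zlib.decompress(f.read())
```

Since only the I/O path changes (a compressed stream instead of raw reads/writes), callers that address pristines by checksum don't need to know which side a given pristine landed on.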
>
It's all relative! Saying "multiple large pristines in one pack file" assumes too much. I find it better to first define/find/compute an order of pack file size that satisfies our requirements (my crude math puts it on the order of a few MBs - see the proposal doc); from that, the largest pristine that can share a pack file with another follows automatically. Anything smaller can share a pack file with others, and hence (by our definition!) isn't "too large". Larger ones are "really large" (again by our definition) and so will be compressed alone on disk (whether there will be a pack header or not is hopefully not debated for now) - in practice, these files will be in-place compressed as a result.
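The kind of crude math meant here can be sketched as follows. The throughput and time-budget figures are illustrative assumptions, not numbers from the proposal doc, but they show how a pack-size bound (and with it the "too large" cutoff) falls out of the sub-second-rewrite requirement.

```python
# Back-of-the-envelope derivation (illustrative numbers only):
disk_write_mb_per_s = 50    # assumed conservative sustained write speed
target_rewrite_s = 0.1      # sub-second budget for rewriting one pack

# Largest pack we can rewrite wholesale within the budget.
max_pack_mb = disk_write_mb_per_s * target_rewrite_s   # a few MBs

# A pristine can only share a pack if it leaves room for at least one
# other entry; here we assume a simple half-pack cutoff. Anything above
# it is "really large" and gets in-place compressed on its own.
max_shared_pristine_mb = max_pack_mb / 2
```

With these assumed numbers the pack bound comes out at 5 MB and the sharing cutoff at 2.5 MB, which matches the "order of a few MBs" claim; plug in real measurements to get the actual thresholds.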
As for assuming that we only/constantly append to a pack file, that's unfounded. Files may be removed from a pristine store upon svn up. Even if not, a file modification is reasonably implemented as remove+add. This is the correct way to do it, because the file's size might change, and we need to do the same housekeeping as for removing one file and adding an unrelated one. Granted, there is room for improvement here. In other words, knowing it's the same pristine file modified a bit doesn't give us much information of practical use.
>
All welcome notes. We will get back to these issues when we have a working version that we can play and experiment with. There are certainly many things to worry about, and perhaps even more tempting points to toy with. To be pragmatic (and productive) I want to focus on getting the simplest working implementation that can justify this feature (i.e. one that produces real disk savings without too much complexity or performance loss). But points taken.
>
Agreed. I think it's reasonable to support a no-compression type, but keep it abstracted away in the compression layer, not higher up. I also agree it's a premature optimization, so we should do it once we have a working stack.
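One way to keep the no-compression type inside the compression layer, sketched in Python: a hypothetical codec table keyed by an id stored alongside each pristine, so "none" is just another codec and no code above the layer ever branches on it.

```python
import zlib

# Hypothetical codec table: each entry is (compress, decompress). The
# store records a codec id per pristine; "none" costs a table lookup,
# not a separate code path above the compression layer.
CODECS = {
    0: (lambda data: data, lambda data: data),  # none (identity)
    1: (zlib.compress, zlib.decompress),        # zlib
}

def encode(codec_id, data):
    """Run raw pristine bytes through the chosen codec."""
    compress, _ = CODECS[codec_id]
    return compress(data)

def decode(codec_id, blob):
    """Recover raw pristine bytes from stored bytes."""
    _, decompress = CODECS[codec_id]
    return decompress(blob)
```

This also leaves room to add other codecs later without touching callers, which fits the "abstract it in the compression layer" point.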
-Ash
>
This is an archived mail posted to the Subversion Dev mailing list.