I don't know how I missed this mail, but I did. Thanks to Ivan for bringing it to my attention. Please see inline.
----- Original Message -----
> From: Branko Čibej <brane_at_e-reka.si>
> To: dev_at_subversion.apache.org
> Cc:
> Sent: Sunday, March 25, 2012 4:02 AM
> Subject: Re: Compressed Pristines (Design Doc)
>
> On 22.03.2012 17:01, Branko Čibej wrote:
>> On 22.03.2012 16:50, Daniel Shahaf wrote:
>>> Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:
>>>> It's called SQLite.
>>> Heh. I wondered whether I should mention that the server uses BDB to
>>> store pristine files. (yes, the situation there is different in
>>> several relevant ways)
>> To clarify: I'm /not/ advocating that we store each and every file into
>> an SQLite BLOB. Files larger than several block sizes would be better
>> off on disk as real files (the compressor can, e.g., buffer compressed
>> contents up to, say, 32k, and if they become larger, spill directly into
>> a file; otherwise, dump into a BLOB). If we don't care about shared
>> pristine store, we don't even need a separate database, these blobs can
>> go into wc.db (which, as Greg points out, also serves as an index).
>>
>
> Since we need a few datapoints, I made a quick test to see what kind of
> space savings we can get with SQLite. Note that I've not tried any
> auto-vacuum settings, because my test only does insertions.
>
> I used a checkout of the current HTTPD trunk for my data set, and
> compressed all pristines, then moved them into a SQLite database
> depending on size; first, all compressed files 8k or smaller, next, all
> compressed files 32k or smaller. Note that my script does not prune
> empty directories from the pristine fanout. Here's the log:
>
> brane@zulu:~/src/httpd$ svn co http://svn.apache.org/repos/asf/httpd/httpd/trunk
> [...]
> U trunk
> Checked out revision 1305001.
> brane_at_zulu:~/src/httpd$ find trunk/.svn/pristine -type f | wc -l
> 3114
> brane_at_zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
> 42M trunk/.svn/pristine/
> time gzip `find ./trunk/.svn/pristine -name '*.svn-base'`
>
> real 0m14.569s
> user 0m1.282s
> sys 0m0.747s
> brane_at_zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
> 17M trunk/.svn/pristine/
> brane_at_zulu:~/src/httpd$ find trunk/.svn/pristine -size -8k -type f | wc -l
> 2856
> #
> # N.B.: 8k max size per blob
> #
> brane_at_zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
>
> real 0m29.683s
> user 0m0.533s
> sys 0m1.641s
> brane_at_zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
> 4.7M trunk/.svn/pristine/
> brane_at_zulu:~/src/httpd$ ll trunk/.svn/pristine//pristine.db
> -rw-r--r-- 1 brane staff 322560 Mar 25 12:43 ps/pristine.db
> #
> # N.B.: 32k max size per blob
> #
> brane_at_zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
>
> real 0m23.831s
> user 0m0.529s
> sys 0m1.616s
> brane_at_zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
> 1.2M trunk/.svn/pristine/
>
>
> The pristine.py script is attached.
>
> Based on these observations, it's clear that the implementation should
> proceed as follows:
>
> Step 1: Just compress the pristine files, do not use any packing. This
> gives a 60% decrease in disk usage in the HTTPD case, but even if the
> decrease is only 30%, it's still worth the effort.
>
> Step 2: Store small (for some definition of "small") compressed pristine
> files in a SQLite database. In the case of HTTPD, this gives up to an
> extra 90% savings in disk usage, but this is a very specific test case
> and it's hard to guess what kind of gain we'd get on average.
>
> All in all, looking at these numbers, there's a /looong/ way to go before
> we start playing with custom pack formats and compression of packed
> similar files. I'm not at all sure we'll ever really need the potential
> space savings of these methods, especially compared to the risk to WC
> stability that writing and testing such code obviously brings.
Please see the numbers from my simulation below. I haven't run it on HTTPD, but I suspect it would yield results similar to those for SVN trunk. The simulation shows what gains there are if we use a custom pack file, as opposed to the other approaches. The purpose is to avoid speculation and have concrete figures. With that data in hand, I propose we make a final decision.
There is no reason to implement in-place compression for "a while" and then move to packed files (similar to Git's experience) when we can know beforehand whether packing is worthwhile. I don't mind implementing things in stages, with good testing at the end of each, but I'd be less inclined to release such intermediate stages to the public. Similarly, I'd be less motivated to re-implement later to improve things when we can decide now with as much confidence as at any other time, given the data we've already collected. If packing doesn't yield enough savings to warrant it, then let's abandon it and never look back.
As such, I'd like to ask you to review your conclusion based on the following numbers. In particular, I'd like to draw your attention to the larger projects (GCC, WebKit and OO), where the number of small files is truly large and would make wc.db rather large and bloated. (In fact, the average compressed file size in the simulated projects is less than 8KB, with the exception of SVN; for GCC it is less than 2KB!) I don't know how typical these projects are.
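To make the small-file point concrete: with a 4096-byte block, a file that compresses to 1821 bytes still occupies a whole block on disk. Below is a minimal Python sketch of how the "Sub-block waste" figures in the output can be reproduced, assuming each file is simply rounded up to a whole number of blocks. The function names are mine and are not taken from the simulation script, and real filesystems may treat very small files differently.

```python
BLOCK_SIZE = 4096  # bytes, as reported by the simulation


def on_disk_size(size, block=BLOCK_SIZE):
    """Round a file size up to a whole number of blocks (assumed model)."""
    return -(-size // block) * block  # ceiling division without floats


def sub_block_waste(sizes, block=BLOCK_SIZE):
    """Return (logical bytes, on-disk bytes, wasted bytes) for a set of files."""
    logical = sum(sizes)
    on_disk = sum(on_disk_size(s, block) for s in sizes)
    return logical, on_disk, on_disk - logical


if __name__ == "__main__":
    # A single file of GCC's average gz size: one full block, mostly slack.
    print(sub_block_waste([1821]))
    # Consistency check against the svn trunk totals in the output:
    # logical bytes + sub-block waste should equal the on-disk figure.
    print(44730415 + 3942353 == 48672768)
```

Under this model the waste grows with the number of files rather than with their total size, which is why the per-file approaches do comparatively poorly on GCC.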
Finally, I'd like to call on the community to vote on a design. It's a topic that has been discussed on and off for almost a decade. There is no doubt that it's a feature many look forward to, but we should not over-design it and bloat our codebase. Having said that, I think it's fair to say we now have much more information and data with which to make a very educated judgment, if I may say so myself. Branko's SQLite Python script combined with my bash simulation can produce something very close to a real implementation, mixing and matching the schemes suggested and discussed.
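For anyone who wants to experiment with the hybrid scheme Branko described (small compressed pristines in SQLite, larger ones spilled to real files), here is a minimal Python sketch. It is not the attached pristine.py; the table layout, the ".svn-base.z" suffix, the 32k default and all function names are assumptions made purely for illustration.

```python
import os
import sqlite3
import zlib

BLOB_LIMIT = 32 * 1024  # hypothetical threshold for "small" compressed files


def open_store(db_path):
    """Open (or create) the SQLite file holding small compressed pristines."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pristine ("
        "checksum TEXT PRIMARY KEY, content BLOB NOT NULL)"
    )
    return db


def store_pristine(db, spill_dir, checksum, data, limit=BLOB_LIMIT):
    """Compress data; keep it as a BLOB if small, else spill it to a file."""
    compressed = zlib.compress(data)
    if len(compressed) <= limit:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT OR REPLACE INTO pristine (checksum, content) "
                "VALUES (?, ?)",
                (checksum, compressed),
            )
        return "blob"
    with open(os.path.join(spill_dir, checksum + ".svn-base.z"), "wb") as f:
        f.write(compressed)
    return "file"


def load_pristine(db, spill_dir, checksum):
    """Return the uncompressed content, wherever it was stored."""
    row = db.execute(
        "SELECT content FROM pristine WHERE checksum = ?", (checksum,)
    ).fetchone()
    if row is not None:
        return zlib.decompress(row[0])
    with open(os.path.join(spill_dir, checksum + ".svn-base.z"), "rb") as f:
        return zlib.decompress(f.read())
```

Changing the limit argument is enough to compare thresholds such as the 8k and 32k runs in Branko's log.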
*Let's vote on a design and a plan to get us there.* At this point I can only wait for everyone to catch up, do their tests and run their numbers and vote on what they think will best benefit the community in the long run.
::Subversion Compressed Pristines Simulation::
Disk block-size is 4096 bytes.
Checking out http://svn.apache.org/repos/asf/subversion/trunk/
svn: .svn-base is 44730415 bytes in 1882 files, 23767 average file size. Sub-block waste 3942353 (8.81%)
svn: .svn-base.GZ is 16044735 bytes in 1882 files, 8525 average gz file size. Sub-block waste 4644161 (28.94%)
svn: .svn-base.BZ2 is 15146221 bytes in 1882 files, 8047 average bzip2 file size. Sub-block waste 4563731 (30.13%)
svn: .svn-base.GZ saves 27983872 bytes, .svn-base.BZ2 saves 28962816 bytes (57.49% and 59.50%) respectively.
svn: .svn-base is 48672768 bytes on disk, .svn-base.GZ is 20688896 bytes on disk, .svn-text.BZ2 is 19709952 bytes on disk.
svn: GZ Pack is 14295040 bytes on disk, BZ2 Pack is 13062144 bytes on disk. Saved 34377728 and 35610624 bytes (70.63% and 73.16%) respectively.
svn: GZ pack saves 6393856 bytes (69.09%) compared to .svn-base.GZ.
svn: BZ2 pack saves 6647808 bytes (66.27%) compared to .svn-base.BZ2.
svn: GZ Sorted Pack is 13832192 bytes on disk, BZ2 Sorted Pack is 12128256 bytes on disk. Saved 34840576 and 36544512 bytes (71.58% and 75.08%) respectively.
svn: pristine files: 48672768 bytes on disk.
svn: in-place gz: 20688896 bytes (42.50% of original)
svn: in-place bz2: 19709952 bytes (40.49% of original)
svn: packed gz: 14295040 bytes (29.36% of original)
svn: packed bz2: 13062144 bytes (26.83% of original)
svn: sorted packed gz: 13832192 bytes (28.41% of original)
svn: sorted packed bz2: 12128256 bytes (24.91% of original)
Checking out svn://gcc.gnu.org/svn/gcc/trunk
gcc: .svn-base is 444058945 bytes in 75223 files, 5903 average file size. Sub-block waste 217023167 (48.87%)
gcc: .svn-base.GZ is 137009307 bytes in 75223 files, 1821 average gz file size. Sub-block waste 247084901 (180.34%)
gcc: .svn-base.BZ2 is 127560017 bytes in 75223 files, 1695 average bzip2 file size. Sub-block waste 246343343 (193.11%)
gcc: .svn-base.GZ saves 276987904 bytes, .svn-base.BZ2 saves 287178752 bytes (41.89% and 43.44%) respectively.
gcc: .svn-base is 661082112 bytes on disk, .svn-base.GZ is 384094208 bytes on disk, .svn-text.BZ2 is 373903360 bytes on disk.
gcc: GZ Pack is 109268992 bytes on disk, BZ2 Pack is 91897856 bytes on disk. Saved 551813120 and 569184256 bytes (83.47% and 86.09%) respectively.
gcc: GZ pack saves 274825216 bytes (28.44%) compared to .svn-base.GZ.
gcc: BZ2 pack saves 282005504 bytes (24.57%) compared to .svn-base.BZ2.
gcc: GZ Sorted Pack is 94666752 bytes on disk, BZ2 Sorted Pack is 73170944 bytes on disk. Saved 566415360 and 587911168 bytes (85.68% and 88.93%) respectively.
gcc: pristine files: 661082112 bytes on disk.
gcc: in-place gz: 384094208 bytes (58.10% of original)
gcc: in-place bz2: 373903360 bytes (56.55% of original)
gcc: packed gz: 109268992 bytes (16.52% of original)
gcc: packed bz2: 91897856 bytes (13.90% of original)
gcc: sorted packed gz: 94666752 bytes (14.31% of original)
gcc: sorted packed bz2: 73170944 bytes (11.06% of original)
Checking out http://svn.webkit.org/repository/webkit/trunk
webkit: .svn-base is 1752698337 bytes in 153298 files, 11433 average file size. Sub-block waste 385610271 (22.00%)
webkit: .svn-base.GZ is 1142138194 bytes in 153298 files, 7450 average gz file size. Sub-block waste 456489646 (39.96%)
webkit: .svn-base.BZ2 is 1159101266 bytes in 153298 files, 7561 average bzip2 file size. Sub-block waste 455017646 (39.25%)
webkit: .svn-base.GZ saves 539680768 bytes, .svn-base.BZ2 saves 524189696 bytes (25.23% and 24.51%) respectively.
webkit: .svn-base is 2138308608 bytes on disk, .svn-base.GZ is 1598627840 bytes on disk, .svn-text.BZ2 is 1614118912 bytes on disk.
webkit: GZ Pack is 1112465408 bytes on disk, BZ2 Pack is 1098600448 bytes on disk. Saved 1025843200 and 1039708160 bytes (47.97% and 48.62%) respectively.
webkit: GZ pack saves 486162432 bytes (69.58%) compared to .svn-base.GZ.
webkit: BZ2 pack saves 515518464 bytes (68.06%) compared to .svn-base.BZ2.
webkit: GZ Sorted Pack is 1178726400 bytes on disk, BZ2 Sorted Pack is 1093103616 bytes on disk. Saved 959582208 and 1045204992 bytes (44.87% and 48.87%) respectively.
webkit: pristine files: 2138308608 bytes on disk.
webkit: in-place gz: 1598627840 bytes (74.76% of original)
webkit: in-place bz2: 1614118912 bytes (75.48% of original)
webkit: packed gz: 1112465408 bytes (52.02% of original)
webkit: packed bz2: 1098600448 bytes (51.37% of original)
webkit: sorted packed gz: 1178726400 bytes (55.12% of original)
webkit: sorted packed bz2: 1093103616 bytes (51.12% of original)
Checking out https://svn.apache.org/repos/asf/incubator/ooo/trunk
svn: E175002: REPORT of '/repos/asf/!svn/me': Could not read response body: Secure connection truncated (https://svn.apache.org)
ooo: .svn-base is 963601263 bytes in 61880 files, 15572 average file size. Sub-block waste 144469137 (14.99%)
ooo: .svn-base.GZ is 458977742 bytes in 61880 files, 7417 average gz file size. Sub-block waste 172936754 (37.67%)
ooo: .svn-base.BZ2 is 457007742 bytes in 61880 files, 7385 average bzip2 file size. Sub-block waste 170614146 (37.33%)
ooo: .svn-base.GZ saves 476155904 bytes, .svn-base.BZ2 saves 480448512 bytes (42.97% and 43.35%) respectively.
ooo: .svn-base is 1108070400 bytes on disk, .svn-base.GZ is 631914496 bytes on disk, .svn-text.BZ2 is 627621888 bytes on disk.
ooo: GZ Pack is 429735936 bytes on disk, BZ2 Pack is 418869248 bytes on disk. Saved 678334464 and 689201152 bytes (61.21% and 62.19%) respectively.
ooo: GZ pack saves 202178560 bytes (68.00%) compared to .svn-base.GZ.
ooo: BZ2 pack saves 208752640 bytes (66.73%) compared to .svn-base.BZ2.
ooo: GZ Sorted Pack is 413577216 bytes on disk, BZ2 Sorted Pack is 381874176 bytes on disk. Saved 694493184 and 726196224 bytes (62.67% and 65.53%) respectively.
ooo: pristine files: 1108070400 bytes on disk.
ooo: in-place gz: 631914496 bytes (57.02% of original)
ooo: in-place bz2: 627621888 bytes (56.64% of original)
ooo: packed gz: 429735936 bytes (38.78% of original)
ooo: packed bz2: 418869248 bytes (37.80% of original)
ooo: sorted packed gz: 413577216 bytes (37.32% of original)
ooo: sorted packed bz2: 381874176 bytes (34.46% of original)
Checking out http://core.svn.wordpress.org/trunk/
wo: .svn-base is 11802194 bytes in 957 files, 12332 average file size. Sub-block waste 2341294 (19.83%)
wo: .svn-base.GZ is 5203459 bytes in 957 files, 5437 average gz file size. Sub-block waste 2542077 (48.85%)
wo: .svn-base.BZ2 is 5122614 bytes in 957 files, 5352 average bzip2 file size. Sub-block waste 2512330 (49.04%)
wo: .svn-base.GZ saves 6397952 bytes, .svn-base.BZ2 saves 6508544 bytes (45.23% and 46.01%) respectively.
wo: .svn-base is 14143488 bytes on disk, .svn-base.GZ is 7745536 bytes on disk, .svn-text.BZ2 is 7634944 bytes on disk.
wo: GZ Pack is 4124672 bytes on disk, BZ2 Pack is 3801088 bytes on disk. Saved 10018816 and 10342400 bytes (70.83% and 73.12%) respectively.
wo: GZ pack saves 3620864 bytes (53.25%) compared to .svn-base.GZ.
wo: BZ2 pack saves 3833856 bytes (49.78%) compared to .svn-base.BZ2.
wo: GZ Sorted Pack is 3842048 bytes on disk, BZ2 Sorted Pack is 3493888 bytes on disk. Saved 10301440 and 10649600 bytes (72.83% and 75.29%) respectively.
wo: pristine files: 14143488 bytes on disk.
wo: in-place gz: 7745536 bytes (54.76% of original)
wo: in-place bz2: 7634944 bytes (53.98% of original)
wo: packed gz: 4124672 bytes (29.16% of original)
wo: packed bz2: 3801088 bytes (26.87% of original)
wo: sorted packed gz: 3842048 bytes (27.16% of original)
wo: sorted packed bz2: 3493888 bytes (24.70% of original)
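A note on reading the "pack saves N bytes (P%) compared to ..." lines in the output above: judging by the arithmetic, P appears to be the pack's size as a percentage of the in-place compressed size, not the percentage saved. A small Python check using the svn trunk figures copied from the output (the variable and function names are mine, and the printed percentages seem to be truncated rather than rounded):

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# svn trunk figures, copied from the simulation output
in_place_gz = 20688896  # .svn-base.GZ bytes on disk
packed_gz = 14295040    # GZ Pack bytes on disk

saved = in_place_gz - packed_gz
remaining_pct = pct(packed_gz, in_place_gz)  # size of pack relative to in-place gz
saved_pct = pct(saved, in_place_gz)          # fraction actually saved

if __name__ == "__main__":
    print("bytes saved: %d" % saved)
    print("pack is %.2f%% of in-place gz, i.e. %.2f%% saved"
          % (remaining_pct, saved_pct))
```

The byte counts and the "% of original" summary lines are unaffected; only the percentage on those two "pack saves" lines per project needs to be read this way.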
Cheers,
Ash
>
> Anyway, it's certain that creating this packed format is /not/ the first
> step to take.
>
> -- Brane
>
Received on 2012-03-26 12:47:55 CEST