On 22.03.2012 17:01, Branko Čibej wrote:
> On 22.03.2012 16:50, Daniel Shahaf wrote:
>> Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:
>>> It's called SQLite.
>> Heh. I wondered whether I should mention that the server uses BDB to
>> store pristine files. (yes, the situation there is different in
>> several relevant ways)
> To clarify: I'm /not/ advocating that we store each and every file into
> an SQLite BLOB. Files larger than several block sizes would be better
> off on disk as real files (the compressor can, e.g., buffer compressed
> contents up to, say, 32k, and if they become larger, spill directly into
> a file; otherwise, dump into a BLOB). If we don't care about a shared
> pristine store, we don't even need a separate database; these blobs can
> go into wc.db (which, as Greg points out, also serves as an index).
>
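The buffer-then-spill idea in the quoted paragraph can be sketched roughly
as follows. This is only an illustration, not code from Subversion: the
`pristine` table, the `.z` file suffix, the function names and the 32k
threshold are all assumptions made for the example.

```python
import os
import sqlite3
import zlib

SPILL_THRESHOLD = 32 * 1024  # assumed limit, from the "say, 32k" above


def open_store(db_path):
    """Open (or create) a database with a hypothetical `pristine` table."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pristine "
        "(checksum TEXT PRIMARY KEY, data BLOB)"
    )
    return db


def store_pristine(db, pristine_dir, checksum, chunks,
                   threshold=SPILL_THRESHOLD):
    """Compress `chunks`; keep the result in a BLOB if small, else a file.

    Returns "blob" or "file" so the caller can see where the data went.
    """
    compressor = zlib.compressobj()
    buffered = bytearray()
    spill = None  # becomes an open file once the buffer outgrows threshold

    def emit(data):
        nonlocal spill
        if spill is not None:
            spill.write(data)
            return
        buffered.extend(data)
        if len(buffered) > threshold:
            # Too big for a BLOB: write what we have and stream the rest.
            spill = open(os.path.join(pristine_dir, checksum + ".z"), "wb")
            spill.write(buffered)
            buffered.clear()

    for chunk in chunks:
        emit(compressor.compress(chunk))
    emit(compressor.flush())

    if spill is not None:
        spill.close()
        return "file"
    db.execute(
        "INSERT OR REPLACE INTO pristine (checksum, data) VALUES (?, ?)",
        (checksum, bytes(buffered)),
    )
    db.commit()
    return "blob"
```

The point of the design is that the compressor never needs to know the
final size in advance: it only holds at most one threshold's worth of
compressed data in memory before deciding where the pristine ends up.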
Since we need a few datapoints, I made a quick test to see what kind of
space savings we can get with SQLite. Note that I've not tried any
auto-vacuum settings, because my test only does insertions.
I used a checkout of the current HTTPD trunk for my data set, and
compressed all pristines, then moved them into a SQLite database
depending on size; first, all compressed files 8k or smaller, next, all
compressed files 32k or smaller. Note that my script does not prune
empty directories from the pristine fanout. Here's the log:
brane@zulu:~/src/httpd$ svn co http://svn.apache.org/repos/asf/httpd/httpd/trunk
[...]
U trunk
Checked out revision 1305001.
brane@zulu:~/src/httpd$ find trunk/.svn/pristine -type f | wc -l
3114
brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
42M trunk/.svn/pristine/
time gzip `find ./trunk/.svn/pristine -name '*.svn-base'`
real 0m14.569s
user 0m1.282s
sys 0m0.747s
brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
17M trunk/.svn/pristine/
brane@zulu:~/src/httpd$ find trunk/.svn/pristine -size -8k -type f | wc -l
2856
#
# N.B.: 8k max size per blob
#
brane@zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
real 0m29.683s
user 0m0.533s
sys 0m1.641s
brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
4.7M trunk/.svn/pristine/
brane@zulu:~/src/httpd$ ll trunk/.svn/pristine//pristine.db
-rw-r--r-- 1 brane staff 322560 Mar 25 12:43 ps/pristine.db
#
# N.B.: 32k max size per blob
#
brane@zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
real 0m23.831s
user 0m0.529s
sys 0m1.616s
brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
1.2M trunk/.svn/pristine/
The pristine.py script is attached.
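For readers without the attachment, here is a minimal sketch of what such
a migration step could look like. It is not the attached script: the
function name, the `pristine` table layout and the exact size comparison
are assumptions made for illustration. Like the test described above, it
does not prune empty directories.

```python
import os
import sqlite3


def pack_small_pristines(pristine_dir, max_size=8 * 1024,
                         db_name="pristine.db"):
    """Move every *.gz file of at most `max_size` bytes into an SQLite BLOB.

    Returns the number of files moved.
    """
    db = sqlite3.connect(os.path.join(pristine_dir, db_name))
    db.execute(
        "CREATE TABLE IF NOT EXISTS pristine "
        "(name TEXT PRIMARY KEY, data BLOB)"
    )
    moved = []
    for dirpath, _dirnames, filenames in os.walk(pristine_dir):
        for filename in filenames:
            if not filename.endswith(".gz"):
                continue  # skips the database itself and anything unpacked
            path = os.path.join(dirpath, filename)
            if os.path.getsize(path) > max_size:
                continue  # large files stay on disk as real files
            with open(path, "rb") as f:
                data = f.read()
            db.execute(
                "INSERT OR REPLACE INTO pristine (name, data) VALUES (?, ?)",
                (filename, data),
            )
            moved.append(path)
    # Delete the originals only after the inserts are safely committed.
    db.commit()
    db.close()
    for path in moved:
        os.remove(path)
    return len(moved)
```

Calling it a second time with `max_size=32 * 1024` would pick up the files
that the first pass left behind, which is roughly the two-pass shape of
the log above.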
Based on these observations, it's clear that the implementation should
proceed as follows:
Step 1: Just compress the pristine files, do not use any packing. This
gives a 60% decrease in disk usage in the HTTPD case, but even if the
decrease is only 30%, it's still worth the effort.
Step 2: Store small (for some definition of "small") compressed pristine
files in a SQLite database. In the case of HTTPD, this gives up to an
extra 90% savings in disk usage, but this is a very specific test case and
it's hard to guess what kind of gain we'd get on average.
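As a quick check of the percentages quoted in the two steps, the savings
can be recomputed from the `du -sh` figures in the log. The helper name
is made up for this example, and since `du -sh` rounds its output the
results are only approximate.

```python
def saving(before_mb, after_mb):
    """Fraction of disk space saved when going from before_mb to after_mb."""
    return 1.0 - after_mb / before_mb


# Sizes in MB, as reported by `du -sh` in the log above.
plain = 42.0        # uncompressed pristines
gzipped = 17.0      # after gzip
blobs_8k = 4.7      # after moving files up to 8k into SQLite
blobs_32k = 1.2     # after moving files up to 32k into SQLite

print("step 1 (gzip only):       %.1f%%" % (100 * saving(plain, gzipped)))
print("step 2 extra (8k blobs):  %.1f%%" % (100 * saving(gzipped, blobs_8k)))
print("step 2 extra (32k blobs): %.1f%%" % (100 * saving(gzipped, blobs_32k)))
print("overall (32k blobs):      %.1f%%" % (100 * saving(plain, blobs_32k)))
```

With these inputs, step 1 comes out at roughly 60%, and the extra saving
from step 2 at roughly 72% (8k limit) to 93% (32k limit) relative to the
gzipped size.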
All in all, looking at these numbers, there's a /looong/ way to go before
we start playing with custom pack formats and compression of packed
similar files. I'm not at all sure we'll ever really need the potential
space savings of these methods, especially compared to the obvious risk
to WC stability that writing and testing such code brings.
Anyway, it's certain that creating this packed format is /not/ the first
step to take.
-- Brane
Received on 2012-03-26 07:27:05 CEST