>
> Representation cache is based on the sha of the rep. So it does not
> matter what the filename is or where it is stored. If it has the same
> sha as an existing rep, then it will be shared.
>
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
>
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.
>
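
To check my own understanding of the above, here is a toy Python model of a cache keyed on the SHA-1 of a rep. It is only a sketch and not Subversion's actual code: the class RepCache, its store() method and the integer "location" are names I made up for illustration.

```python
import hashlib


class RepCache:
    """Toy model: map the SHA-1 of a rep's bytes to where it was stored."""

    def __init__(self):
        self._by_sha1 = {}      # sha1 hex digest -> hypothetical storage offset
        self._next_offset = 0   # where the next new rep would be written

    def store(self, rep: bytes):
        """Return (location, shared); shared is True if an existing rep was reused."""
        key = hashlib.sha1(rep).hexdigest()
        if key in self._by_sha1:
            # Same checksum as an existing rep: reuse it, whatever the path.
            return self._by_sha1[key], True
        location = self._next_offset
        self._by_sha1[key] = location
        self._next_offset += len(rep)
        return location, False
```

The filename never enters the lookup, which is why the path and location of a file do not matter for sharing.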
One improvement I'd like to suggest is that files over 1 MiB (4? 8?)
be "chunked" before rep-sharing is calculated.

http://blog.clearpathsg.com/blog/bid/254076/Understanding-Variable-Length-Deduplication

My thinking is that there might be storage gains to be made if
rep-sharing is done at a lower level than the file level for
files over a particular size. For instance, if you commit a few hundred
mid-size files (5-15 MB or larger), there is probably a lot of
identical data between them (if the files are not already compressed).
Those identical chunks could possibly be found via a variable-length
deduplication algorithm and deduped across the repository.

IIRC when I moved our repos from 1.6 to 1.8 format, space usage went
down by 10-15% from rep-sharing. I wouldn't mind having another 5-10%
space savings.
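
To make the chunking idea a bit more concrete, here is a rough Python sketch. It is not how Subversion works and not the algorithm from the linked article: I am assuming a simple gear-style rolling hash to pick chunk boundaries, and MIN_CHUNK, AVG_MASK, MAX_CHUNK, chunk() and dedup_store() are all made-up names and values.

```python
import hashlib

# Hypothetical parameters, chosen only for illustration.
MIN_CHUNK = 2 * 1024        # never cut a chunk shorter than this
AVG_MASK = (1 << 13) - 1    # cut roughly every 8 KiB on random-looking data
MAX_CHUNK = 64 * 1024       # always cut once a chunk reaches this size

# Deterministic table of 256 pseudo-random 64-bit values for the gear hash.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
    for i in range(256)
]


def chunk(data: bytes):
    """Yield variable-length chunks whose cut points are chosen from content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes are shifted out of the low bits over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (h & AVG_MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


def dedup_store(files, store=None):
    """Store each distinct chunk once, keyed by SHA-1.

    Returns (store, recipes), where recipes maps each name to the list of
    chunk keys needed to rebuild that file.
    """
    store = {} if store is None else store
    recipes = {}
    for name, data in files.items():
        keys = []
        for piece in chunk(data):
            key = hashlib.sha1(piece).hexdigest()
            store.setdefault(key, piece)  # identical chunks are kept once
            keys.append(key)
        recipes[name] = keys
    return store, recipes
```

Because the cut points depend on the bytes themselves rather than on fixed offsets, two files that share a long run of data tend to produce the same chunks for that run even when it starts at different offsets in each file. Comparing the total size of the values in store against the combined file sizes gives a rough idea of what such a scheme might save.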
Received on 2014-12-03 16:48:09 CET