Re: Efficiency of rep-sharing (deduplication) in 1.8 and later (chunking?)

From: Thomas Harold <thomas-lists_at_nybeta.com>
Date: Wed, 03 Dec 2014 10:46:07 -0500

>
> Representation cache is based on the sha of the rep. So it does not
> matter what the filename is or where it is stored. If it has the same
> sha as an existing rep, then it will be shared.
>
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
>
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.
>
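To make sure I have the mechanism described above right, here is a toy sketch of a cache keyed only by the hash of the content. It is not Subversion code: the `RepCache` class and its method names are made up for illustration, and using SHA-1 as "the sha" is my assumption.

```python
import hashlib


class RepCache:
    """Toy content-addressed store (hypothetical, not the FSFS implementation)."""

    def __init__(self):
        self._by_digest = {}   # hex digest -> stored bytes
        self.stored_bytes = 0  # bytes actually written to the store

    def add(self, data: bytes) -> str:
        """Store data unless identical content is already present."""
        digest = hashlib.sha1(data).hexdigest()
        if digest not in self._by_digest:
            self._by_digest[digest] = data
            self.stored_bytes += len(data)
        return digest


cache = RepCache()
# Same content under two different "paths": the path never enters the key.
rep_a = cache.add(b"same content")
rep_b = cache.add(b"same content")
rep_c = cache.add(b"other content")
```

The second `add` returns the same key as the first and stores nothing new, which is the sharing behaviour as I understand it.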

One improvement that I'd like to suggest is that files over 1MiB (4? 8?)
be "chunked" prior to calculating rep-sharing.

http://blog.clearpathsg.com/blog/bid/254076/Understanding-Variable-Length-Deduplication

My thinking is that there might be storage gains to be made if
rep-sharing is done at a lower level than the file level for files
over a particular size. For instance, if you commit a few hundred
mid-size files (5-15MB or larger), there is probably a lot of
identical data between them (if the files are not already compressed).
Those identical chunks could possibly be found via a variable-length
deduplication algorithm and deduped across the repository.
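
A rough sketch of what I mean by variable-length chunking, for illustration only. Nothing here is Subversion code or a proposed patch: the gear-style rolling hash, the `split_chunks` and `ChunkStore` names, and the tiny chunk sizes are all assumptions picked so the demo runs quickly. A real implementation would use chunk sizes in the KiB-to-MiB range.

```python
import hashlib

# Toy parameters (assumptions for the demo, far smaller than real-world values).
MIN_CHUNK = 64     # never cut a chunk shorter than this
MAX_CHUNK = 1024   # always cut once a chunk reaches this length
CUT_BITS = 8       # cut when the top 8 hash bits are zero (about 1 in 256)

# Per-byte table for the gear-style rolling hash, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha1(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def split_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries.

    The 32-bit hash only depends on the last 32 bytes, so a boundary is
    decided by nearby content rather than by the offset within the file.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i + 1 - start
        at_pattern = (h >> (32 - CUT_BITS)) == 0
        if (length >= MIN_CHUNK and at_pattern) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class ChunkStore:
    """Toy chunk-level dedup store keyed by the SHA-1 of each chunk."""

    def __init__(self):
        self.chunks = {}
        self.stored_bytes = 0

    def add_file(self, data: bytes) -> list:
        """Store a file and return its recipe (ordered list of chunk digests)."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk
                self.stored_bytes += len(chunk)
            recipe.append(digest)
        return recipe

    def read_file(self, recipe: list) -> bytes:
        return b"".join(self.chunks[d] for d in recipe)


def filler(label: bytes, n: int) -> bytes:
    """Deterministic incompressible-looking test data."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(label + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:n])


# Two files with different headers of different lengths but a large common body.
shared = filler(b"shared", 64 * 1024)
file_a = filler(b"header-a", 1000) + shared
file_b = filler(b"header-b", 3000) + shared

store = ChunkStore()
recipe_a = store.add_file(file_a)
recipe_b = store.add_file(file_b)
print(len(file_a) + len(file_b), "bytes in,", store.stored_bytes, "bytes stored")
```

The two files have different whole-file hashes, so file-level sharing saves nothing here, while the chunk store should only need to keep most of the common body once. Because the cut points follow the content, the differing header lengths do not stop the chunks in the common body from lining up again.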

IIRC when I moved our repos from 1.6 to 1.8 format, space usage went
down by 10-15% from rep-sharing. I wouldn't mind having another 5-10%
space savings.
Received on 2014-12-03 16:48:09 CET
