
Re: Efficiency of rep-sharing (deduplication) in 1.8 and later

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Sat, 6 Dec 2014 11:17:05 +0000

Mark Phippard wrote on Fri, Sep 12, 2014 at 11:24:43 -0400:
> On Fri, Sep 12, 2014 at 11:17 AM, Thomas Harold <thomas-lists_at_nybeta.com>
> wrote:
>
> > I have a question about how efficient SVN is at de-duplication within a
> > repository with regards to files that appear in multiple locations, but
> > which have the same content.
> >
> > I know a small improvement was made in 1.8...
> >
> > http://subversion.apache.org/docs/release-notes/1.8.html#fsfs-enhancements
> >
> > > When representation sharing has been enabled, Subversion 1.8 will now
> > > be able to detect files and properties with identical contents within
> > > the same revision and only store them once. This is a common
> > > situation when you for instance import a non-incremental dump file or
> > > when users apply the same change to multiple branches in a single
> > > commit.
> >
> > #1 - If a commit puts files A, B and C into the repository, and a later
> > commit puts files B, C and D into the repository at a different
> > location, is SVN smart enough to realize that B and C are already stored
> > in the repository?
> >
> > In other words, does it track each individual file separately, even if
> > they were all part of one big revision?
> >
>
> Representation cache is based on the sha of the rep. So it does not matter
> what the filename is or where it is stored. If it has the same sha as an
> existing rep, then it will be shared.
>
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
>
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.

The sha1 of the rep itself doesn't matter. The rep-cache.db file is a
cache of (sha1 of fulltext ↦ location of rep generating that fulltext).
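
To illustrate that mapping, here is a toy Python sketch. It is not FSFS
code: the RepCache class, its store() method and the (revision, rep id)
tuple used as a "location" are all made up for the example. The only
point it makes is that the lookup key is the sha1 of the fulltext, so
the path and the revision in which the content first appeared play no
part in the lookup.

```python
import hashlib


class RepCache:
    """Toy model of a fulltext-keyed rep cache (hypothetical, not FSFS code)."""

    def __init__(self):
        # sha1 hex digest of the fulltext -> location of the rep producing it.
        # A "location" here is just a made-up (revision, rep_id) tuple.
        self._by_sha1 = {}
        self._next_rep_id = 0

    def store(self, revision, fulltext):
        """Return (location, shared) for the given fulltext bytes."""
        key = hashlib.sha1(fulltext).hexdigest()
        hit = self._by_sha1.get(key)
        if hit is not None:
            # Same fulltext seen before: reuse the existing rep.
            return hit, True
        # New fulltext: pretend to write a rep and remember where it is.
        location = (revision, self._next_rep_id)
        self._next_rep_id += 1
        self._by_sha1[key] = location
        return location, False


cache = RepCache()
# r1 adds file B; r2 adds the same content elsewhere, plus a new file D.
loc_b1, shared_b1 = cache.store(1, b"contents of B\n")
loc_b2, shared_b2 = cache.store(2, b"contents of B\n")
loc_d, shared_d = cache.store(2, b"contents of D\n")
```

In this sketch the second store() of B's content returns the location
recorded in r1 instead of allocating a new rep, while D gets a rep of
its own.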

As to the idea of doing the sha1 at chunk level rather than at file
level: I suggest discussing that on dev@. Some backend devs might
otherwise miss the discussion.

Cheers,

Daniel
Received on 2014-12-06 12:22:05 CET

This is an archived mail posted to the Subversion Users mailing list.
