Re: Blue-sky idea: Representation reuse

From: Branko Čibej <brane_at_xbc.nu>
Date: 2002-10-09 20:20:45 CEST

Greg Hudson wrote:

>Here's an idea for how to reap space savings for some kinds of use
>cases. Basically, a version control system tries to save repository
>space in two ways:
>
> 1. If two files are identical, only store one copy.
> 2. If two files are similar, store one as a diff against the other.
>
>Like most revision control systems, we rely on a file's history to
>find these advantages. We only look for similarity or identity
>between a file and its most recent revision (or its copy history).
>So, for example, if you do:
>
> svn import http://a/b linux-1.0 linux-1.0
> svn import http://a/b linux-1.1 linux-1.1
> ...
>
>you won't get any space savings except from self-compression.
>(Instead, you have to do all your imports to the same place and
>interleave those with copies.) Similarly, right now if you branch a
>file, make extensive modifications to the file on one branch, and then
>merge those mods onto the other branch, the repository records those
>diffs twice instead of reusing the representation.
>
>But it doesn't have to be that way. It's hard to look for similarity
>with random other files in the repository, but it's easy to look for
>identity. We can maintain a mapping from checksum to representation.
>When you store a representation, if you find a match in the checksum
>table, you verify that it's the same contents and use that rep
>instead.
>
>There are a few complications:
>
> * Right now, when we abort a transaction, we remove all the reps it
> refers to in mutable nodes. We'll need to keep track of when we
> reused a representation to avoid destroying existing data.
>
That's a "simple" matter of adding a reference count to the rep.

> * We'll have to think about how redeltification interacts with
> representation reuse.
>
Hah, yes, redeltifying along different history paths that just happen to
use the same representation could be fun. :-)
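
For illustration only, here is a minimal Python sketch of the checksum
table plus the reference count discussed above. It is a toy in-memory
model, not Subversion's filesystem code: the RepStore class, its method
names, and the choice of SHA-1 are all assumptions made for the example.
A checksum hit is treated only as a hint and the contents are compared
before a rep is reused; releasing a reference (as an aborted transaction
would) deletes the rep only when nothing else refers to it.

```python
import hashlib


class RepStore:
    """Toy in-memory store (hypothetical): checksum -> rep, with refcounts."""

    def __init__(self):
        self.reps = {}         # rep_id -> contents (bytes)
        self.refcount = {}     # rep_id -> number of nodes referring to it
        self.by_checksum = {}  # hex digest -> list of rep_ids with that digest
        self.next_id = 0

    def store(self, contents):
        """Return a rep id for contents, reusing an identical rep if present."""
        digest = hashlib.sha1(contents).hexdigest()
        for rep_id in self.by_checksum.get(digest, []):
            # A checksum match is only a hint; verify the actual contents.
            if self.reps[rep_id] == contents:
                self.refcount[rep_id] += 1
                return rep_id
        rep_id = self.next_id
        self.next_id += 1
        self.reps[rep_id] = contents
        self.refcount[rep_id] = 1
        self.by_checksum.setdefault(digest, []).append(rep_id)
        return rep_id

    def release(self, rep_id):
        """Drop one reference (e.g. on transaction abort); delete at zero."""
        self.refcount[rep_id] -= 1
        if self.refcount[rep_id] > 0:
            return  # still in use elsewhere, so keep the data
        contents = self.reps.pop(rep_id)
        del self.refcount[rep_id]
        digest = hashlib.sha1(contents).hexdigest()
        self.by_checksum[digest].remove(rep_id)
        if not self.by_checksum[digest]:
            del self.by_checksum[digest]


store = RepStore()
first = store.store(b"same contents")
second = store.store(b"same contents")  # reuses the existing rep
store.release(second)                   # an abort drops only its own reference
```

The sketch says nothing about redeltification; a shared rep that sits
on two history paths would still need the extra thought mentioned above.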

>This idea shouldn't be too hard to implement, but it's not clear how
>much it would help the average user. So it's up in the air whether
>it's worth the complexity.
>
Some repositories would see a huge benefit; others wouldn't see any. Maybe
it would be best to make this optional, to be turned on when you create
a repository. "svnadmin create --squashed repo".

>An even more blue-sky idea, which Mike came up with, is to have a
>background daemon running around looking for unexploited rep
>similarities and redeltifying things. I think this is too hard
>(there's no good heuristic for finding similar files except by doing
>O(n^2) comparisons... though you might be able to play games with file
>sizes or statistical characteristics of the contents, I suppose), but
>it's an option.
>
This would probably be more useful as an offline utility; "svnadmin
compress"?
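
As a rough illustration of "playing games with file sizes", the Python
sketch below groups reps into power-of-two size buckets and proposes
only pairs within the same bucket, so that the expensive comparison is
not run on all O(n^2) pairs. The function name and the bucketing rule
are assumptions for the example, not an existing svnadmin feature. It
is only a pre-filter: it misses similar files whose sizes straddle a
bucket boundary, and every candidate pair still needs a real delta
comparison.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(sizes):
    """Yield pairs of rep ids whose sizes share a power-of-two bucket.

    sizes maps a (hypothetical) rep id to its size in bytes.
    """
    buckets = defaultdict(list)
    for rep_id, size in sizes.items():
        # bit_length() is 0 for size 0, and k for sizes in [2**(k-1), 2**k).
        buckets[size.bit_length()].append(rep_id)
    for ids in buckets.values():
        # Only reps of roughly the same size are proposed for comparison.
        yield from combinations(sorted(ids), 2)


sizes = {"a": 1000, "b": 1010, "c": 5, "d": 1_000_000, "e": 6}
pairs = sorted(candidate_pairs(sizes))
```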

-- 
Brane Čibej   <brane_at_xbc.nu>   http://www.xbc.nu/brane/
Received on Wed Oct 9 20:21:50 2002

This is an archived mail posted to the Subversion Dev mailing list.