
Blue-sky idea: Representation reuse

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-10-09 19:10:13 CEST

Here's an idea for how to reap space savings for some kinds of use
cases. Basically, a version control system tries to save repository
space in two ways:

  1. If two files are identical, only store one copy.
  2. If two files are similar, store one as a diff against the other.

Like most revision control systems, we rely on a file's history to
find these savings. We only look for similarity or identity
between a file and its most recent revision (or its copy history).
So, for example, if you do:

  svn import http://a/b linux-1.0 linux-1.0
  svn import http://a/b linux-1.1 linux-1.1
  ...

you won't get any space savings except from self-compression.
(Instead, you have to do all your imports to the same place and
interleave those with copies.) Similarly, right now if you branch a
file, make extensive modifications to the file on one branch, and then
merge those mods onto the other branch, the repository records those
diffs twice instead of reusing the representation.

But it doesn't have to be that way. It's hard to look for similarity
with random other files in the repository, but it's easy to look for
identity. We can maintain a mapping from checksum to representation.
When you store a representation, you look up its checksum in the
table; if there is a match, you verify that the contents really are
the same and use the existing rep instead of storing a new one.
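
A minimal sketch of that lookup, in Python rather than in terms of the
real filesystem code. The RepStore class, its in-memory dicts, and the
choice of SHA-1 as the checksum are all assumptions made for
illustration; none of them come from the Subversion source.

```python
import hashlib


class RepStore:
    """Toy in-memory representation store with checksum-based reuse.

    Hypothetical illustration only; not the Subversion filesystem API.
    """

    def __init__(self):
        self.reps = {}         # rep key -> contents (bytes)
        self.by_checksum = {}  # checksum hex digest -> list of rep keys
        self.next_key = 0

    def store(self, contents):
        """Return (rep_key, reused) for the given contents."""
        digest = hashlib.sha1(contents).hexdigest()
        for key in self.by_checksum.get(digest, []):
            # A checksum match is only a hint, so verify the actual
            # contents before reusing the existing rep.
            if self.reps[key] == contents:
                return key, True
        # No identical rep found: store a new one and index it.
        key = self.next_key
        self.next_key += 1
        self.reps[key] = contents
        self.by_checksum.setdefault(digest, []).append(key)
        return key, False
```

Storing the same bytes twice through store() should hand back the same
key the second time, with the reused flag set, and leave only one copy
in the table.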

There are a few complications:

  * Right now, when we abort a transaction, we remove all the reps it
    refers to in mutable nodes. We'll need to keep track of when we
    reused a representation to avoid destroying existing data.

  * We'll have to think about how redeltification interacts with
    representation reuse.
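
One possible way to handle the first complication, sketched in Python
under simplifying assumptions: each transaction remembers which reps
it created itself, and abort removes only those. RepTable,
Transaction, and their methods are hypothetical names, and the linear
scan in find() stands in for a real checksum lookup.

```python
import itertools


class RepTable:
    """Hypothetical shared table of representations."""

    def __init__(self):
        self.reps = {}                  # rep key -> contents (bytes)
        self._keys = itertools.count()  # never hands out a key twice

    def find(self, contents):
        # Linear scan for brevity; a checksum index would go here.
        for key, existing in self.reps.items():
            if existing == contents:
                return key
        return None

    def create(self, contents):
        key = next(self._keys)
        self.reps[key] = contents
        return key


class Transaction:
    """Tracks which reps this transaction created versus reused."""

    def __init__(self, table):
        self.table = table
        self.created = set()

    def store(self, contents):
        key = self.table.find(contents)
        if key is None:
            key = self.table.create(contents)
            self.created.add(key)  # only new reps are ours to delete
        return key

    def commit(self):
        # Committed reps become permanent; nothing left to clean up.
        self.created.clear()

    def abort(self):
        # Remove only the reps this transaction created, never reused ones.
        for key in self.created:
            del self.table.reps[key]
        self.created.clear()
```

This sketch ignores the case where one transaction reuses a rep created
by another transaction that has not committed yet. Handling that would
probably need reference counts, or a rule that only committed reps are
eligible for reuse.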

This idea shouldn't be too hard to implement, but it's not clear how
much it would help the average user. So it's up in the air whether
it's worth the complexity.

An even more blue-sky idea, which Mike came up with, is to have a
background daemon running around looking for unexploited rep
similarities and redeltifying things. I think this is too hard
(there's no good heuristic for finding similar files except by doing
O(n^2) comparisons... though you might be able to play games with file
sizes or statistical characteristics of the contents, I suppose), but
it's an option.
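
To make the file-size game concrete, here is a rough Python sketch that
narrows the O(n^2) comparison down to pairs of reps with similar sizes.
The function name, the sizes mapping, and the 10% tolerance are
assumptions for illustration. Similar size says nothing by itself about
similar contents, so this only picks candidates for a real comparison.

```python
def candidate_pairs(sizes, tolerance=0.1):
    """Yield (name_a, name_b) pairs with similar sizes.

    `sizes` maps a hypothetical rep name to its size in bytes. Two reps
    are paired when their sizes differ by at most `tolerance` times the
    larger size.
    """
    ordered = sorted(sizes.items(), key=lambda item: item[1])
    for i, (name_a, size_a) in enumerate(ordered):
        for name_b, size_b in ordered[i + 1:]:
            if size_b - size_a > tolerance * size_b:
                # Sizes are sorted, so every later rep is too big as well.
                break
            yield name_a, name_b
```

In the worst case (all reps about the same size) this is still
quadratic, so it is a pruning trick rather than a fix.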

Received on Wed Oct 9 19:10:48 2002

This is an archived mail posted to the Subversion Dev mailing list.