
RE: Blue-sky idea: Representation reuse

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-10-09 20:11:05 CEST

Yeah, I've thought along similar lines before. Having representation
reuse for full texts alone would be a good thing. However, we don't have
a handy index on the representation checksums, let alone an easy
way to join from a Representation to the set of NodeRevisions that use
it. The lack of that easy join is one of the blocking issues
before we can even consider implementing an "svn obliterate", for example.
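
To make the two missing lookups concrete, here is a minimal sketch in
Python. It is not Subversion's actual schema: two in-memory dicts stand
in for the tables, the names RepIndex, by_checksum and users are
hypothetical, and SHA-1 is an arbitrary checksum chosen for illustration.

```python
import hashlib
from collections import defaultdict


class RepIndex:
    """Toy stand-in for the two lookups discussed above (hypothetical names)."""

    def __init__(self):
        # checksum -> set of representation keys whose full text has that checksum
        self.by_checksum = defaultdict(set)
        # representation key -> set of node-revision ids that refer to it
        self.users = defaultdict(set)

    def add_rep(self, rep_key, fulltext):
        """Record a representation under the checksum of its full text."""
        checksum = hashlib.sha1(fulltext).hexdigest()
        self.by_checksum[checksum].add(rep_key)
        return checksum

    def attach(self, rep_key, noderev_id):
        """Record that a node-revision uses this representation."""
        self.users[rep_key].add(noderev_id)

    def reps_with_checksum(self, checksum):
        """The 'index against checksums': candidate reps for reuse."""
        return set(self.by_checksum.get(checksum, ()))

    def noderevs_using(self, rep_key):
        """The 'join' an obliterate would need: who still uses this rep?"""
        return set(self.users.get(rep_key, ()))


index = RepIndex()
checksum = index.add_rep("rep-1", b"hello\n")
index.attach("rep-1", "noderev-a")
index.attach("rep-1", "noderev-b")
```

With both maps maintained on every write, "which NodeRevisions use this
Representation?" becomes a single lookup instead of a scan over all
node-revisions.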

Certainly something to keep in the back of our minds when deciding what
to do for the post-1.0 schema.

Bill

----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.
 
> -----Original Message-----
> From: Greg Hudson [mailto:ghudson@MIT.EDU]
> Sent: Wednesday, October 09, 2002 10:10 AM
> To: dev@subversion.tigris.org
> Subject: Blue-sky idea: Representation reuse
> 
> Here's an idea for how to reap space savings for some kinds of use
> cases.  Basically, a version control system tries to save repository
> space in two ways:
> 
>   1. If two files are identical, only store one copy.
>   2. If two files are similar, store one as a diff against the other.
> 
> Like most revision control systems, we rely on a file's history to
> find these advantages.  We only look for similarity or identity
> between a file and its most recent revision (or its copy history).
> So, for example, if you do:
> 
>   svn import http://a/b linux-1.0 linux-1.0
>   svn import http://a/b linux-1.1 linux-1.1
>   ...
> 
> you won't get any space savings except from self-compression.
> (Instead, you have to do all your imports to the same place and
> interleave those with copies.)  Similarly, right now if you branch a
> file, make extensive modifications to the file on one branch, and then
> merge those mods onto the other branch, the repository records those
> diffs twice instead of reusing the representation.
> 
> But it doesn't have to be that way.  It's hard to look for similarity
> with random other files in the repository, but it's easy to look for
> identity.  We can maintain a mapping from checksum to representation.
> When you store a representation, if you find a match in the checksum
> table, you verify that it's the same contents and use that rep
> instead.
> 
> There are a few complications:
> 
>   * Right now, when we abort a transaction, we remove all the reps it
>     refers to in mutable nodes.  We'll need to keep track of when we
>     reused a representation to avoid destroying existing data.
> 
>   * We'll have to think about how redeltification interacts with
>     representation reuse.
> 
> This idea shouldn't be too hard to implement, but it's not clear how
> much it would help the average user.  So it's up in the air whether
> it's worth the complexity.
> 
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things.  I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org
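
The checksum-table scheme in the quoted message, together with its first
complication (aborting a transaction must not destroy reused data), can
be sketched roughly as follows. This is an illustration under stated
assumptions, not how the Subversion filesystem is implemented: RepStore,
store and abort are hypothetical names, representations are plain byte
strings held in memory, and SHA-1 is an arbitrary checksum.

```python
import hashlib


class RepStore:
    """Toy representation store with identity-based reuse (hypothetical)."""

    def __init__(self):
        self.reps = {}         # rep key -> contents
        self.by_checksum = {}  # checksum -> rep key
        self.next_id = 0

    def store(self, contents, txn_created):
        """Return a rep key for contents, reusing an identical rep if one exists.

        txn_created is the set of rep keys this transaction created itself;
        reused reps are deliberately NOT added to it.
        """
        checksum = hashlib.sha1(contents).hexdigest()
        existing = self.by_checksum.get(checksum)
        # A checksum match is only a hint: verify the contents before reusing.
        if existing is not None and self.reps[existing] == contents:
            return existing
        key = f"rep-{self.next_id}"
        self.next_id += 1
        self.reps[key] = contents
        self.by_checksum.setdefault(checksum, key)
        txn_created.add(key)
        return key

    def abort(self, txn_created):
        """Remove only the reps this transaction created, never reused ones."""
        for key in txn_created:
            contents = self.reps.pop(key)
            checksum = hashlib.sha1(contents).hexdigest()
            if self.by_checksum.get(checksum) == key:
                del self.by_checksum[checksum]
        txn_created.clear()


store = RepStore()
committed = set()
first = store.store(b"linux\n", committed)   # pre-existing data

txn = set()
second = store.store(b"linux\n", txn)        # identical contents: reused
third = store.store(b"new file\n", txn)      # new contents: created by txn
store.abort(txn)                             # drops `third`, keeps `first`
```

The sketch ignores the case where the reused rep belongs to another
transaction that is itself still uncommitted, and it says nothing about
redeltification; both would need real design work.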
Received on Wed Oct 9 20:11:54 2002

This is an archived mail posted to the Subversion Dev mailing list.
