RE: Blue-sky idea: Representation reuse
From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-10-09 20:11:05 CEST
Yeah, I've thought along similar things before. Having representation
Certainly something to keep in the back of our minds when deciding what
Bill
----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.

> -----Original Message-----
> From: Greg Hudson [mailto:ghudson@MIT.EDU]
> Sent: Wednesday, October 09, 2002 10:10 AM
> To: dev@subversion.tigris.org
> Subject: Blue-sky idea: Representation reuse
>
> Here's an idea for how to reap space savings for some kinds of use
> cases. Basically, a version control system tries to save repository
> space in two ways:
>
>   1. If two files are identical, only store one copy.
>   2. If two files are similar, store one as a diff against the other.
>
> Like most revision control systems, we rely on a file's history to
> find these advantages. We only look for similarity or identity
> between a file and its most recent revision (or its copy history).
> So, for example, if you do:
>
>   svn import http://a/b linux-1.0 linux-1.0
>   svn import http://a/b linux-1.1 linux-1.1
>   ...
>
> you won't get any space savings except from self-compression.
> (Instead, you have to do all your imports to the same place and
> interleave those with copies.) Similarly, right now if you branch a
> file, make extensive modifications to the file on one branch, and then
> merge those mods onto the other branch, the repository records those
> diffs twice instead of reusing the representation.
>
> But it doesn't have to be that way. It's hard to look for similarity
> with random other files in the repository, but it's easy to look for
> identity. We can maintain a mapping from checksum to representation.
> When you store a representation, if you find a match in the checksum
> table, you verify that it's the same contents and use that rep
> instead.
>
> There are a few complications:
>
>   * Right now, when we abort a transaction, we remove all the reps it
>     refers to in mutable nodes. We'll need to keep track of when we
>     reused a representation to avoid destroying existing data.
>
>   * We'll have to think about how redeltification interacts with
>     representation reuse.
>
> This idea shouldn't be too hard to implement, but it's not clear how
> much it would help the average user. So it's up in the air whether
> it's worth the complexity.
>
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things. I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Wed Oct 9 20:11:54 2002
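
For readers of the archive, the checksum-to-representation mapping described in the quoted message can be sketched roughly as below. This is only an illustration, not Subversion's code: the `RepStore` class, its in-memory dictionaries, the use of SHA-1 as the checksum, and the reference count used to handle the aborted-transaction complication are all assumptions made for the example.

```python
import hashlib


class RepStore:
    """Toy in-memory representation store with checksum-based reuse.

    Hypothetical sketch: names and data layout are illustrative only.
    """

    def __init__(self):
        self.reps = {}         # rep key -> contents (bytes)
        self.by_checksum = {}  # checksum hex digest -> rep key
        self.refcount = {}     # rep key -> number of users of that rep
        self._next_key = 0

    def store(self, contents: bytes) -> int:
        """Store contents, reusing an identical existing rep if one exists."""
        digest = hashlib.sha1(contents).hexdigest()
        key = self.by_checksum.get(digest)
        # A checksum match is only a hint: verify the contents before reuse.
        if key is not None and self.reps[key] == contents:
            self.refcount[key] += 1
            return key
        key = self._next_key
        self._next_key += 1
        self.reps[key] = contents
        self.refcount[key] = 1
        self.by_checksum.setdefault(digest, key)
        return key

    def release(self, key: int) -> None:
        """Drop one reference (e.g. on transaction abort).

        The rep is deleted only when nothing else still uses it, so an
        abort cannot destroy data that was reused from an earlier commit.
        """
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            contents = self.reps.pop(key)
            del self.refcount[key]
            digest = hashlib.sha1(contents).hexdigest()
            if self.by_checksum.get(digest) == key:
                del self.by_checksum[digest]
```

Storing the same bytes twice returns the same key and keeps a single copy; releasing one of the two references leaves the data in place. Redeltification is not modeled here at all.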
This is an archived mail posted to the Subversion Dev mailing list.