* Greg Hudson (ghudson@MIT.EDU) [021009 12:12]:
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things. I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.
There's a much better way than O(n^2) as long as you have a signature
function S() that suits your needs (md5, nilsimsa, soundex :-), etc.).
Given a set of n files (strings) F = { f_1, ..., f_n }, for each file f_i
in F compute the signature s_i = S(f_i) and store it in a hash table,
keyed on s_i, along with the index i. Whenever an insert into the hash
finds that some prior file f_j already has the same signature, merge
f_i and f_j in subversion.
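A minimal sketch of that loop, using md5 as a stand-in for S() (any of
the signature functions above would slot in the same way; the function
and variable names here are just for illustration):

```python
import hashlib

def find_duplicates(files):
    """Group files by signature in one pass over F:
    n hash inserts instead of O(n^2) pairwise comparisons."""
    seen = {}    # signature s_i -> index of first file with that signature
    merges = []  # (i, j) pairs to hand off to the actual merge step
    for i, content in enumerate(files):
        sig = hashlib.md5(content).hexdigest()  # s_i = S(f_i)
        if sig in seen:
            merges.append((i, seen[sig]))  # f_i duplicates some prior f_j
        else:
            seen[sig] = i
    return merges
```

So find_duplicates([b"foo", b"bar", b"foo"]) reports the single merge
pair (2, 0). Note that an exact signature like md5 only catches
identical reps; catching near-duplicates needs a similarity-preserving
signature such as nilsimsa.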
This has a naive run-time of O(n * (T(hash) + T(S))), excluding the
cost of the actual subversion merge, where T(hash) is the amortized
run-time of a hash lookup/insert operation and T(S) is the time to
compute one signature S(f). Given a reasonable implementation of the
underlying hash data structure and a sensible hashing algorithm, this
looks a lot like O(n) time. Note that I'm glossing over a hidden
union-find problem in tracking repeated merges, but this will not
significantly affect the run-time, and it's not clear to me at the
moment that union-find is even necessary for repeated merging in
subversion if repository images are processed in temporal order.
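For reference, the union-find step alluded to above would only take a
few lines; this is a generic sketch (not anything subversion-specific)
of how repeated merges could be tracked in near-constant amortized time
per operation:

```python
def make_dsu(n):
    """Disjoint-set union over items 0..n-1, for tracking which
    files have already been merged into the same group."""
    parent = list(range(n))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        # Merge the groups containing x and y.
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    return find, union
```

After union(0, 1), union(2, 3), and union(1, 3), all four items share
one root, so a later duplicate of any of them merges into the same
group rather than triggering a redundant merge.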
Rick
--
http://www.rickbradley.com MUPRN: 862 (68F/68F)
| some of the answers
random email haiku | I received from the research
| side of the business.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Oct 9 20:58:21 2002