* Greg Hudson (ghudson@MIT.EDU) [021009 12:12]:
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things. I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.
There's a much better way than O(n^2) as long as you have a signature
function S() that suits your needs (md5, nilsimsa, soundex :-), etc.).
Given a set of n files (strings) F = { f_1, ..., f_n }, for each file f_i
in F compute the signature s_i = S(f_i) and store it in a hash table,
keyed on s_i, along with the index i. Whenever an insert into the hash
finds that some prior file f_j already has the same signature, merge
f_i and f_j in subversion.
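A minimal sketch of that loop, using md5 as a stand-in for S() (any of
the signature functions above would slot in the same way; the function
and variable names here are just for illustration):

```python
import hashlib

def find_duplicates(files):
    """Group files by signature in one pass over F:
    n hash inserts instead of O(n^2) pairwise comparisons."""
    seen = {}    # signature s_i -> index of first file with that signature
    merges = []  # (i, j) pairs to hand off to the actual merge step
    for i, content in enumerate(files):
        sig = hashlib.md5(content).hexdigest()  # s_i = S(f_i)
        if sig in seen:
            merges.append((i, seen[sig]))  # f_i duplicates some prior f_j
        else:
            seen[sig] = i
    return merges
```

So find_duplicates([b"foo", b"bar", b"foo"]) reports the single merge
pair (2, 0). Note that an exact signature like md5 only catches
identical reps; catching near-duplicates needs a similarity-preserving
signature such as nilsimsa.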
This has a naive run-time of O(n * (T(hash) + T(S))), excluding the
cost of the actual subversion merge, where T(hash) is the amortized
run-time of a hash lookup/insert operation and T(S) is the time to
compute one signature S(f). Given a reasonable implementation of the
underlying hash data structure and a sensible hashing algorithm, this
looks a lot like O(n) time. Note that I'm glossing over a hidden
union-find problem in tracking repeated merges, but this will not
significantly affect the run-time, and it's not clear to me at the
moment that union-find is even necessary for repeated merging in
subversion if repository images are processed in temporal order.
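For reference, the union-find step alluded to above would only take a
few lines; this is a generic sketch (not anything subversion-specific)
of how repeated merges could be tracked in near-constant amortized time
per operation:

```python
def make_dsu(n):
    """Disjoint-set union over items 0..n-1, for tracking which
    files have already been merged into the same group."""
    parent = list(range(n))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        # Merge the groups containing x and y.
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    return find, union
```

After union(0, 1), union(2, 3), and union(1, 3), all four items share
one root, so a later duplicate of any of them merges into the same
group rather than triggering a redundant merge.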
Rick
--
http://www.rickbradley.com MUPRN: 862 (68F/68F)
| some of the answers
random email haiku | I received from the research
| side of the business.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Oct 9 20:58:21 2002