Rick Bradley wrote:
>* Greg Hudson (ghudson@MIT.EDU) [021009 12:12]:
>
>
>>An even more blue-sky idea, which Mike came up with, is to have a
>>background daemon running around looking for unexploited rep
>>similarities and redeltifying things. I think this is too hard
>>(there's no good heuristic for finding similar files except by doing
>>O(n^2) comparisons... though you might be able to play games with file
>>sizes or statistical characteristics of the contents, I suppose), but
>>it's an option.
>>
>>
>
>There's a much better way than O(n^2) as long as you have a signature
>function S() that suits your needs (md5, nilsimsa, soundex :-), etc.).
>
>Given a set of n files (strings) F = { f_1, ..., f_n }, for each file
>f_i in F compute the signature s_i = S(f_i) and store it in a hash,
>keyed on s_i, along with i. Whenever an insert finds a prior file f_j
>already in that slot, merge f_i and f_j in Subversion.
>
No, he was talking about finding _similar_ files. Sure, you can use a
digest index to find _identical_ ones, but for similarity, only an
O(n^2) pairwise comparison will do.
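A minimal sketch of the digest-bucketing approach under discussion
(filenames and contents are invented for illustration). It groups
byte-identical files in one pass, and also shows Brane's point: a file
that differs by a single byte lands in a different bucket, so merely
_similar_ files are never found this way.

```python
import hashlib
from collections import defaultdict

def group_identical(files):
    """Bucket files by content digest; files: dict of name -> bytes.

    One digest per file, so this is O(n) signature computations plus
    amortized O(1) hash inserts. It finds exact duplicates only.
    """
    buckets = defaultdict(list)
    for name, data in files.items():
        # Digest collision => contents are (almost certainly) identical.
        buckets[hashlib.md5(data).hexdigest()].append(name)
    return [names for names in buckets.values() if len(names) > 1]

print(group_identical({
    "a.txt": b"hello world",
    "b.txt": b"hello world",
    "c.txt": b"hello worlds",   # similar, not identical: not grouped
}))
# -> [['a.txt', 'b.txt']]
```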
>This has naive run-time O(n * (T(hash) + T(S))), excluding the
>complexity of the actual Subversion merge, where T(hash) is the
>amortized run-time for a hash lookup/insert operation and T(S) is the
>time to compute an S(f) signature. Given a reasonable implementation
>of the underlying hash data structure and a sensible hashing
>algorithm, this looks a lot like O(n) time. Note that I'm glossing
>over a hidden union-find problem in tracking repeated merges, but it
>will not significantly affect the run-time, and it's not clear to me
>at the moment that union-find is even necessary for repeated merging
>in Subversion if repository images are processed in temporal order.
>
>
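For the similarity case, the file-size heuristic Greg mentions can at
least prune the pairwise search before any expensive comparison runs.
A hypothetical pre-filter sketch (function name, sizes, and the
tolerance value are invented for illustration), using a sort plus
sliding window instead of all n*(n-1)/2 size checks:

```python
def candidate_pairs_by_size(sizes, tolerance=0.1):
    """Yield pairs of files whose sizes differ by at most `tolerance`
    as a fraction of the smaller size; sizes: dict of name -> bytes.

    This is only a pre-filter: surviving pairs still need a real
    similarity comparison, but wildly different sizes are skipped.
    """
    items = sorted(sizes.items(), key=lambda kv: kv[1])
    for i, (name_i, size_i) in enumerate(items):
        for name_j, size_j in items[i + 1:]:
            if size_j > size_i * (1 + tolerance):
                break  # sizes are sorted, so no later file qualifies
            yield (name_i, name_j)

print(list(candidate_pairs_by_size({"a": 100, "b": 105, "c": 500})))
# -> [('a', 'b')]
```

In the worst case (all files the same size) this degrades back to
O(n^2) pairs, which is consistent with Brane's objection; the win is
only on repositories whose file sizes are spread out.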
--
Brane Čibej <brane_at_xbc.nu> http://www.xbc.nu/brane/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Oct 9 23:31:49 2002