[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Blue-sky idea: Representation reuse

From: Branko Čibej <brane_at_xbc.nu>
Date: 2002-10-09 23:57:24 CEST

Rick Bradley wrote:

>* Branko ??ibej (brane@xbc.nu) [021009 16:32]:
>
>
>>>Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
>>>in F compute the signature s_i = S(f_i) and store it in a hash along
>>>with i. Whenever an insert into the hash finds a prior string f_j
>>>there, then merge f_i and f_j in subversion.
>>>
>>>
>>>
>>No, he was talking about finding _similar_ files. Sure, you can use a
>>digest index to find _identical_ ones, but for similarity, only O(N^2)
>>will do.
>>
>>
>
>Consider what happens when your digest can be used as a distance metric:
>
>http://ixazon.dynip.com/~cmeclax/nilsimsa.html
>
>(or similar)
>
>Depending upon the degree of similarity required/desired one may still
>have to sort digests or use a priority queue to streamline comparisons,
>but that's still going to be better than O(n^2). I don't know of any
>quadratic lower bounds for similarity comparisons.
>
>

Hm, good point. Usually, when I think of digests, I assume cryptographic
hashes, which you certainly can't use as a distance metric (at least I
hope so; if you can, then i'm not using ssh any more :-).

Hm. Using a nilsimsa hash for a key would let you find the most similar
representation in O(log N) time, then? Interesting indeed.

-- 
Brane Čibej   <brane_at_xbc.nu>   http://www.xbc.nu/brane/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Oct 9 23:58:09 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.