[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Blue-sky idea: Representation reuse

From: Rick Bradley <svn-dev_at_rickbradley.com>
Date: 2002-10-09 23:41:47 CEST

* Branko ??ibej (brane@xbc.nu) [021009 16:32]:
> >Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
> >in F compute the signature s_i = S(f_i) and store it in a hash along
> >with i. Whenever an insert into the hash finds a prior string f_j
> >there, then merge f_i and f_j in subversion.
> >
> No, he was talking about finding _similar_ files. Sure, you can use a
> digest index to find _identical_ ones, but for similarity, only O(N^2)
> will do.

Consider what happens when your digest can be used as a distance metric:

http://ixazon.dynip.com/~cmeclax/nilsimsa.html

(or similar)

Depending upon the degree of similarity required/desired one may still
have to sort digests or use a priority queue to streamline comparisons,
but that's still going to be better than O(n^2). I don't know of any
quadratic lower bounds for similarity comparisons.

Rick

-- 
 http://www.rickbradley.com    MUPRN: 465    (66F/66F)
                       |  of the que. Hehheh.
   random email haiku  |  Rimboy needs a bachelor
                       |  flag on rimboy.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Oct 9 23:42:31 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.