[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Blue-sky idea: Representation reuse

From: mark benedetto king <bking_at_inquira.com>
Date: 2002-10-13 11:39:32 CEST

On Sat, Oct 12, 2002 at 01:10:50PM -0400, Greg Hudson wrote:
> On Sat, 2002-10-12 at 09:11, mark benedetto king wrote:
> > Yes, it has been done, but AFAIK, there are the "sound" O(n^2) approaches,
> > and the "practical" O(n log n) approaches.
>
> I'm still not sure how you get n log n. Even if you can reduce the file
> contents to a hash with nice similarity properties, how do you find
> pairs of hashes which are similar? Isn't that just O(n^2) with a
> smaller constant?

Hashing to a "document signature" and then pairwise-comparing the
hashes to find is, as you suggest, O(n^2) to cluster N documents.

You're having trouble thinking of the other ways because they
don't actually work. :-)

The N*log(N) ways typically require assigning a hash to each cluster.
Then you don't need to compare document-hashes, only document-hashes
to cluster-hashes (If the documents don't cluster, you get the O(N^2)
behaviour). IMO, these approaches are fundamentally broken. But
at least they terminate before the user does.

--ben

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Oct 13 11:47:24 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.