> From: Philip Martin [mailto:pm@localhost]On Behalf Of Philip Martin
> Sent: 16 May 2002 14:24
> Philip Martin <philip@codematters.co.uk> writes:
> 
> Eek!  I don't know what went wrong, but the body of my last mail
> appears to have gone walkies :-(  Here is what I meant to say...
*grin*
 
> "Sander Striker" <striker@apache.org> writes:
> 
>> I will have time somewhere next week to continue work on it.  One
>> of the things I want to try is to implement the algorithm described
>> in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, Myers and
>> Miller.
> 
> I have some old C++ code that implements O(NP). I was porting this to
> Subversion when Sander's code appeared. He and I discussed our
> approaches and I decided to stop work on my code.  Sander contacted me
> yesterday, asking about the performance of O(NP). This prompted me to
> look out my old code and test it against GNU diff.  At first my code
> used a vtable function based approach to tokens, a bit like the one in
> svn_diff.h.  However that proved to be a performance killer, the
> algorithm has a tight loop comparing tokens and the function call
> overhead is significant.  Yesterday I stripped out the token
> interface, moving to an array interface (which is also more like GNU
> diff) and added some simple line hashing. It now runs almost as fast
> as GNU diff --minimal, its about 5% slower.
I want to see if my BV-HS implementation will benefit from some optimizations
I have in mind.
 
> While the token/vtable interface in svn_diff.h provides a generic diff
> capability, I think it may be a performance limitation. My original
> C++ code was template based, the "function" interface was non-virtual
> and could be inlined. It will be interesting to see if the algorithm
> used by Sander's code is similarly affected by this interface.
It could benefit a speedup by inlining it, although I don't know how
much.  That's something I will try.  We might need an API change there,
although I most certainly hope we can keep it generic.
> It may be that Subversion needs different algorithms depending on the
> size of the file. A minimal match algorithm for small/medium files to
> reduce the number of conflicts, and a non-minimal algorithm for large
> files to get better speed/memory performance.
Yes, that may be the case.  I received a book I ordered this week:
"Algorithms on strings, trees, and sequences" by Dan Gusfield
(ISBN 0-521-58519-8).
This is really enlighting me in what (should) work and what not.  Both
our selected algorithms are suboptimal in speed, judging from what I
read.
I've spent a reasonable amount of time in researching what we need for the
diff lib and like to continue doing some more research for a little while.
Like I said, next week has some time allocated for this.
In the mean time I'll commit a fix for a bug in the current codebase that
I stumbled over.  That way we still have something that works, albeit on
the slow side.
Sander
PS.  For the impatient and interested, these papers deserve some attention:
     
     "A sublinear algorithm for approximate keyword searching." by E.W. Meyers
     Algorithmica, 12:345-74, 1994
     "Algorithmic advances for searching biosequence databases" by E. Meyers
     Computational Methods in Genome Research, 121-35, 1994.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri May 17 18:26:52 2002