> From: Philip Martin [mailto:pm@localhost]On Behalf Of Philip Martin
> Sent: 16 May 2002 14:24
> Philip Martin <firstname.lastname@example.org> writes:
> Eek! I don't know what went wrong, but the body of my last mail
> appears to have gone walkies :-( Here is what I meant to say...
> "Sander Striker" <email@example.com> writes:
>> I will have time somewhere next week to continue work on it. One
>> of the things I want to try is to implement the algorithm described
>> in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, Myers and
> I have some old C++ code that implements O(NP). I was porting this to
> Subversion when Sander's code appeared. He and I discussed our
> approaches and I decided to stop work on my code. Sander contacted me
> yesterday, asking about the performance of O(NP). This prompted me to
> look out my old code and test it against GNU diff. At first my code
> used a vtable function based approach to tokens, a bit like the one in
> svn_diff.h. However that proved to be a performance killer, the
> algorithm has a tight loop comparing tokens and the function call
> overhead is significant. Yesterday I stripped out the token
> interface, moving to an array interface (which is also more like GNU
> diff) and added some simple line hashing. It now runs almost as fast
> as GNU diff --minimal, its about 5% slower.
I want to see if my BV-HS implementation will benefit from some optimizations
I have in mind.
> While the token/vtable interface in svn_diff.h provides a generic diff
> capability, I think it may be a performance limitation. My original
> C++ code was template based, the "function" interface was non-virtual
> and could be inlined. It will be interesting to see if the algorithm
> used by Sander's code is similarly affected by this interface.
It could benefit a speedup by inlining it, although I don't know how
much. That's something I will try. We might need an API change there,
although I most certainly hope we can keep it generic.
> It may be that Subversion needs different algorithms depending on the
> size of the file. A minimal match algorithm for small/medium files to
> reduce the number of conflicts, and a non-minimal algorithm for large
> files to get better speed/memory performance.
Yes, that may be the case. I received a book I ordered this week:
"Algorithms on strings, trees, and sequences" by Dan Gusfield
This is really enlighting me in what (should) work and what not. Both
our selected algorithms are suboptimal in speed, judging from what I
I've spent a reasonable amount of time in researching what we need for the
diff lib and like to continue doing some more research for a little while.
Like I said, next week has some time allocated for this.
In the mean time I'll commit a fix for a bug in the current codebase that
I stumbled over. That way we still have something that works, albeit on
the slow side.
PS. For the impatient and interested, these papers deserve some attention:
"A sublinear algorithm for approximate keyword searching." by E.W. Meyers
Algorithmica, 12:345-74, 1994
"Algorithmic advances for searching biosequence databases" by E. Meyers
Computational Methods in Genome Research, 121-35, 1994.
To unsubscribe, e-mail: firstname.lastname@example.org
For additional commands, e-mail: email@example.com
Received on Fri May 17 18:26:52 2002