> From: Philip Martin [mailto:pm@localhost] On Behalf Of Philip Martin
> Sent: 16 May 2002 14:24
> Philip Martin <philip@codematters.co.uk> writes:
>
> Eek! I don't know what went wrong, but the body of my last mail
> appears to have gone walkies :-( Here is what I meant to say...
*grin*
> "Sander Striker" <striker@apache.org> writes:
>
>> I will have time somewhere next week to continue work on it. One
>> of the things I want to try is to implement the algorithm described
>> in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, Myers and
>> Miller.
>
> I have some old C++ code that implements O(NP). I was porting this to
> Subversion when Sander's code appeared. He and I discussed our
> approaches and I decided to stop work on my code. Sander contacted me
> yesterday, asking about the performance of O(NP). This prompted me to
> look out my old code and test it against GNU diff. At first my code
> used a vtable-based function approach to tokens, a bit like the one in
> svn_diff.h. However, that proved to be a performance killer: the
> algorithm has a tight loop comparing tokens, and the function call
> overhead is significant. Yesterday I stripped out the token
> interface, moved to an array interface (which is also more like GNU
> diff), and added some simple line hashing. It now runs almost as fast
> as GNU diff --minimal; it's about 5% slower.
I want to see if my BV-HS implementation will benefit from some optimizations
I have in mind.
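
For anyone who wants to experiment in the meantime, here is a rough sketch
of the O(NP) edit-distance core as I read it from the paper. It computes
only the distance (no edit script, no tuning), and all of the names in it
are mine, not svn_diff's:

/* A rough sketch of the O(NP) edit-distance core (distance only, no edit
 * script).  Names and layout are illustrative, not from svn_diff. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int imax(int a, int b) { return a > b ? a : b; }

/* Follow diagonal k from y as far as the two sequences keep matching. */
static int
snake(int k, int y, const char *a, int m, const char *b, int n)
{
  int x = y - k;
  while (x < m && y < n && a[x] == b[y])
    {
      x++;
      y++;
    }
  return y;
}

/* Insert/delete edit distance between a (length m) and b (length n).
   Requires m <= n; the result is D = delta + 2*P with delta = n - m. */
static int
onp_distance(const char *a, int m, const char *b, int n)
{
  int delta = n - m;
  int offset = m + 1;              /* diagonal k lives at fp[k + offset] */
  int size = m + n + 3;            /* k ranges over -(m+1) .. n+1 */
  int *fp = malloc(size * sizeof(*fp));
  int p, k;

  for (k = 0; k < size; k++)
    fp[k] = -1;

  p = -1;
  do
    {
      p++;
      for (k = -p; k <= delta - 1; k++)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      for (k = delta + p; k >= delta + 1; k--)
        fp[k + offset] = snake(k, imax(fp[k - 1 + offset] + 1,
                                       fp[k + 1 + offset]), a, m, b, n);
      fp[delta + offset] = snake(delta, imax(fp[delta - 1 + offset] + 1,
                                             fp[delta + 1 + offset]),
                                 a, m, b, n);
    }
  while (fp[delta + offset] != n);

  free(fp);
  return delta + 2 * p;
}

int main(void)
{
  /* The example from the Myers papers; the expected distance is 5. */
  const char *a = "cbabac";        /* the shorter sequence goes first */
  const char *b = "abcabba";
  printf("edit distance: %d\n",
         onp_distance(a, (int)strlen(a), b, (int)strlen(b)));
  return 0;
}

The attractive property is that the work is bounded by N times the number
of deletions P rather than the full edit distance, which is why it does so
well when the two files are mostly similar.
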
> While the token/vtable interface in svn_diff.h provides a generic diff
> capability, I think it may be a performance limitation. My original
> C++ code was template-based; the "function" interface was non-virtual
> and could be inlined. It will be interesting to see if the algorithm
> used by Sander's code is similarly affected by this interface.
It could benefit from inlining, although I don't know by how much.
That's something I will try. We might need an API change there,
although I most certainly hope we can keep it generic.
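
To make the overhead point concrete, here is a toy loop comparing the two
shapes of the inner comparison: one call per token pair through a function
pointer, versus a direct comparison of pre-hashed tokens in a plain array.
The interfaces below are made up for the example (they are not the
svn_diff.h ones), and the exact numbers will depend on the compiler:

/* Toy benchmark: function-pointer token compare vs. direct array compare.
 * Purely illustrative interfaces, not the svn_diff.h API. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* "vtable" style: every comparison goes through a function pointer.
   (An aggressive optimizer may inline this too.) */
typedef struct token_vtable_t
{
  int (*compare)(const void *a, const void *b);
} token_vtable_t;

static int compare_ints(const void *a, const void *b)
{
  return *(const int *)a - *(const int *)b;
}

int main(void)
{
  enum { N = 1 << 22 };
  int *a = malloc(N * sizeof(int));
  int *b = malloc(N * sizeof(int));
  token_vtable_t vtable = { compare_ints };
  long matches;
  clock_t start;
  int i;

  for (i = 0; i < N; i++)
    {
      a[i] = rand() & 0xff;        /* stand-ins for per-line hash values */
      b[i] = rand() & 0xff;
    }

  /* Tight loop through the function pointer. */
  start = clock();
  matches = 0;
  for (i = 0; i < N; i++)
    if (vtable.compare(&a[i], &b[i]) == 0)
      matches++;
  printf("vtable compare: %ld matches, %.3fs\n", matches,
         (double)(clock() - start) / CLOCKS_PER_SEC);

  /* Same loop on a plain array of pre-hashed tokens; the compiler can
     inline the comparison and keep everything in registers. */
  start = clock();
  matches = 0;
  for (i = 0; i < N; i++)
    if (a[i] == b[i])
      matches++;
  printf("direct compare: %ld matches, %.3fs\n", matches,
         (double)(clock() - start) / CLOCKS_PER_SEC);

  free(a);
  free(b);
  return 0;
}

Even if the real difference is smaller than this toy makes it look, the
inner loop runs once per token pair, so anything saved there adds up
quickly.
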
> It may be that Subversion needs different algorithms depending on the
> size of the file. A minimal match algorithm for small/medium files to
> reduce the number of conflicts, and a non-minimal algorithm for large
> files to get better speed/memory performance.
Yes, that may be the case. I received a book I ordered this week:
"Algorithms on strings, trees, and sequences" by Dan Gusfield
(ISBN 0-521-58519-8).
This is really enlighting me in what (should) work and what not. Both
our selected algorithms are suboptimal in speed, judging from what I
read.
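
To make the size-based idea above concrete, the dispatch could be as small
as something like this (hypothetical names and an arbitrary cutoff, just to
show the shape of the idea):

/* Hypothetical dispatch on file size; nothing like this exists in the
 * code today, it only sketches the idea. */
#include <stdio.h>
#include <stddef.h>

#define MINIMAL_DIFF_THRESHOLD (512 * 1024)   /* arbitrary cutoff, in bytes */

typedef enum { DIFF_MINIMAL, DIFF_HEURISTIC } diff_strategy_t;

/* Pick a strategy from the sizes of the two files. */
static diff_strategy_t
choose_strategy(size_t original_size, size_t modified_size)
{
  /* Small/medium files: spend the time to find a minimal edit script
     and keep the number of conflicts down.
     Large files: favour speed and memory over minimality. */
  if (original_size <= MINIMAL_DIFF_THRESHOLD
      && modified_size <= MINIMAL_DIFF_THRESHOLD)
    return DIFF_MINIMAL;
  return DIFF_HEURISTIC;
}

int main(void)
{
  printf("%d %d\n",
         choose_strategy(10 * 1024, 12 * 1024),        /* DIFF_MINIMAL */
         choose_strategy(4 * 1024 * 1024, 3 * 1024));  /* DIFF_HEURISTIC */
  return 0;
}
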
I've spent a reasonable amount of time researching what we need for the
diff lib and would like to continue that research for a little while.
Like I said, I have some time allocated for this next week.
In the meantime I'll commit a fix for a bug I stumbled over in the current
codebase. That way we still have something that works, albeit on the slow
side.
Sander
PS. For the impatient and interested, these papers deserve some attention:
"A sublinear algorithm for approximate keyword searching." by E.W. Meyers
Algorithmica, 12:345-74, 1994
"Algorithmic advances for searching biosequence databases" by E. Meyers
Computational Methods in Genome Research, 121-35, 1994.