Re: Diff code overhaul, WAS: RE: Problem invoking diff3.

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2002-05-17 20:59:09 CEST

"Sander Striker" <striker@apache.org> writes:

> > While the token/vtable interface in svn_diff.h provides a generic diff
> > capability, I think it may be a performance limitation. My original
> > C++ code was template based, the "function" interface was non-virtual
> > and could be inlined. It will be interesting to see if the algorithm
> > used by Sander's code is similarly affected by this interface.
>
> It could benefit a speedup by inlining it, although I don't know how
> much. That's something I will try. We might need an API change there,
> although I most certainly hope we can keep it generic.

The speedup obviously depends on how many token lookups and
comparisons the algorithm requires. It's important to remember just
how many that can be for 'difficult' files, it's lots and lots. The
current interface requires a function call to get each token, and a
function call to compare them. My O(NP) implementation did something
similar, removing the function calls more than doubled the speed. The
algorithm can't really cache the tokens, that more or less defeats the
purpose of providing the function interface in the first place.

I worry that we have an overly generic interface, that will only be
used for one specific problem. Producing conventional diff output
pretty much requires a line based diff. It is unlikely that the tokens
will ever be anything other than a variable length string of
characters. I guess if one is comparing digital images, say, then the
tokens might be different. However, that would use a totally
different algorithm so the ability to support such tokens isn't
terribly useful.

>
> > It may be that Subversion needs different algorithms depending on the
> > size of the file. A minimal match algorithm for small/medium files to
> > reduce the number of conflicts, and a non-minimal algorithm for large
> > files to get better speed/memory performance.
>
> Yes, that may be the case. I received a book I ordered this week:
> "Algorithms on strings, trees, and sequences" by Dan Gusfield
> (ISBN 0-521-58519-8).
>
> This is really enlighting me in what (should) work and what not. Both
> our selected algorithms are suboptimal in speed, judging from what I
> read.
>
> I've spent a reasonable amount of time in researching what we need for the
> diff lib and like to continue doing some more research for a little while.
> Like I said, next week has some time allocated for this.

Fine, I'm not trying to tread on your toes :) The only reason I looked
at my code again was because you asked me about performance. I then
only posted it when I realised that it was a) so close to GNU diff in
performance and b) only a few hundred lines to provide diff and merge.
It's quite possible there are better algorithms, I'm happy that you
are working on it and I hope you come up with a solution.

-- 
Philip
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Fri May 17 21:00:20 2002

This message: [ Message body ]
Next message: Dean Hiller: "major design changes...can we"
Previous message: Marcus Comstedt: "Re: UTF-8"
In reply to: Sander Striker: "Diff code overhaul, WAS: RE: Problem invoking diff3."
Next in thread: Sander Striker: "RE: Diff code overhaul, WAS: RE: Problem invoking diff3."
Reply: Sander Striker: "RE: Diff code overhaul, WAS: RE: Problem invoking diff3."

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]