RE: Diff code overhaul, WAS: RE: Problem invoking diff3.

From: Sander Striker <striker_at_apache.org>
Date: 2002-05-18 17:52:17 CEST

> From: Philip Martin [mailto:pm@localhost]On Behalf Of Philip Martin
> Sent: 17 May 2002 20:59

> "Sander Striker" <striker@apache.org> writes:
>>> While the token/vtable interface in svn_diff.h provides a generic diff
>>> capability, I think it may be a performance limitation. My original
>>> C++ code was template based, the "function" interface was non-virtual
>>> and could be inlined. It will be interesting to see if the algorithm
>>> used by Sander's code is similarly affected by this interface.
>>
>> It could benefit a speedup by inlining it, although I don't know how
>> much. That's something I will try. We might need an API change there,
>> although I most certainly hope we can keep it generic.
>
> The speedup obviously depends on how many token lookups and
> comparisons the algorithm requires. It's important to remember just
> how many that can be for 'difficult' files, it's lots and lots. The
> current interface requires a function call to get each token, and a
> function call to compare them. My O(NP) implementation did something
> similar, removing the function calls more than doubled the speed. The
> algorithm can't really cache the tokens, that more or less defeats the
> purpose of providing the function interface in the first place.

The BV-HS function is single pass, top-down, so no token is required more
than once. With a small trick we can even do merging without rereading.
However, other algorithms (like O(NP)) need random access to both files.
It all comes down to which algorithm we end up using.

> I worry that we have an overly generic interface, that will only be
> used for one specific problem. Producing conventional diff output
> pretty much requires a line based diff. It is unlikely that the tokens
> will ever be anything other than a variable length string of
> characters. I guess if one is comparing digital images, say, then the
> tokens might be different. However, that would use a totally
> different algorithm so the ability to support such tokens isn't
> terribly useful.

In subversion that is pretty much true. For a visual diff/merge tool
I guess it holds aswell. If we gain much performance by moving to a
less generic API, we'll move to that.

>>> It may be that Subversion needs different algorithms depending on the
>>> size of the file. A minimal match algorithm for small/medium files to
>>> reduce the number of conflicts, and a non-minimal algorithm for large
>>> files to get better speed/memory performance.
>>
>> Yes, that may be the case. I received a book I ordered this week:
>> "Algorithms on strings, trees, and sequences" by Dan Gusfield
>> (ISBN 0-521-58519-8).
>>
>> This is really enlighting me in what (should) work and what not. Both
>> our selected algorithms are suboptimal in speed, judging from what I
>> read.
>>
>> I've spent a reasonable amount of time in researching what we need for the
>> diff lib and like to continue doing some more research for a little while.
>> Like I said, next week has some time allocated for this.
>
> Fine, I'm not trying to tread on your toes :)

Don't worry, you didn't :)

> The only reason I looked at my code again was because you asked me about
> performance. I then only posted it when I realised that it was a) so close
> to GNU diff in performance

For the --minimal case. I want to see if we can get their performance all
the time. Not a showstopper though, if it works at a reasonable speed I'll
be satisfied for alpha.

> and b) only a few hundred lines to provide diff and merge.
>
> It's quite possible there are better algorithms, I'm happy that you
> are working on it and I hope you come up with a solution.

Me too! :)

> Philip

Sander

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat May 18 17:45:31 2002

This message: [ Message body ]
Next message: Mark C. Chu-Carroll: "Re: major design changes...can we"
Previous message: Noel Grandin: "Re: major design changes...can we"
In reply to: Philip Martin: "Re: Diff code overhaul, WAS: RE: Problem invoking diff3."

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]