
Re: Trivial merge of big text file: Dismal performance, 540x faster if binary.

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Fri, 14 Jan 2011 23:52:10 +0100

On Fri, Jan 14, 2011 at 3:53 PM, krueger, Andreas (Andreas Krüger,
DV-RATIO) <andreas.krueger_at_hp.com> wrote:
> Hello, Johan and all,
>
> first, for the record, here is another comparison between
> binary and text merge performance, this time with the files
> generated by my script (repeated below):
>
> Binary merge took 3.56 seconds, text merge took 3:45:45.202 hours.
> In this particular case, binary merge performance was 3805 times
> faster than text merge performance.
>
>
>
>> Textual merging in svn makes use of a variant of the standard diff
>> algorithm, namely diff3.  Just a couple of days ago, I finally
>> succeeded in making diff3 take advantage of ... performance
>> improvements ... .
>
> Good news! Excellent! Thank you!
>
> But... does this relate to my problem?
>
> The improved diff3 will give a nice performance improvement in the
> *general* case.
>
> I certainly want that improvement!
>
>
> Another nice performance improvement of a factor of several hundreds
> (or thousands) could be obtained for big files in the *trivial* case,
> if SVN didn't diff3 at all, but simply copied the result.
>
> I also want this other improvement!
>
>
> Finally:
>
> SVN already contains the intelligence needed to find out whether a
> merge is trivial or not.  For, in the binary case, the trivial merges
> are precisely the ones that SVN knows how to do.
>
>
> Johan (or whoever else), please kindly enlighten me, should I be
> missing something!

Ok, after rereading this thread, I'm starting to understand what you
mean: why should "merge" run an expensive diff algorithm when it can
be 100% sure that it can simply copy the contents from the source to
the target (because the target file has not had any changes since it
was branched)?
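
To make that concrete, here is a rough sketch in Python of the
shortcut being proposed. This is just an illustration, not actual SVN
code; the function and its name are made up:

    # Hypothetical sketch of the proposed shortcut -- not SVN's actual
    # merge code.
    import filecmp
    import shutil

    def trivial_merge_or_fallback(base, source, target):
        # Three-way merge of 'source' into 'target', with 'base' as
        # the common ancestor. Returns True if the shortcut applied.
        if filecmp.cmp(base, target, shallow=False):
            # The target has no changes since it was branched, so the
            # merge result is exactly the source's content: copy it.
            shutil.copyfile(source, target)
            return True
        # The target was modified, so a real (expensive) diff3-style
        # merge is needed; that part is not sketched here.
        return False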

I think it's a good suggestion, but I can't really comment more on
(the feasibility of) it, because I'm not that familiar with that part
of the codebase. I've only concentrated on the diff algorithm itself
(and how it's used by "svn diff" and "svn merge" (for text files)).
Maybe someone else can chime in to comment on that?

Of course, if there were any changes to the target file, it wouldn't
be a trivial merge (copy contents) anymore, so you would immediately
fall back to the very expensive case. But I agree that's hardly a
reason not to take the shortcut when you can...

A couple more thoughts on the diff side of the story:
- Your perl script presents an extremely hard case for the diff
algorithm (see the sketch right after this list), because:
* Files A and B differ in each and every one of the 1,000,000 lines
(so my prefix/suffix scanning optimization will not help at all).
* All lines in each file are also unique within that file (this makes
it more expensive).
* Most lines have the same length as other lines (this also makes it
more expensive).
* The lines are very short (the diff algorithm's running time is
mainly proportional to the number of lines).
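
For illustration, input with all of these properties can be generated
with something like the following. This is not your perl script, just
a hypothetical Python equivalent:

    # Generate two files that differ on every line; all lines are
    # unique within each file, short, and of identical length.
    N = 1000000
    with open('A.txt', 'w') as a, open('B.txt', 'w') as b:
        for i in range(N):
            a.write('a%07d\n' % i)
            b.write('b%07d\n' % i)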

These characteristics are very atypical compared with real-world
examples. Usually there is some identical part at the beginning and/or
the end, and lines vary a lot in length. If a significant portion of
the lines at the beginning and/or the end is identical, the
optimizations on the diff-optimizations-bytes branch will help a lot.

- Interestingly, GNU diff calculates the diff between these files in 7
seconds on my machine. But if I give it the option '--minimal', it
also runs for hours (I started it 2 hours ago; it's still running).
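
(To reproduce, something like the following, with A.txt and B.txt
being the generated files:

    $ time diff A.txt B.txt > /dev/null
    $ time diff --minimal A.txt B.txt > /dev/null

'--minimal' asks GNU diff to guarantee a smallest-possible diff, which
disables the heuristics that make the default mode fast.)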

- Can you try the merge on your original example (big.xml) with the
option -x-b (ignore changes in the amount of whitespace)? Just to see
whether it makes a difference. In my tests this made diff *much*
faster, so I'm guessing the same holds for merge. Of course, this
depends entirely on the example data (it won't help at all for the
perl-generated files; those would be slowed down even more).
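
(Concretely, that would be something like:

    svn merge -x-b ^/branches/your-branch .

with your actual merge source and target, of course. The -x-b option
just passes '-b' down to the internal diff; you can also try it with
'svn diff' first to see the effect on diff alone.)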

Cheers,

-- 
Johan
Received on 2011-01-14 23:53:09 CET
