
Re: Trivial merge of big text file: Dismal performance, 540x faster if binary.

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Thu, 13 Jan 2011 13:55:58 +0100

On Thu, Jan 13, 2011 at 12:49 PM, krueger, Andreas (Andreas Krüger,
DV-RATIO) <andreas.krueger_at_hp.com> wrote:
> Hello,
>
> trivial merges of big text file changes are dismally slow. SVN can do
> much better when doing such merges as binary.
>
> Briefly, I think it should.  I suggest SVN should detect the trivial
> merge situation, and use the fast binary algorithm even for text
> files.
>
> I'd like to open a bug report / improvement suggestion on this.
>
> What do folks think?
>
>
> Here are the gory details:
>
> This starts with some branch F and a big text file F/b.xml (see end of
> message for details on "big").  This file has no SVN properties
> whatsoever.
>
> This got copied, with "svn cp", to some new branch T/b.xml.
>
> Then a major overhaul of F/b.xml was checked in.
>
> There had been no change in T/b.xml yet.  So merging the overhaul
> transaction from F to T is a *trivial* merge.  As the result of that
> merge, the T/b.xml content should be simply replaced with the content
> of the overhauled F/b.xml.
>
> That merge indeed worked as expected. However, it took 55 minutes and
> 21 seconds on my machine. During most of that time, there was very
> little network or hard drive activity, but one CPU was kept 100% busy.
>
>
> I found a way to speed this up considerably, by a factor of 540 in
> this particular case, from 55 minutes to 6 seconds: Use binary instead
> of text.
>
> Gory details of this:
>
> New F, new F/b.xml, with same content as before.
>
> I lied to SVN and told it F/b.xml isn't a text file, but binary,
> (setting svn:mime-type to application/octet-stream on F/b.xml).
>
> After this, again svn cp to (a new T's) T/b.xml, and again the same
> overhaul to F/b.xml .
>
> The whole time, I was careful to not tell SVN there was any connection
> to the previous experiment. In particular, no svn cp from the previous
> experiment, but fresh checkin from workspace.
>
> Again, the overhaul's merge from F/b.xml to T/b.xml resulted in
> replacing the old T/b.xml content with the present F/b.xml content as
> expected. This time, however, the merge took just over 6 seconds
> instead of 55.3 minutes, a speedup by a factor of about 540.
>
> I want to have that speed improvement, without needing to lie to SVN!
>
> Regards,
> and thanks to the SVN project members for providing fine software,
>
> Andreas
>
> P.S.:
>
> Numbers, in case someone cares:
>
> The original F/b.xml was 18,291,344 bytes and 456,951 lines.
>
> The output of svn diff after the overhaul contained 676,136 lines,
> (and that svn diff took quite a while to complete, which is
> understandable and not part of this issue).
>
> The overhauled F/b.xml was 18,311,873 bytes and 688,560 lines.
>
> I have seen similar performance problems with various SVN
> clients. The times quoted above are from Cygwin's svn command line
> client 1.6.12 on Windows. The protocol was HTTPS; the server was
> Apache HTTPD with the svn module (also 1.6.12).

Hi Andreas,

This is interesting, because it just so happens that I've been working
on a feature branch in svn (on and off for the past half year) for
performance improvements for the diff algorithm in svn, especially for
big files (I have also been using a "big" xml file for testing, of
around 60,000 lines).

Textual merging in svn makes use of a variant of the standard diff
algorithm, namely diff3. Just a couple of days ago, I finally
succeeded in making diff3 take advantage of those performance
improvements (haven't committed this to the branch yet, but maybe I'll
get to it tonight).
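
To make the "trivial merge" case concrete, here is a minimal Python
sketch of the shortcut Andreas is proposing. It is only an
illustration, not svn's actual code: the name merge_three_way and the
full_merge callback (standing in for a real diff3-style merge) are
hypothetical. The idea is that when one side is unchanged since the
common ancestor, the result can be picked without running any line
diff at all.

```python
def merge_three_way(base, mine, theirs, full_merge):
    """Merge three versions of a file, each given as a list of lines.

    full_merge is a hypothetical callback that performs the expensive
    diff3-style merge; it is only called when no shortcut applies.
    """
    # Trivial case from the report: the target (mine) was never changed
    # after the copy, so the merge result is simply the source (theirs).
    if mine == base:
        return theirs
    # Mirror-image trivial cases: the source did not change, or both
    # sides already made the identical change.
    if theirs == base or mine == theirs:
        return mine
    # Both sides changed differently: fall back to the real merge.
    return full_merge(base, mine, theirs)
```

The equality checks are linear in the file size, which is roughly
what treating the file as binary buys in the scenario described
above.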

Would you be able to build an svn client from source? If so, could you
perhaps build a client from
http://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes
?

This already contains the performance improvement for regular 'svn
diff', so you could test if that makes any difference. If you wait
until I've committed the changes to diff3, you could perhaps see the
impact on the merge you're trying to do.

[note: this performance improvement is currently not included in the
svn trunk, so it's not currently on track to be included in 1.7.
However, I think it's still an option (depends on some more work on
the branch, and then possibly review, some tweaks, ... if the other
devs agree with this change)]

[note2: don't expect this perf improvement to bring it down to 6
seconds but it might still make a big difference (it works very well
if both files are quite similar, and the changes are close together in
the file (a lot of identical prefix and suffix)). Judging from your
description though, there is a big difference between both versions of
the file (of 200,000+ lines).]
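
To illustrate the identical prefix and suffix idea mentioned in
note2, here is a minimal Python sketch. It is an assumption-laden
illustration and not the code on the branch: the function name
strip_common_affixes is hypothetical, and it compares whole lines
for simplicity. The point is that only the middle part that remains
after stripping needs to go through the expensive diff algorithm.

```python
def strip_common_affixes(old_lines, new_lines):
    """Return (prefix_len, suffix_len, old_middle, new_middle).

    Lines shared at the start and at the end of both files are counted
    and skipped; only the middle parts still need a real diff.
    """
    limit = min(len(old_lines), len(new_lines))

    # Count identical lines at the start of both files.
    prefix = 0
    while prefix < limit and old_lines[prefix] == new_lines[prefix]:
        prefix += 1

    # Count identical lines at the end, without overlapping the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and old_lines[-1 - suffix] == new_lines[-1 - suffix]):
        suffix += 1

    old_middle = old_lines[prefix:len(old_lines) - suffix]
    new_middle = new_lines[prefix:len(new_lines) - suffix]
    return prefix, suffix, old_middle, new_middle
```

If the changes are spread over the whole file, the middle parts stay
almost as large as the originals and the gain is small, which is why
a 200,000+ line difference is not the best case for this.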

Cheers,

-- 
Johan
Received on 2011-01-13 13:57:00 CET

This is an archived mail posted to the Subversion Users mailing list.
