
Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 20 Dec 2010 23:15:51 +0100

On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>
>> This makes the diff algorithm another 10% - 15%
>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>> Mb file (60000 lines), of which most lines are identical
>> prefix/suffix).
>
> Can you provide a test script?  Or describe the test more fully, please.

Hmm, it's not easy to come up with a test script that tests this "from
scratch" (other than by testing diff directly, see below). I test it
with a repository (a dump/load of an old version of our production
repository) which contains this 60000-line XML file (1,5 Mb) with 2272
revisions.

I run blame on this file, over the svnserve protocol on localhost
(server running on the same machine), with an svnserve built from
Stefan^2's performance branch (with membuffer caching of full-texts, so
server I/O is not the bottleneck). This gives me an easy way to call
diff 2272 times on this file, and measure it (with the help of some
instrumentation code in blame.c, see attachment). Incidentally, it's
also the actual use case I first set out to optimize (blame for large
files with many revisions).

This is the actual command I use, and the output generated by the
instrumentation in blame.c:
[[[
$ time svn blame -x-b svn://localhost/trunk/path/to/settings.xml >/dev/null
### blame took 117546875 usec (117 s)
### file_rev_handler: 3203125 (3 s) - window_handler: 110781250 (110 s)
### wrapped_handler: 37859375 (37 s) - diff: 70921875 (70 s) - blame_process: 1015625 (1 s)

real 1m58.008s
user 0m0.015s
sys 0m0.031s
]]]

(Note: I use the -x-b option in this case because, for some reason, it
speeds things up tremendously. This probably has something to do with my
test data, which contains some "all tabs to spaces" and "all spaces to
tabs" revisions in its history.)

Some background info on this instrumentation output:
- "blame took ...": timing before and after doing all the stuff
  (around the call to svn_ra_get_file_revs2, which includes all the
  callbacks).
- "file_rev_handler" and "window_handler": timing of all the useful
  work that's done at the client side (so this kind of excludes the time
  that the client is simply waiting for the server).
- Last line contains parts of the window_handler time:
  - "wrapped_handler": time taken to build all the full-texts at the
    client side.
  - "diff": time spent in calls to svn_diff_file_diff_2 (this is the one
    I'm trying to optimize with the diff-optimizations stuff).
  - "blame_process": the time taken to create and insert the blame
    chunks (linked list with blame information). As you can see this is
    quite negligible.

So, I'm mainly looking at the time reported after "diff:".
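
To put the "diff:" number in proportion, here is a small sketch that
parses the instrumentation lines from the run above and computes which
share of the total and of the window_handler time is spent in diff. The
helper name parse_timings and the regular expression are made up for
this illustration; they are not part of the blame.c instrumentation.

```python
import re

# Instrumentation output copied from the blame run above; the mail-wrapped
# last line is joined back into a single line here.
OUTPUT = """\
### blame took 117546875 usec (117 s)
### file_rev_handler: 3203125 (3 s) - window_handler: 110781250 (110 s)
### wrapped_handler: 37859375 (37 s) - diff: 70921875 (70 s) - blame_process: 1015625 (1 s)
"""


def parse_timings(text):
    """Return {label: microseconds} for each 'label: N' or 'label took N'."""
    timings = {}
    for label, usec in re.findall(r"(\w+)(?::| took) (\d+)", text):
        timings[label] = int(usec)
    return timings


timings = parse_timings(OUTPUT)
diff_of_window = timings["diff"] / timings["window_handler"]
diff_of_total = timings["diff"] / timings["blame"]
print("diff share of window_handler: %.1f%%" % (100 * diff_of_window))
print("diff share of total blame:    %.1f%%" % (100 * diff_of_total))
```

With these numbers, diff accounts for roughly 64% of the window_handler
time and about 60% of the whole blame run.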

An alternative way to test this, which may be scriptable: testing diff
directly, by running "svn diff" on a large file. I can notice small
differences (in the area of 10 or 20 milliseconds) when simply
executing a single "svn diff" of settings.xml, with one line modified.
But that's too small to draw any definite conclusions (measurement
inaccuracy, overhead of program startup, ...). Maybe a simple test in C,
with a for loop with many iterations calling svn_diff_file_diff_2, would
be better.
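
The idea of such a loop, sketched in Python rather than C: repeat the
call many times inside one process, so program startup is paid only once
and the per-call time rises above the measurement noise. Note that this
does not call svn_diff_file_diff_2; Python's difflib is used purely as a
stand-in workload, and time_repeated and stand_in_diff are hypothetical
names for this illustration.

```python
import difflib
import time


def time_repeated(func, iterations):
    """Call func() `iterations` times; return the mean seconds per call."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter() - start
    return elapsed / iterations


# Stand-in input: two line lists that differ in exactly one line, so most
# of the content is identical prefix/suffix.
original = ["line %d\n" % i for i in range(2000)]
modified = list(original)
modified[1000] = "changed line\n"


def stand_in_diff():
    # Placeholder for the real diff call (assumption: any callable works).
    return list(difflib.unified_diff(original, modified))


per_call = time_repeated(stand_in_diff, 20)
print("mean time per diff call: %.6f s" % per_call)
```

A real C test would replace stand_in_diff with the call to
svn_diff_file_diff_2 and keep the same structure.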

I guess it would be easy to script the creation of a new repository,
commit a file in it with 100000 lines, modify one line, and diff that
while measuring it.
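
As a starting point, the test data could be generated like this. This is
only a sketch of the data-generation step: the file names, the XML-like
line format and the helper write_test_pair are made up for the example,
and creating the repository, committing and running "svn diff" are left
out.

```python
import os
import tempfile


def write_test_pair(directory, line_count=100000, changed_line=50000):
    """Write two files that differ in exactly one line; return their paths."""
    original_path = os.path.join(directory, "original.xml")
    modified_path = os.path.join(directory, "modified.xml")
    with open(original_path, "w") as orig, open(modified_path, "w") as mod:
        for i in range(line_count):
            line = '<setting id="%d" value="default"/>\n' % i
            orig.write(line)
            if i == changed_line:
                # The single modified line, in the middle of the file.
                mod.write('<setting id="%d" value="changed"/>\n' % i)
            else:
                mod.write(line)
    return original_path, modified_path


with tempfile.TemporaryDirectory() as tmp:
    original_path, modified_path = write_test_pair(tmp)
    with open(original_path) as a, open(modified_path) as b:
        differing = sum(1 for x, y in zip(a, b) if x != y)
    print("lines that differ:", differing)
```

The first file would be the committed version and the second one the
working copy modification to diff against it.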

(The best example I found in Subversion's own repository on
svn.apache.org was subversion/tests/cmdline/merge_tests.py, which has
~16500 lines and about 660 changes.)

Cheers,

-- 
Johan

Received on 2010-12-20 23:16:50 CET

This is an archived mail posted to the Subversion Dev mailing list.
