Diff optimizations and generating big test files (was: Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?))

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 5 Jan 2011 16:13:42 +0100

On Wed, Jan 5, 2011 at 2:17 PM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>
>> Another question: a shell script might not be good, because not
>> portable (and not fast)? Should I use python for this? Maybe the
>> "write line by line with a line number in a for loop" would be a lot
>> faster in Python? I don't know a lot of python, but it might be a good
>> opportunity to learn some ...
>
> A shell script is probably fine. What I want is some data that I can
> use on my machine to test your patches.
>
> Here's a crude python script. With the default values it generates two
> 4.3MB files in less than 2 seconds on my machine. Subversion diff takes
> over 10 seconds to compare the files, GNU diff less than one second.
>
> Using --num-prefix=2 makes the script slight slower, since it generates
> more random numbers, and the time to run Subversion diff on the output
> goes up to 2min. GNU diff still takes a fraction of a second, and with
> --minimal the time is 35s. So for big improvements you probably want to
> concentrate on shortcut heuristics, rather than low-level optimisation.
>
> #!/usr/bin/python
>
> import random, sys
> from optparse import OptionParser
>
> random.seed('abc') # repeatable
>
> def write_file_contents(f, num_lines, num_prefix, num_suffix,
> percent_middle, unique):
> for i in range(num_lines):
> if num_prefix > 1:
> prefix = random.randint(1, num_prefix)
> else:
> prefix = 1
> line = str(prefix) + "-common-prefix-" + str(prefix)
>
> middle = random.randint(1, 100)
> if middle <= percent_middle:
> line += " " + str(12345678 + i) + " "
> else:
> line += " " + str(9999999999 + i) + unique + " "
>
> if num_suffix > 1:
> suffix = random.randint(1, num_suffix)
> else:
> suffix = 1
> line += str(suffix) + "-common-suffix-" + str(suffix)
> f.write(line + '\n')
>
>
> parser = OptionParser('Generate files for diff')
> parser.add_option('--num-lines', type=int, default=100000, dest='num_lines',
> help='number of lines, default 100000')
> parser.add_option('--num-prefix', type=int, default=1, dest='num_prefix',
> help='number of distinct prefixes, default 1')
> parser.add_option('--num-suffix', type=int, default=1, dest='num_suffix',
> help='number of distinct suffixes, default 1')
> parser.add_option('--percent-middle', type=int, default=99,
> dest='percent_middle',
> help='percentage matching middles, default 99')
> (options, args) = parser.parse_args(sys.argv)
>
> f1 = open('file1.txt', 'w')
> write_file_contents(f1, options.num_lines,
> options.num_prefix, options.num_suffix,
> options.percent_middle, 'a')
>
> f2 = open('file2.txt', 'w')
> write_file_contents(f2, options.num_lines,
> options.num_prefix, options.num_suffix,
> options.percent_middle, 'b')

Thanks for the script, it gives me some good inspiration.

However, it doesn't fit well with the optimization that's currently
being done on the diff-optimizations-bytes branch, because the
differing lines are spread throughout the entire file.

The current optimization on diff-optimization-bytes is not a low-level
optimization (though the original thread was about low-level
optimization of the high-level optimization implemented on the branch
(which also makes a *huge* impact, see the other thread with Stefan2's
optimization work on this branch)), nor is it a heuristic. It just
strips away all identical prefix and suffix *lines* from both files,
before doing the hard work. That's something that GNU diff does as
well (in svn's case it's more difficult to implement, because the
files are read in chunks, whereas GNU diff reads both files entirely
into memory before it starts).

So if you want to see the impact of the optimization in action, you
have to generate files that have a lot of identical lines at the
beginning and/or the end (and the differing lines somewhere together).
(sorry if I caused some misunderstanding about identical prefix/suffix
being entire lines vs. parts of lines)

Of course, there may be other optimizations possible, to make svn's
diff about as fast as GNU's, even without lots of identical
prefix/suffix. Some things I observed during tests last week (and
which may provide food for thought for further optimizations):

- svn diff's performance goes way down if the lines are too much alike
(especially the beginnings of the lines). Tests with the files
generated by your script seem to agree with that (the lines all start
with "1-common-prefix-1 1xxxx"). It's way better if you generate more
different lines.

- If the lines are different in length, svn diff also performs a lot
better than if they are all equally long.

- If a lot of lines are exactly the same (like the same chunk of lines
being repeated over and over again), it also performs better.

If I understand the code correctly, it seems kind of logical that
these factors play a role:

- The "regular" svn diff algorithm puts all lines in a tree (see
libsvn_diff/token.c#svn_diff__tree_insert_token). (actually, it's a
"forest", because first the root of the tree is chosen out of an array
of roots, by using the adler32 hash of the line (modulo the size of
the array of roots)).

- They are compared first by their adler32 hash, then by a call to
vtable->token_compare.

- diff_file.c#token_compare first compares the length. If that's the
same, it basically does a memcmp of both lines (reading additional
chunks from file if necessary, and normalizing (ignore-eol-style,
ignore-whitespace) in the process).

- each identical line is inserted only once in the tree (so the tree
doesn't grow too much if a lot of lines are the same).

I suspect that often, this tree becomes quite unbalanced after a while
(depending on the data), and inserting lines becomes expensive. It
might be interesting to analyze what the tree (or forest) looks like
after reading in such large files. Adding some balancing to that tree
might be a very worthwhile optimization...

Finally, one more optimization idea that comes to mind (that I also
mentioned in diff-optimization-bytes/BRANCH-README): when trying to
insert a new line into the token tree, and the previous line had a
match with a line from the other file, it might be worth it to first
try to compare the new line with the successor of that matched line
(maybe even before doing the effort of calculating the adler32 hash).
Only if that doesn't match, go and calculate the adler32 checksum, and
find a place in the tree to insert it. This would take advantage of
the fact that matching lines often come in chunks: if one line
matches, chances are reasonable that the next line will also match.

Cheers,

-- 
Johan

Received on 2011-01-05 16:14:39 CET

This message: [ Message body ]
Next message: Philip Martin: "Re: Diff optimizations and generating big test files"
Previous message: C. Michael Pilato: "Re: [PATCH] '--version --quiet' should display only the version info in svnadmin, svnlook, svnsync, svndumpfilter, svnversion, svnserve"
Next in thread: Philip Martin: "Re: Diff optimizations and generating big test files"
Reply: Philip Martin: "Re: Diff optimizations and generating big test files"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]