
Re: Diff optimizations and generating big test files

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 17 Jan 2011 02:58:23 +0100

On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>
>> Thanks for the script, it gives me some good inspiration.
>>
>> However, it doesn't fit well with the optimization that's currently
>> being done on the diff-optimizations-bytes branch, because the
>> differing lines are spread throughout the entire file.
>
> I thought you were working on two different prefix problems, but if it's
> all the same problem that's fine.  It's why I want *you* to write the
> script, then I can test your patches on my machine.  When you are
> thinking of replacing function calls with macros that's very much
> hardware/OS/compiler specific and testing on more than one platform is
> important.

Sorry it took so long (I was busy with, and interrupted by, other
things), but attached is, finally, a Python script that generates two
files suitable for testing the prefix/suffix optimization of the
diff-optimizations-bytes branch:

- Without options, it generates two files, file1.txt and file2.txt,
with 100,000 lines of identical prefix and 100,000 lines of identical
suffix. In between is a mismatching section of 500 lines (with a 50%
probability of mismatch).

- Lines are randomly generated, with random lengths between 0 and 80
(by default).

- On my machine, it generates those two files of ~8 MB in about 17 seconds.

- Usage: see below.
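
For readers without the attachment, the behaviour described above can be
sketched roughly as follows. This is not the attached script, only a
minimal illustration written from the description; the function names
(random_line, generate_pair), the character set, and the seed parameter
are assumptions made for the example.

```python
import random
import string

# Assumed character set; the real script may use a different one.
CHARS = string.ascii_letters + string.digits + " "


def random_line(rng, min_len=0, max_len=80):
    """Return one random line (no newline) with a length in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(CHARS) for _ in range(length))


def generate_pair(path1, path2, prefix_lines=100000, suffix_lines=100000,
                  middle_lines=500, percent_mismatch=50,
                  min_len=0, max_len=80, seed=None):
    """Write two files: identical prefix, partly differing middle, identical suffix."""
    rng = random.Random(seed)
    with open(path1, "w", newline="\n") as f1, \
         open(path2, "w", newline="\n") as f2:

        def write_identical(count):
            # The same random line goes to both files.
            for _ in range(count):
                line = random_line(rng, min_len, max_len) + "\n"
                f1.write(line)
                f2.write(line)

        write_identical(prefix_lines)
        for _ in range(middle_lines):
            left = random_line(rng, min_len, max_len) + "\n"
            f1.write(left)
            if rng.random() * 100 < percent_mismatch:
                # Independent random line; it can coincide with `left`
                # by chance, most likely when lines are very short.
                f2.write(random_line(rng, min_len, max_len) + "\n")
            else:
                f2.write(left)
        write_identical(suffix_lines)
```

Calling generate_pair("file1.txt", "file2.txt") with no further arguments
mirrors the defaults listed above.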

Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show
the following:

1) tools/diff/diff from trunk_at_1058723:
   1.020 s

2) tools/diff/diff from diff-optimizations_at_1058811:
   0.370 s

3) tools/diff/diff from diff-optimizations_at_1058811 with stefan2's
low-level optimizations [1]:
   0.290 s

4) GNU diff:
   0.157 s

(Note that svn's tools/diff/diff has a much higher startup cost than
GNU diff, for whatever reason, so that alone accounts for part of the
difference with GNU diff.)
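
For anyone who wants to repeat this on another platform, one possible way
to take such timings is sketched below. It is not how the numbers above
were obtained; the helper name time_command and the best-of-N approach
are assumptions for illustration. Timing each binary on two tiny files
gives a rough estimate of its startup cost, which can then be subtracted
from the timing on the big files.

```python
import subprocess
import time


def time_command(argv, runs=5):
    """Return the best wall-clock time (seconds) over `runs` executions of argv."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # The exit status is deliberately not checked: diff tools
        # conventionally exit with 1 when the files differ.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, time_command(["diff", "file1.txt", "file2.txt"]) times
whichever diff is first on the PATH; pass the full path to
tools/diff/diff to time the svn one.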

To really analyze the benefit of the low-level optimizations (and
which of them have the most impact), bigger sample data may be
needed.

===========
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
                        filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
                        filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
                        number of prefix lines, default 100000
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
                        number of suffix lines, default 100000
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
                        number of lines in the middle, non-matching section,
                        default 500
  --percent-mismatch=PERCENT_MISMATCH
                        percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
                        minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
                        maximum length of randomly generated lines, default 80

Cheers,

-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to
integrate (some of) these suggestions into the branch. That may take
me another couple of days (identifying which changes have the biggest
speed/weight gain etc).
Received on 2011-01-17 02:59:24 CET

This is an archived mail posted to the Subversion Dev mailing list.
