On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin
<philip.martin_at_wandisco.com> wrote:
> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>
>> Thanks for the script, it gives me some good inspiration.
>>
>> However, it doesn't fit well with the optimization that's currently
>> being done on the diff-optimizations-bytes branch, because the
>> differing lines are spread throughout the entire file.
>
> I thought you were working on two different prefix problems, but if it's
> all the same problem that's fine. It's why I want *you* to write the
> script, then I can test your patches on my machine. When you are
> thinking of replacing function calls with macros that's very much
> hardware/OS/compiler specific and testing on more than one platform is
> important.
Sorry it took so long (busy/interrupted with other things), but
attached is, finally, a Python script that generates two files
suitable for testing the prefix/suffix optimization of the
diff-optimizations-bytes branch:
- Without options, it generates two files, file1.txt and file2.txt,
with 100,000 lines of identical prefix and 100,000 lines of identical
suffix, and in between a mismatching section of 500 lines (with a
probability of mismatch of 50%).
- Lines are randomly generated, with random lengths between 0 and 80
(by default).
- On my machine, it generates those two files of ~8 MB in about 17 seconds.
- Usage: see below.
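
The generation scheme described in the list above can be sketched roughly
as follows. This is a minimal illustration, not the attached script: the
names random_line and generate_pair, the character set, and the seed
parameter are assumptions made for the example, and "probability of
mismatch" is interpreted here as an independent per-line chance.

```python
import random
import string

# Assumed alphabet for the random lines; the real script may use another one.
ALPHABET = string.ascii_letters + string.digits + " "


def random_line(rng, min_len=0, max_len=80):
    """Return one newline-terminated line of min_len..max_len random characters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length)) + "\n"


def generate_pair(path1, path2, prefix_lines=100000, suffix_lines=100000,
                  middle_lines=500, percent_mismatch=50,
                  min_len=0, max_len=80, seed=None):
    """Write two files sharing an identical prefix and suffix.

    In the middle section, each line of the second file is replaced by a
    fresh random line with probability percent_mismatch / 100.
    """
    rng = random.Random(seed)
    with open(path1, "w") as f1, open(path2, "w") as f2:
        # Identical prefix: the same line goes to both files.
        for _ in range(prefix_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            f2.write(line)

        # Middle section: some lines differ between the two files.
        for _ in range(middle_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            if rng.random() * 100 < percent_mismatch:
                # A fresh random line; it can coincide with the original
                # by chance, most likely when lines are very short.
                line = random_line(rng, min_len, max_len)
            f2.write(line)

        # Identical suffix.
        for _ in range(suffix_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            f2.write(line)
```

Calling generate_pair("file1.txt", "file2.txt") with these defaults would
produce a pair of files shaped like the ones described above; passing a
fixed seed makes the output reproducible across machines.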
Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show
the following:
1) tools/diff/diff from trunk@1058723:
   1.020 s
2) tools/diff/diff from diff-optimizations@1058811:
   0.370 s
3) tools/diff/diff from diff-optimizations@1058811 with stefan2's
   low-level optimizations [1]:
   0.290 s
4) GNU diff:
   0.157 s
(Note that svn's tools/diff/diff has a much higher startup cost than
GNU diff, for whatever reason, so that alone accounts for part of the
difference with GNU diff.)
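
For anyone who wants to repeat such timings on another platform, a small
harness along these lines may help. It is only a sketch and not the way
the numbers above were obtained: the helper name best_time and the
choice of reporting the best of several runs are assumptions.

```python
import subprocess
import time


def best_time(cmd, runs=5):
    """Run cmd `runs` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # diff exits with status 1 when the files differ, so the exit
        # status is deliberately not checked here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Example (paths are placeholders):
#   best_time(["diff", "file1.txt", "file2.txt"])
```

Timing the same binary on two empty files gives a rough estimate of its
startup cost, which can then be subtracted from the figures for the big
files.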
For really analyzing the benefit of the low-level optimizations (and
which of them have the most impact), maybe bigger sample data is
needed.
===========
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
                        filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
                        filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
                        number of prefix lines, default 100000
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
                        number of suffix lines, default 100000
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
                        number of lines in the middle, non-matching section,
                        default 500
  --percent-mismatch=PERCENT_MISMATCH
                        percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
                        minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
                        maximum length of randomly generated lines, default 80
Cheers,
--
Johan
[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to
integrate (some of) these suggestions into the branch. That may take
me another couple of days (identifying which changes have the biggest
speed/weight gain etc).
Received on 2011-01-17 02:59:24 CET