
Re: My take on the diff-optimizations-bytes branch

From: Stefan Fuhrmann <stefanfuhrmann_at_alice-dsl.de>
Date: Sat, 08 Jan 2011 06:01:12 +0100

On 03.01.2011 13:16, Johan Corveleyn wrote:
> On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
> <stefanfuhrmann_at_alice-dsl.de> wrote:
>> On 03.01.2011 02:14, Johan Corveleyn wrote:
>>> On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn<jcorvel_at_gmail.com>
>>> wrote:
> [... snip ...]
>
>>>> And it's fast too! It's taking only 58 seconds in "diff", vs. 72 for
>>>> the normal version. Splitting that up further, it's taking only 28
>>>> seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
>>>> some reason, the lcs processing took a little longer, but that's
>>>> probably unrelated (only did a single run)). And that's with
>>>> scan_backward_slow. That's pretty amazing.
>> On x64, datasource_open() got about 10x faster,
>> but my test-case was the TSVN changelog, where
>> we (mainly) insert at the beginning instead of
>> close to the end of the file. So, one broken
>> scan direction seems completely plausible ;)
> Strange, if you mainly insert at the beginning, it should be "mostly
> broken", because then most of the identical part is the suffix (the
> part whose scan is broken). But maybe it's only broken under certain
> edge conditions, so that could still explain it...
It was also a relatively short file (<100k)
and therefore, perhaps, only a single hunk.
> Maybe it's best to refer to "datasource*s*_open()", as opposed to the
> original "datasource_open()". In the original diff algorithm,
> datasources were opened one by one (extracting tokens, inserting them
> into the token tree, and then moving on to the next datasource). In
> the diff-optimizations-bytes branch, I introduced datasource*s*_open
> to open them together, in order to chop off the identical
> prefix/suffix. It's probably not a good function name, because it's so
> similar to the old one (maybe "datasource_open_all" or something would
> be better). But OTOH: when I started with this, I thought I could
> remove/deprecate datasource_open once all diffN algorithms used
> datasource*s*_open. Oh well ...
I didn't even notice that there were two such
functions and simply worked on the one that
consumed a lot of CPU time ;)
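As an aside, the core idea behind datasource*s*_open can be sketched
in plain C roughly like this (illustrative only, not the branch code;
the real implementation must also stop on line boundaries and count
prefix_lines):

[[[
#include <stddef.h>

/* Find the length of the identical prefix and suffix of two buffers.
   Stripping these before any tokenization is what the branch does. */
static void
find_prefix_suffix(const char *a, size_t a_len,
                   const char *b, size_t b_len,
                   size_t *prefix, size_t *suffix)
{
  size_t min_len = a_len < b_len ? a_len : b_len;
  size_t p = 0, s = 0;

  while (p < min_len && a[p] == b[p])
    p++;

  /* Don't let the suffix overlap the prefix. */
  while (s < min_len - p
         && a[a_len - 1 - s] == b[b_len - 1 - s])
    s++;

  *prefix = p;
  *suffix = s;
}
]]]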
>>>> The code of scan_backward_fast is pretty difficult for me to
>>>> understand, so I'm not sure if I can spot the problem. I'll continue
>>>> staring at it for a little while ...
>>> Ok, I can't find it right now. There must be some functional
>>> difference between scan_backward_fast and scan_backward_slow.
>> One way to get closer to the problem:
>>
>> * create a backup of the current scan_backward_fast code (rename it)
>> * copy scan_backward_slow to scan_backward_fast
>> * replace all file_len with a hard-coded "2"
>> -> can give a 50% boost depending on optimizer cleverness
>> * step-by-step apply changes from the old scan_backward_fast code
> Ok, if I have some time tonight, I will try to get closer ...
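To illustrate the file_len point (a stand-alone sketch, not the branch
code): when the element count is a compile-time constant, the compiler
can unroll the comparison loop into straight-line code, which is where
a boost of that magnitude can come from.

[[[
#include <stddef.h>

/* Compare the current byte of each of n files against file 0's byte. */
static int
all_match(const char *curp[], size_t n)
{
  size_t i;

  for (i = 1; i < n; i++)
    if (*curp[0] != *curp[i])
      return 0;
  return 1;
}

/* Called as all_match(curp, 2) with a literal 2, the loop reduces to
   a single comparison that the optimizer can keep in registers. */
]]]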
>
>>> For now, some feedback on the rest of the patch:
>>>
>>> [[[
> [... snip ...]
>
>>>> + if (had_cr)
>>>> + {
>>>> + /* Check if we ended in the middle of a \r\n for one file, but \r for
>>>> + another. If so, back up one byte, so the next loop will back up
>>>> + the entire line. Also decrement *prefix_lines, since we counted one
>>>> + too many for the \r. */
>>>> + if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
>>> Here (or below DECREMENT, but within this same if condition) you need:
>>> (*prefix_lines)--;
>>>
>>> just like in the original (and *_slow) case. As explained in the
>>> comment above: if we get here, we actually counted one full prefix
>>> line too many, because we already counted one for a matching '\r',
>>> but the character after that was '\n' in one file and a different
>>> character in the other. The eol-termination differs, so this is a
>>> non-matching line which needs to be fully rolled back (hence also the
>>> extra DECREMENT here).
You are right. I had some intermediate code in
place where the increment would happen later
(thus sparing the extra decrement).
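For the record, the fixed block should then read roughly as follows
(a sketch; DECREMENT stands for the branch's pointer-decrement step
discussed above):

[[[
  if (had_cr)
    {
      /* We ended in the middle of a \r\n in one file but a bare \r in
         the other: the last line did not match after all. Undo the
         prefix_lines increment made for the '\r', and back up one byte
         so the next loop rolls back the entire line. */
      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
        {
          (*prefix_lines)--;
          DECREMENT(file, file_len);
        }
    }
]]]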
>> I'm not even sure the original code is correct at all.
>> For instance, it does not handle '\r' (Mac style) EOLs
>> properly. Not really sure how to fix that.
> Are you sure it's not correct for Mac style EOLs? Can you give an example?
You are correct: the extra check for '\n' before decrementing
the line count fixes up the line-count for Windows EOLs only.

find_prefix ("a\rb", "a\rc") -> prefix_lines = 1, had_cr = TRUE
find_prefix ("a\nb", "a\nc") -> prefix_lines = 1, had_cr = FALSE

The other case that I was thinking about seems to be unsupported
by SVN anyway:

find_prefix ("a\r\nb", "a\rc") -> prefix_lines = 0, had_cr = TRUE
find_prefix ("a\n\rb", "a\nc") -> prefix_lines = 1, had_cr = FALSE

So, you are right, I'm wrong ;)

> I thought I made it work correctly with '\r' EOLs (the very first
> iterations of this (when I was still sending it as patches to the
> dev-list) only worked for \r\n and \n, but somewhere along the way, I
> made it handle \r EOLs as well). That's what the entire block below
> "/* check for eol, and count */" is for:
> - if \r: set had_cr = TRUE, count an extra prefix_line.
> - if \n: only count an additional prefix_line if had_cr == FALSE,
> because otherwise it's the second half of a \r\n EOL, and we already
> counted it for the \r.
> - all the rest: don't care, just move on (setting had_cr = FALSE).
> - In the special case where prefix scanning ended with a mismatch
> between a '\r\n' and '\r<some other character>', decrement prefix_lines
> and go back one byte, because this line wasn't identical after all.
>
> (my older versions just counted the number of \n's, which is ok for
> both \n and \r\n).
>
> The only reason to look at EOLs is to count the number of prefix_lines
> (and to rollback the last non-matching line until the previous EOL,
> but that's done by going back until finding either a \r or an \n). All
> the rest is just comparing bytes.
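Condensed into code, the counting rules above come down to something
like this (a sketch, not the branch code verbatim; "ch" stands for the
byte on which both files still agree):

[[[
  if (ch == '\r')
    {
      (*prefix_lines)++;      /* count the line ended by \r (or \r\n) */
      had_cr = TRUE;
    }
  else if (ch == '\n')
    {
      if (!had_cr)            /* a \n after \r is the second half of  */
        (*prefix_lines)++;    /* \r\n and was already counted above   */
      had_cr = FALSE;
    }
  else
    had_cr = FALSE;
]]]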
>
> (note that it doesn't support ignoring EOL-style differences (see the
> comment "/* ### TODO: see if we can take advantage of diff options
> like ignore_eol_style or ignore_space. */"). Maybe you're referring to
> that? If you encounter a comparison between different EOL-styles, you
> just lose the prefix/suffix optimization and fall back to the
> original algorithm, which is ok for now I think, because it's not that
> common.)
>
> Cheers,
-- Stefan^2.
Received on 2011-01-08 06:15:36 CET
