Re: My take on the diff-optimizations-bytes branch

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 3 Jan 2011 13:16:04 +0100

On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
<stefanfuhrmann_at_alice-dsl.de> wrote:
> On 03.01.2011 02:14, Johan Corveleyn wrote:
>>
>> On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn<jcorvel_at_gmail.com>
>> wrote:
>>>
[... snip ...]

>>> And it's fast too! It's taking only 58 seconds in "diff", vs. 72 for
>>> the normal version. Splitting that up further, it's taking only 28
>>> seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
>>> some reason, the lcs processing took a little longer, but that's
>>> probably unrelated (only did a single run)). And that's with
>>> scan_backward_slow. That's pretty amazing.
>
> On x64, datasource_open() got about 10x faster
> but my test-case was the TSVN changelog where
> we (mainly) insert at the beginning instead of
> close to end of the file. So, one broken scan
> direction seems completely plausible ;)

Strange, if you mainly insert at the beginning, it should be "mostly
broken", because then it's mostly identical suffix (the part which is
broken). But maybe it's only broken under certain edge conditions, so
that could still explain it...

Maybe it's best to refer to "datasource*s*_open()", as opposed to the
original "datasource_open()". In the original diff algorithm,
datasources were opened each one by one (and then extracting tokens,
inserting them into token tree, and then move on to the next
datasource). In diff-optimizations-bytes branch, I introduced
datasource*s*_open, to open them together, in order to chop off the
identical prefix/suffix. It's probably not a good function name,
because it's so similar to the old one (maybe "datasource_open_all" or
something would be better). But OTOH: when I started with this, I
thought I could remove/deprecate the datasource_open, once all diffN
algorithms would use datasource*s*_open. Oh well ...

>>> The code of scan_backward_fast is pretty difficult to understand for
>>> me. So I'm not sure if I can spot the problem. I'll continue staring
>>> at it for a little while ...
>>
>> Ok, I can't find it right now. There must be some functional
>> difference between scan_backward_fast and scan_backward_slow.
>
> One way to get closer to the problem:
>
> * create a backup of the current rename scan_backward_fast code
> * copy scan_backward_slow to scan_backward_fast
> * replace all file_len with a hard-coded "2"
> -> can give a 50% boost depending on optimizer cleverness
> * step-by-step apply changes from the old scan_backward_fast code

Ok, if I have some time tonight, I will try to get closer ...

>> For now, some feedback on the rest of the patch:
>>
>> [[[

[... snip ...]

>>> + if (had_cr)
>>> + {
>>> + /* Check if we ended in the middle of a \r\n for one file, but \r
>>> for
>>> + another. If so, back up one byte, so the next loop will back up
>>> + the entire line. Also decrement *prefix_lines, since we counted
>>> one
>>> + too many for the \r. */
>>> + if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
>>
>> Here (or below DECREMENT, but within this same if condition) you need:
>> (*prefix_lines)--;
>>
>> just like in the original (and *_slow) case. As is explained in the
>> comment above: if we are here, we actually counted a full prefix line
>> too much, because we already counted one for a matching '\r', but the
>> character after that was '\n' in one file, but a different character
>> in the other. So the eol-termination is different, so this is a
>> non-matching line which needs to be fully rolled back (hence also the
>> extra DECREMENT here).
>>
> I'm not even sure the original code is correct at all.
> For instance, it does not handle '\r' (Mac style) EOLs
> properly. Not really sure how to fix that.

Are you sure it's not correct for Mac style EOLs? Can you give an example?

I thought I made it work correctly with '\r' EOLs (the very first
iterations of this (when I was still sending it as patches to the
dev-list) only worked for \r\n and \n, but somewhere along the way, I
made it handle \r EOLs as well). That's what the entire block below
"/* check for eol, and count */" is for:
- if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if the had_cr == FALSE,
because otherwise it's the second half of a \r\n EOL, and we already
counted it for the \r.
- all the rest: don't care, just move on (setting had_cr = FALSE).
- In the special case where prefix scanning ended with a mismatch
between a '\r\n' and '\r<some other character>', decrement prefix_line
and go back one byte, because this line wasn't identical after all.

(my older versions just counted the number of \n's, which is ok for
both \n and \r\n).

The only reason to look at EOLs is to count the number of prefix_lines
(and to rollback the last non-matching line until the previous EOL,
but that's done by going back until finding either a \r or an \n). All
the rest is just comparing bytes.

(note that it doesn't support ignoring EOL-style differences (see the
comment "/* ### TODO: see if we can take advantage of diff options
like ignore_eol_style or ignore_space. */"). Maybe you're referring to
that? If you encounter comparison between different EOL-styles, you
just lose the prefix/suffix optimization, and fall back to the
original algorithm, which is ok for now I think, because it's not that
common.).

Cheers,

-- 
Johan

Received on 2011-01-03 13:17:01 CET

This message: [ Message body ]
Next message: Gavin Beau Baumanis: "Re: [PATCH] tweaked svnrdump to handle junk values with "--version""
Previous message: Stefan Fuhrmann: "Re: My take on the diff-optimizations-bytes branch"
In reply to: Stefan Fuhrmann: "Re: My take on the diff-optimizations-bytes branch"
Next in thread: Stefan Fuhrmann: "Re: My take on the diff-optimizations-bytes branch"
Reply: Stefan Fuhrmann: "Re: My take on the diff-optimizations-bytes branch"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]