Re: Review results of diff-optimizations-bytes branch

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Mon, 7 Feb 2011 00:16:37 +0100

On Sun, Feb 6, 2011 at 9:19 PM, Stefan Fuhrmann
<stefanfuhrmann_at_alice-dsl.de> wrote:
> Hi Johan,
>
> I just finished the review of the changes on your branch.
> Except for a few minor hints, everything looks fine and
> should be merged to /trunk. The two findings can be
> addressed on /trunk directly.

Thanks. Reintegrated it in r1067800.

Some more feedback below...

> -- Stefan^2.
>
>> Index: BRANCH-README
>> ===================================================================
>> --- BRANCH-README (.../trunk) (revision 0)
>> +++ BRANCH-README (.../branches/diff-optimizations-bytes) (revision
>> 1064940)
>
> <snip>
>>
>> +Rejected ideas:
>> + * Make prefix scanning able to skip 1 or a couple of non-matching
>> lines, if
>> + it is able to find strictly more matching lines after that, to keep
>> the
>> + prefix scanning going.
>> +
>> + This is not a good idea, because it can lead to bad cases of
>> + missynchronization (see [3]):
>> +
>> + bxxaxxbxx
>> + || ||
>> + axxbxx
>> +
>> + instead of (longest common subsequence):
>> +
>> + bxxaxxbxx
>> + //////
>> + axxbxx
>> +
>
> Interesting. I think the conditions to skip N lines would be
>
> * is followed by at least 2*N exact matching lines
> (making one side of that section a "non-match" costs
> more than what we might gain for the previous lines)
> * the matching section is not a prefix to itself, e.g. does
> not match to a later part of itself (or that is at least 2N
> lines away)
>
> But I haven't done a formal proof on that.

Heh, might be a good subject for a research project :-). I'm not sure about it.

But even if it's not guaranteed to give the "longest common
subsequence", it might be a good technique if we ever get into the
area of "heuristics", for faster diffing (without guarantees that you
get a minimal diff).

>> Index: subversion/libsvn_diff/diff_memory.c
>> ===================================================================
>> --- subversion/libsvn_diff/diff_memory.c (.../trunk) (revision
>> 1064940)
>> +++ subversion/libsvn_diff/diff_memory.c
>> (.../branches/diff-optimizations-bytes) (revision 1064940)
>> @@ -88,16 +88,19 @@
>> }
>>
>>
>> -/* Implements svn_diff_fns_t::datasource_open */
>> +/* Implements svn_diff_fns2_t::datasources_open */
>> static svn_error_t *
>> -datasource_open(void *baton, svn_diff_datasource_e datasource)
>> +datasources_open(void *baton, apr_off_t *prefix_lines,
>> + svn_diff_datasource_e datasource[],
>> + apr_size_t datasource_len)
>> {
>> /* Do nothing: everything is already there and initialized to 0 */
>> + *prefix_lines = 0;
>> return SVN_NO_ERROR;
>> }
>
> So, your optimizations don't apply to string comparisons?
>
> I have no idea when this code will be run at all, but it should
> not be too hard to apply the same prefix scanning code
> to strings as well (they would appear as a single chunk of
> "file data").

IIUC, this code is only used for diffing properties. For now, I
decided to punt on that, because it didn't seem worth it, because
properties are usually not that large (and we don't have something
like "blame" for properties). Although, if you look at some
svn:mergeinfo properties ...

Anyway, like you say, we could probably easily implement the
prefix/suffix scanning here as well.

>> Index: subversion/libsvn_diff/diff_file.c
>> ===================================================================
>> --- subversion/libsvn_diff/diff_file.c (.../trunk) (revision
>> 1064940)
>> +++ subversion/libsvn_diff/diff_file.c
>> (.../branches/diff-optimizations-bytes) (revision 1064940)
>
> <snip>
>>
>> +static svn_error_t *
>> +datasources_open(void *baton, apr_off_t *prefix_lines,
>> + svn_diff_datasource_e datasource[],
>> + apr_size_t datasource_len)
>> +{
>> svn_diff__file_baton_t *file_baton = baton;
>> - struct file_info *file =
>> &file_baton->files[datasource_to_index(datasource)];
>> - apr_finfo_t finfo;
>> - apr_off_t length;
>> - char *curp;
>> - char *endp;
>> + struct file_info files[4];
>> + apr_finfo_t finfo[4];
>> + apr_off_t length[4];
>
> Maybe, the "magic" 4 should be replaced by some
> constant / define. Also, we can afford for a check
> at the start of this function comparing datasource_len
> to 4 and returning an error upon overflow.

Yes, good idea.

Thanks,

-- 
Johan

Received on 2011-02-07 00:17:38 CET

This message: [ Message body ]
Next message: Daniel Shahaf: "cache-membuffer compiler warnings"
Previous message: Gavin Beau Baumanis: "Re: svn update again"
In reply to: Stefan Fuhrmann: "Review results of diff-optimizations-bytes branch"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]