On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.foad_at_wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.
Yes, I'll certainly do that in the coming days.
>> >> Also, maybe the last_chunk number could be included in the file_info
>> >> struct? Now it's calculated in several places: "last_chunk0 =
>> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> >> every time as an extra argument. Seems like the sort of info that
>> >> could be part of file_info.
>> >
>> > It looks to me like you only need to calculate it in exactly one place:
>> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating
>> > for efficiency, as it's a trivial calculation.
>>
>> Hm, I don't know. That's recalculating it for every byte in the
>> prefix. In the files I'm testing there could be 1 Mb or more of
>
> No, no, not every byte - only 1 in 131072 of the calls need to calculate
> it - only when it reaches the end of a block.
Oh right. I guess it's ok then.
> May I ask whether you've tested the code that crosses from one chunk to
> another? I noticed the following, just now:
>
>> + *at_least_one_end_reached =
>> + file_baton->curp[idx0] == file_baton->endp[idx0]
>> + || file_baton->curp[idx1] == file_baton->endp[idx1];
>> + }
>> +
>> + /* If both files reached their end (i.e. are fully identical),
>> we're done */
>> + if (file_baton->curp[idx0] == file_baton->endp[idx0]
>> + && file_baton->curp[idx1] == file_baton->endp[idx1])
>
> Those tests seem to be saying "if we're at the end of the current chunk
> then we're at the end of the file". That looks wrong. What do you
> think?
Well, it does work :-). The code is correct, but it's kind of
difficult to see, so that's my fault.
The reason is that increment_pointer_or_chunk takes care that curp
never equals endp (if curp==endp-1 it reads the next chunk), unless it
reaches end of file. See this snippet of increment_pointer_or_chunk:
+ if (*chunk_number == last_chunk_number)
+ (*curp)++; /* *curp == *endp with last chunk signals end of file */
If you think this is too obscure to handle this, I agree, but I
couldn't think of a better/easier way at that time. If you have any
suggestions, feel free :-). But maybe I should refactor the patch
first, so it's easier to see how this would work out best.
The same "obscurity" also applies to reaching "beginning of file"
(which can happen in decrement_pointer_or_chunk, either during prefix
scanning (rolling back the first line) or during suffix scanning (if
there was no prefix, but still a difference in the first line)). There
it's done by setting the chunk number to -1, which is checked for in
find_identical_prefix and find_identical_suffix:
+ if (*chunk_number == 0)
+ (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */
Not ideal, but it works ...
Cheers,
--
Johan
Received on 2010-10-12 13:06:16 CEST