Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Branko Čibej <brane_at_apache.org>
Date: Tue, 6 Sep 2016 12:19:49 +0200

On 05.09.2016 19:09, Stefan Hett wrote:
> On 9/5/2016 6:23 PM, Ivan Zhakov wrote:
>> With all of the above, the new behavior should work better or the same
>> in all cases. I agree that the 50% approximation may be incorrect for
>> some specific binary formats (case 6), such as an sqlite db.
> To be fair, I'd argue that in the case of binary file modifications
> the approximation is quite far off. Most binary formats (if not all)
> in our repository differ in the first couple of bytes (if they were
> changed), so it makes a significant difference whether we read the
> full contents of a single file (which might be >100MB) or just the
> first few bytes of two files.
>
> As Bert already suggested, I totally support the statement that it's
> quite a common design pattern for binary formats to have some
> checksum, time stamp, counter value, file size record, etc. at the
> beginning of the file contents, which is likely to differ if the file
> has changed. If you then take the file size differences between text
> files and binary files into account (text files usually being quite
> small, binary files usually being quite large), the early difference
> expected in the binary file comparison case certainly has the
> potential to matter quite a lot.
>
> FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k
> block and another for the full file) therefore sounds like a
> reasonable suggestion.
>
> Last but not least, the throughput of calculating the SHA-1 is also
> restricted by the I/O throughput in practice. For working directories,
> I'd assume it's not unlikely that they still reside on an HDD (rather
> than some faster cache or an SSD), so it'd be limited to around 20 MB/s
> in practice. Given large binary files, this might make a significant
> difference in certain (not uncommon) use cases.
>
> Don't get me wrong: I agree that the SHA-1 approach is superior
> (especially on Windows machines, since it will reduce the cases where
> two files have to be opened; think of the impact of antivirus
> scanners). I just share Bert's opinion that the approach should be
> improved a bit, especially in light of binary file support.
>
> If it would be of any help, I could do some performance measurements
> with the two approaches on our repository to get some real world
> numbers to work with.
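
For readers following the quoted two-checksum idea, here is a minimal
Python sketch of what "one SHA-1 for the first 4k block and another for
the full file" could look like. It is only an illustration under stated
assumptions: the names PREFIX_SIZE, checksums, prefix_sha1 and
is_modified are hypothetical, and nothing here reflects how libsvn_wc
actually stores or compares checksums.

```python
import hashlib

PREFIX_SIZE = 4096  # assumed size of the "first 4k block"


def checksums(path, chunk_size=1 << 16):
    """Return (prefix_sha1, full_sha1) hex digests in a single pass."""
    prefix = hashlib.sha1()
    full = hashlib.sha1()
    remaining = PREFIX_SIZE
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            full.update(chunk)
            if remaining > 0:
                # Feed only the bytes that still belong to the prefix.
                prefix.update(chunk[:remaining])
                remaining -= min(remaining, len(chunk))
    return prefix.hexdigest(), full.hexdigest()


def prefix_sha1(path):
    """Hash only the first PREFIX_SIZE bytes of the file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(PREFIX_SIZE)).hexdigest()


def is_modified(path, recorded_prefix, recorded_full):
    """Hypothetical check: cheap prefix hash first, full hash only if needed."""
    if prefix_sha1(path) != recorded_prefix:
        return True  # differs within the first 4 KiB; no full read needed
    return checksums(path)[1] != recorded_full
```

With this shape, a file whose header changed is detected after reading
4 KiB, while an unmodified file (or one that changed only further in)
still costs a full read.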

We discussed the two approaches to death years ago and decided to keep
the behaviour prior to r1759233, partly because we had no evidence one
way or the other. Changing the behaviour without having performance
measurements in hand doesn't seem like a really smart move to me ... so
yes, every little bit of extra data would surely help.

BTW, I strongly suggest not using the Subversion tree as a test case;
it's mostly small text files and so not very representative.
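
As a starting point for such measurements, here is a rough Python
sketch that times the two strategies under discussion: comparing two
files chunk by chunk with an early exit, versus hashing one file and
comparing against a recorded SHA-1. All function names are hypothetical
and this is not Subversion code; real numbers would also need to
control for the OS file cache, which this sketch does not attempt.

```python
import hashlib
import time


def files_differ(path_a, path_b, chunk_size=1 << 16):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return True
            if not chunk_a:
                return False  # both files ended without a difference


def sha1_differs(path, recorded_sha1, chunk_size=1 << 16):
    """Hash the whole file and compare against a recorded hex digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() != recorded_sha1


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Running timed(files_differ, working_file, pristine_file) and
timed(sha1_differs, working_file, recorded) over a tree with large
binary files would give per-file timings for both approaches.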

-- Brane
Received on 2016-09-06 12:19:53 CEST
