Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Stefan Hett <stefan_at_egosoft.com>
Date: Mon, 5 Sep 2016 19:09:19 +0200

On 9/5/2016 6:23 PM, Ivan Zhakov wrote:
> With all above the new behavior should be working better or the same
> in all cases. I agree that 50% approximation may be incorrect for some
> specific binary formats (case 6) like sqlite db.
To be fair, I'd argue that in the case of binary file modifications the
approximation is quite off. Most binary formats (if not all) in our
repository differ in the first couple of bytes if they were changed,
and therefore it makes quite a significant difference whether we read the
full contents of a single file (which might be >100MB) or just the
first few bytes of two files.
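
To illustrate the point above, here is a minimal sketch of a comparison that
stops at the first differing block. It is not Subversion's actual code; the
function name `files_differ` and the 4096-byte block size are assumptions
made for this example only.

```python
def files_differ(path_a, path_b, chunk=4096):
    """Compare two files block by block, stopping at the first difference.

    Returns (differ, bytes_read_per_file) so the cost of the comparison
    is visible. Hypothetical helper, for illustration only.
    """
    read = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk)
            block_b = b.read(chunk)
            read += max(len(block_a), len(block_b))
            if block_a != block_b:
                # Early exit: nothing after this block needs to be read.
                return True, read
            if not block_a:
                # Both files ended without any difference.
                return False, read
```

If the files differ within the first block, only one block per file is read,
no matter how large the files are. Only identical files are read to the end.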

As Bert already suggested, it's quite a common design pattern for binary
formats to keep some checksum, time stamp, counter value, file size
record, etc. at the beginning of the file, and that is likely to differ
if the file has changed. If you then take the size difference between
text files and binary files into account (text files are usually quite
small, while binary files are usually quite large), the fact that a
difference is expected early in the binary file comparison case
certainly has the potential to matter a lot.

FWIW: Markus' idea to keep two SHA-1 checksums (one for the first 4k
block and another for the full file) therefore sounds like a reasonable
suggestion.
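
A rough sketch of how such a two-checksum check could look, assuming both
checksums were recorded earlier for the unmodified file. This is not code
from Subversion; the names `sha1_prefix`, `sha1_full` and `is_modified` are
hypothetical and only illustrate the idea.

```python
import hashlib

PREFIX_SIZE = 4096  # the "first 4k block" from the suggestion


def sha1_prefix(path, size=PREFIX_SIZE):
    """SHA-1 of the first `size` bytes of the file (fewer if the file is shorter)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(size)).hexdigest()


def sha1_full(path, chunk=1 << 20):
    """SHA-1 of the whole file, read in chunks to keep memory use bounded."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def is_modified(path, recorded_prefix, recorded_full):
    """Cheap prefix check first; hash the full file only if the prefix matches."""
    if sha1_prefix(path) != recorded_prefix:
        return True  # a changed header is detected after reading only 4k
    return sha1_full(path) != recorded_full
```

With this layout a file whose first 4k changed is detected after a single
small read. An unmodified file, or one changed only after the first 4k,
still needs the full read.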

Last but not least, the throughput of calculating the SHA-1 is also
restricted by the I/O throughput in practice. I'd assume it's not too
unlikely for working directories to still reside on some HDD (rather
than some faster cache or an SSD), so it'd be limited to around 20 MB/s
in practice. Given large binary files this might make a significant
difference in certain (not uncommon) use-cases.
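
As a back-of-the-envelope illustration using the figures mentioned in this
mail (a 100 MB file, roughly 20 MB/s): the sketch below assumes purely
sequential reads, ignores seek time and caching, and uses decimal megabytes.
The helper `read_seconds` is hypothetical.

```python
MB = 10**6  # decimal megabytes, an assumption for this illustration


def read_seconds(size_bytes, throughput_bytes_per_s):
    """Time to read size_bytes sequentially at the given throughput (seeks ignored)."""
    return size_bytes / throughput_bytes_per_s


# Hashing requires reading the whole 100 MB file.
full_file = read_seconds(100 * MB, 20 * MB)

# Comparing only the first 4k of two files reads 8k in total.
two_prefixes = 2 * read_seconds(4096, 20 * MB)
```

Under these assumptions the full read takes about 5 seconds, while the two
4k reads stay well under a millisecond of transfer time.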

Don't get me wrong: I agree that the SHA-1 approach is superior
(especially on Windows machines, since it reduces the cases where two
files have to be opened; think of the impact of anti-virus scanners). I
just share Bert's opinion that the approach should be improved a bit,
especially in light of binary file support.

If it would be of any help, I could do some performance measurements
with the two approaches on our repository to get some real world numbers
to work with.

-- 
Regards,
Stefan Hett

Received on 2016-09-05 19:09:29 CEST
