Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Ivan Zhakov <ivan_at_visualsvn.com>
Date: Thu, 8 Sep 2016 18:42:37 +0300

On 5 September 2016 at 19:23, Ivan Zhakov <ivan_at_visualsvn.com> wrote:
> On 5 September 2016 at 14:46, Bert Huijben <bert_at_qqmail.nl> wrote:
>>> -----Original Message-----
>>> From: ivan_at_apache.org [mailto:ivan_at_apache.org]
>>> Sent: maandag 5 september 2016 13:33
>>> To: commits_at_subversion.apache.org
>>> Subject: svn commit: r1759233 -
>>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>>
>>> Author: ivan
>>> Date: Mon Sep 5 11:32:54 2016
>>> New Revision: 1759233
>>>
>>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>>> Log:
>>> Use SHA-1 checksum to find whether files are actually modified in working
>>> copy if timestamps don't match.
>>>
>>> Before this change we were doing this:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 2. Open pristine file.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 4. Compare filesize with pristine filesize for files that do not
>>> require translation. Assume that file is modified if the sizes differ.
>>> 5. Compare detranslated contents of working file with pristine.
>>>
>>> Now behavior is the following:
>>> 1. Compare file timestamps: if they match, assume that files didn't change.
>>> 3. Read properties from wc.db and find whether translation is required.
>>> 3. Compare filesize with pristine filesize for files that do not
>>> require translation. Assume that file is modified if the sizes differ.
>>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>>> and compare it with pristine's checksum stored in wc.db.
>>
> Hi Bert,
>
>> We looked at this before, and this change has pro-s and con-s, depending on specific use cases.
>>
> Thanks for bringing this to dev@ list, I was not aware that this topic
> was discussed before.
>
[...]

>> If the file happens to be a database file or something similar
>> there is quite commonly a change in the first 'block', when
>> there are changes somewhere later on. (Checksum, change
>> counter, etc.). File formats like sqlite were explicitly designed
>> for this (and other cheap checks), with a change counter at the start.
>
>> I don't think we should 'just change behavior' here, if we don't
>> have actual usage numbers for our users. Perhaps we should make
>> this feature configurable... or depending on filesize.
>>
>
> Let me summarize all possible cases that I considered before my
> change. First of all some definitions:
> * Text file (T) -- text file that require translation, due to eol
> style or keywords expansion
> * Text file (N) -- text file that doesn't require translation
> * Binary file -- some kind of binary file (database, pdf, zip, docx).
> Let's assume that user doesn't configure svn:eol-style and
> svn:keywords for them.
> * WS -- size of working file
> * PS -- size of pristine file
>
> * Old=xxx -- average required read size for old behavior in terms of
> working and pristine file sizes
> * New=xxx -- average required read size for new behavior in terms of
> working and pristine file sizes
>
> 1. Text file (T), not modified: Old = WS + PS, New = WS
> 2. Text file (N), not modified: Old = WS + PS, New = WS
> 3. Binary file, not modified: Old = WS + PS, New = WS
> 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
> 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
> 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
> 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS, New = WS
> 8. Text file (N), modified, different size: Old = 0, New = 0
> 9. Binary file, modified, different size: Old = 0, New = 0
>
Hi Bert,

I tested several different binary file formats for no-op/minimal change:
1. Microsoft Word (docx): change single character at the end of document:
   - filesize changes (case 9)
   - first change at offset 2,295 of 233,323
2. Microsoft Word (doc): change single character at the end of document:
   - filesize didn't change (case 6)
   - first change at offset 540 of 479,232
3. sqlite database: insert one row to wc.db (2.5mb)
   - filesize didn't change (case 6)
   - first change at offset 27
4. zip archive: change single character in one of many text files (43
mb uncompressed)
   - filesize changes (case 9)
   - first change at offset 7,182,933 of 10,352,080
5. pdf file: no-op change of 800kb file
   - filesize changes (case 9)
   - first change at offset 47 of 854,971
6. Photoshop image (psd): change one pixel in the middle
   - filesize changes (case 9)
   - first change at offset 32 of 69,615,507

With this in mind, I think we can improve the current approach so that
in would be better in all possible cases. We could do this:
1. Open pristine file
2. Read first 4-16kb from pristine and normalized working file
3. Compare them: if they are equal then close pristine file, calculate
SHA1 of normalized working file and compare it with checksum in wc.db.

This behavior would only apply for only for files larger than some
threshold (e.g 1mb) to make performance penalty for opening pristine
file negligible.

What do you think?

-- 
Ivan Zhakov

Received on 2016-09-08 17:43:06 CEST

This message: [ Message body ]
Next message: Stefan Fuhrmann: "Re: fsfs: Segfault when rep line lists the all-zeroes checksum"
Previous message: Evgeny Kotkov: "Re: Unbounded memory usage in mod_dav + mod_headers/mod_deflate/..."
In reply to: Ivan Zhakov: "Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]