Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Ivan Zhakov <ivan_at_visualsvn.com>
Date: Mon, 5 Sep 2016 19:23:11 +0300

On 5 September 2016 at 14:46, Bert Huijben <bert_at_qqmail.nl> wrote:
>> -----Original Message-----
>> From: ivan_at_apache.org [mailto:ivan_at_apache.org]
>> Sent: maandag 5 september 2016 13:33
>> To: commits_at_subversion.apache.org
>> Subject: svn commit: r1759233 -
>> /subversion/trunk/subversion/libsvn_wc/questions.c
>>
>> Author: ivan
>> Date: Mon Sep 5 11:32:54 2016
>> New Revision: 1759233
>>
>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
>> Log:
>> Use SHA-1 checksum to find whether files are actually modified in working
>> copy if timestamps don't match.
>>
>> Before this change we were doing this:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 2. Open pristine file.
>> 3. Read properties from wc.db and find whether translation is required.
>> 4. Compare filesize with pristine filesize for files that do not
>> require translation. Assume that file is modified if the sizes differ.
>> 5. Compare detranslated contents of working file with pristine.
>>
>> Now behavior is the following:
>> 1. Compare file timestamps: if they match, assume that files didn't change.
>> 2. Read properties from wc.db and find whether translation is required.
>> 3. Compare filesize with pristine filesize for files that do not
>> require translation. Assume that file is modified if the sizes differ.
>> 4. Calculate SHA-1 checksum of detranslated contents of working file
>> and compare it with pristine's checksum stored in wc.db.
>
Hi Bert,

> We looked at this before, and this change has pro-s and con-s, depending on specific use cases.
>
Thanks for bringing this to the dev@ list; I was not aware that this topic
had been discussed before.

> With the compare to SHA we only have to read the new file, but we
> always have to read the file 100%.
>
> With the older system we could bail on the first detected change.
>
I considered this trade-off. See below.

> If there is a change somewhere both systems read on average
> 100% of the filesize... only if there is no actual change except
> for the timestamp, the new system is less expensive.
>
As far as I understand, the average characteristics are:
1. Files are equal:
a) old behavior: 100% read of working file + 100% read of pristine
file = 200% of working file size.
b) new behavior: 100% read of working file = 100% of working file size.

2. Files are modified (but have the same size or require translation (!)):
a) old behavior: 50% (average) read of working file + 50% (average)
read of pristine file = 100% of working file size.
b) new behavior: 100% read of working file = 100% of working file size.

(Strictly speaking, average read size would also depend on the number
of modifications, and it could be less than 50%.)
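
To make the two read patterns above concrete, here is a minimal sketch
in Python (not the libsvn_wc code). The function names and the 4096-byte
block size are assumptions made for illustration only; the sketch just
counts how many bytes each approach reads.

```python
import hashlib
import io

CHUNK = 4096  # arbitrary block size, assumed for this sketch


def compare_streams(working, pristine, chunk=CHUNK):
    """Old-style check: read both streams block by block and stop at the
    first differing block. Returns (modified, total_bytes_read)."""
    total = 0
    while True:
        a = working.read(chunk)
        b = pristine.read(chunk)
        total += len(a) + len(b)
        if a != b:
            return True, total      # bail out on the first difference
        if not a:
            return False, total     # both streams ended together


def compare_sha1(working, pristine_sha1, chunk=CHUNK):
    """New-style check: hash the whole working stream and compare it with
    a stored hex digest. Returns (modified, total_bytes_read)."""
    digest = hashlib.sha1()
    total = 0
    while True:
        a = working.read(chunk)
        if not a:
            break
        digest.update(a)
        total += len(a)
    return digest.hexdigest() != pristine_sha1, total


# Usage: an unmodified file, then one changed in the middle.
pristine = bytes(100 * CHUNK)
stored = hashlib.sha1(pristine).hexdigest()
changed = bytearray(pristine)
changed[50 * CHUNK] ^= 0xFF

print(compare_streams(io.BytesIO(pristine), io.BytesIO(pristine)))
print(compare_sha1(io.BytesIO(pristine), stored))
print(compare_streams(io.BytesIO(bytes(changed)), io.BytesIO(pristine)))
print(compare_sha1(io.BytesIO(bytes(changed)), stored))
```

For equal inputs the stream comparison reads both files completely,
while the hash check reads only the working file. For a single change
the stream comparison stops at the block containing it.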

Also, libsvn_wc checks the working file size for files that don't require
translation before comparing contents. Keyword expansion/newline
translation doesn't make sense for binary files (like database, pdf,
docx), and for most binary file formats a modification involves
changing the file size.

(There was a problem in the old behavior because the pristine file was
opened *before* comparing the working file size. Fixing that would
require an additional SQLite operation.)
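
The order of checks in the new behavior (timestamp, then size for files
that need no translation, then SHA-1) can be sketched roughly as follows.
This is an illustrative Python sketch under stated assumptions, not the
actual libsvn_wc implementation: is_modified, its parameters and the
detranslate callback are hypothetical names, and the recorded values
stand in for what wc.db stores.

```python
import hashlib
import os


def is_modified(path, recorded_mtime_ns, recorded_size, pristine_sha1,
                detranslate=None):
    """Hypothetical sketch of the check order described above.

    detranslate is an assumed callback (bytes -> bytes) that is passed
    only for files that require translation (eol style, keywords).
    """
    st = os.stat(path)

    # 1. Timestamps match: assume the file did not change.
    if st.st_mtime_ns == recorded_mtime_ns:
        return False

    # 2-3. Without translation, a size difference already proves a change,
    # so no file content has to be read at all.
    if detranslate is None and st.st_size != recorded_size:
        return True

    # 4. Otherwise read the working file once and compare SHA-1 checksums.
    # (The whole file is read into memory here to keep the sketch short.)
    with open(path, "rb") as f:
        data = f.read()
    if detranslate is not None:
        data = detranslate(data)
    return hashlib.sha1(data).hexdigest() != pristine_sha1
```

Note that the pristine file is never opened in this sketch; only its
recorded size and checksum are used.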

> If the file happens to be a database file or something similar
> there is quite commonly a change in the first 'block', when
> there are changes somewhere later on. (Checksum, change
> counter, etc.). File formats like sqlite were explicitly designed
> for this (and other cheap checks), with a change counter at the start.

> I don't think we should 'just change behavior' here, if we don't
> have actual usage numbers for our users. Perhaps we should make
> this feature configurable... or depending on filesize.
>

Let me summarize all possible cases that I considered before my
change. First of all some definitions:
* Text file (T) -- text file that requires translation, due to eol
style or keyword expansion
* Text file (N) -- text file that doesn't require translation
* Binary file -- some kind of binary file (database, pdf, zip, docx).
Let's assume that the user doesn't configure svn:eol-style and
svn:keywords for them.
* WS -- size of working file
* PS -- size of pristine file

* Old=xxx -- average required read size for old behavior in terms of
working and pristine file sizes
* New=xxx -- average required read size for new behavior in terms of
working and pristine file sizes

1. Text file (T), not modified: Old = WS + PS, New = WS
2. Text file (N), not modified: Old = WS + PS, New = WS
3. Binary file, not modified: Old = WS + PS, New = WS
4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS, New = WS
8. Text file (N), modified, different size: Old = 0, New = 0
9. Binary file, modified, different size: Old = 0, New = 0
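
The nine cases above follow from three rules, which the following small
Python sketch encodes. The function name and argument layout are made up
for illustration, and the 0.5 factor is the same rough "average"
approximation used in the list, not a measured value.

```python
def expected_read(kind, modified, same_size, ws, ps):
    """Return (old, new) average bytes read for one case.

    kind: "T" (text, needs translation), "N" (text, no translation)
          or "B" (binary, assumed to need no translation).
    ws, ps: working and pristine file sizes.
    """
    if not modified:
        # Both behaviors must read everything they look at.
        return ws + ps, ws
    if kind != "T" and not same_size:
        # The size check alone detects the change; nothing is read.
        return 0, 0
    # Old: bails on average halfway through both files. New: full hash pass.
    return 0.5 * ws + 0.5 * ps, ws


# Usage: print the table for a 100-byte working and pristine file.
for modified, same_size in [(False, True), (True, True), (True, False)]:
    for kind in ("T", "N", "B"):
        old, new = expected_read(kind, modified, same_size, 100, 100)
        print(kind, modified, same_size, "Old =", old, "New =", new)
```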

(There is some overhead for the SHA1 calculation: SHA1 performance is
about 200-500 MB/s, but that is out of scope for now.)

Given all of the above, the new behavior should work better or the same
in all cases. I agree that the 50% approximation may be incorrect for some
specific binary formats (case 6), like an sqlite db.

> We certainly want the new behavior for non-pristine working copies
> (on the IDEA list for years), but I'm not sure if we always want this
> behavior as only option.
>
> This mail is partially, to just discuss this topic on the list, to make sure everybody knows what happened here and why.

-- 
Ivan Zhakov

Received on 2016-09-05 18:23:39 CEST
