Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Bert Huijben <bert_at_qqmail.nl>
Date: Mon, 5 Sep 2016 13:46:07 +0200

> -----Original Message-----
> From: ivan_at_apache.org [mailto:ivan_at_apache.org]
> Sent: maandag 5 september 2016 13:33
> To: commits_at_subversion.apache.org
> Subject: svn commit: r1759233 -
> /subversion/trunk/subversion/libsvn_wc/questions.c
>
> Author: ivan
> Date: Mon Sep 5 11:32:54 2016
> New Revision: 1759233
>
> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> Log:
> Use SHA-1 checksum to find whether files are actually modified in working
> copy if timestamps don't match.
>
> Before this change we were doing this:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 2. Open pristine file.
> 3. Read properties from wc.db and find whether translation is required.
> 4. Compare filesize with pristine filesize for files that do not
> require translation. Assume that file is modified if the sizes differ.
> 5. Compare detranslated contents of working file with pristine.
>
> Now behavior is the following:
> 1. Compare file timestamps: if they match, assume that files didn't change.
> 2. Read properties from wc.db and find whether translation is required.
> 3. Compare filesize with pristine filesize for files that do not
> require translation. Assume that file is modified if the sizes differ.
> 4. Calculate SHA-1 checksum of detranslated contents of working file
> and compare it with pristine's checksum stored in wc.db.
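
For readers following along, the new sequence in the log message can be sketched roughly as follows. This is a Python illustration only, not the libsvn_wc code: the function name `is_modified` and its parameters (the recorded timestamp, size and SHA-1 that wc.db would supply) are hypothetical, and translation (keywords, eol-style) is reduced to a boolean flag with the detranslation step itself left out.

```python
import hashlib
import os


def is_modified(path, recorded_mtime_ns, recorded_size, pristine_sha1,
                needs_translation=False):
    """Hypothetical sketch of the new check order; not Subversion's code."""
    st = os.stat(path)

    # Step 1: matching timestamp, assume the file did not change.
    if st.st_mtime_ns == recorded_mtime_ns:
        return False

    # Steps 2-3: if no translation is required, a size difference
    # already means the file is modified.
    if not needs_translation and st.st_size != recorded_size:
        return True

    # Step 4: hash the whole working file and compare with the
    # recorded pristine checksum. (Detranslation is omitted here.)
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    return sha1.hexdigest() != pristine_sha1
```

Note that step 4 never opens the pristine file; it only needs the checksum string.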

We looked at this before, and this change has pros and cons, depending on the specific use case.

With the comparison against the SHA-1 we only have to read the working file, but we always have to read all of it.

With the older system we could bail on the first detected change.

If there is a change somewhere, both systems read on average 100% of the file size. Only if there is no actual change except for the timestamp is the new system less expensive.
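
To make the trade-off concrete, here is a small sketch that counts the bytes each strategy reads. It is an illustration under simplifying assumptions, not the Subversion implementation: the helper names `bytes_read_compare` and `bytes_read_sha1` and the 4 KiB block size are made up for this example, and in-memory streams stand in for the working and pristine files.

```python
import hashlib
import io


def bytes_read_compare(working, pristine, block=4096):
    """Old style: read both streams block by block, stop at first difference.

    Returns (modified, total_bytes_read).
    """
    total = 0
    while True:
        a = working.read(block)
        b = pristine.read(block)
        total += len(a) + len(b)
        if a != b:
            return True, total    # difference found, bail out early
        if not a:
            return False, total   # both streams at EOF, identical


def bytes_read_sha1(working, pristine_sha1, block=4096):
    """New style: hash the whole working stream, compare with stored SHA-1.

    Returns (modified, total_bytes_read).
    """
    sha1 = hashlib.sha1()
    total = 0
    while True:
        a = working.read(block)
        if not a:
            break
        total += len(a)
        sha1.update(a)
    return sha1.hexdigest() != pristine_sha1, total
```

With a 1 MiB file and these 4 KiB blocks, a change in the first byte costs the comparison 8 KiB of reads against 1 MiB for the checksum, while an unchanged file costs the comparison 2 MiB against 1 MiB for the checksum.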

If the file happens to be a database file or something similar, there is quite commonly a change in the first 'block' whenever there are changes somewhere later on (checksum, change counter, etc.). File formats like SQLite's were explicitly designed for this (and other cheap checks), with a change counter at the start.
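
As an illustration of how cheap such a check can be: the SQLite file format keeps its file change counter as a 4-byte big-endian integer at byte offset 24 of the 100-byte header. The helper name `sqlite_change_counter` below is made up for this sketch. One caveat: in WAL mode the counter is not necessarily incremented on every transaction, so treat this as a demonstration of the idea rather than a reliable modification test.

```python
import struct


def sqlite_change_counter(path):
    """Return the file change counter from a SQLite 3 database header.

    Only the first 100 bytes of the file are read.
    """
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 28 or header[:16] != b"SQLite format 3\x00":
        raise ValueError("not a SQLite 3 database file")
    # Bytes 24..27: file change counter, big-endian unsigned 32-bit.
    return struct.unpack(">I", header[24:28])[0]
```

A block-by-block comparison against the pristine copy would hit this difference within the first block read, which is the early bail-out described above.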

I don't think we should 'just change behavior' here if we don't have actual usage numbers from our users. Perhaps we should make this feature configurable, or dependent on file size.

We certainly want the new behavior for non-pristine working copies (on the IDEA list for years), but I'm not sure we always want this behavior as the only option.

This mail is partly just to discuss this topic on the list, to make sure everybody knows what happened here and why.

Bert

(Note that it is Labor Day in the USA today, so I don't expect many responses until later this week.)
Received on 2016-09-05 13:46:19 CEST
