Re: Check SHA vs Content (was: RE: svn commit: r1759233 - /subversion/trunk/subversion/libsvn_wc/questions.c)

From: Bert Huijben <bert_at_qqmail.nl>
Date: Mon, 20 Feb 2017 09:05:25 +0100

This code is still in trunk without any of the discussed improvements, so
this change is currently part of 1.10.0-alpha1.

If we don't implement the improvements I think we should check if we want
to revert to the 1.0-1.9 behavior before we really look at releasing 1.10.

See discussion below

Bert

On Thu, Sep 8, 2016 at 5:42 PM, Ivan Zhakov <ivan_at_visualsvn.com> wrote:

> On 5 September 2016 at 19:23, Ivan Zhakov <ivan_at_visualsvn.com> wrote:
> > On 5 September 2016 at 14:46, Bert Huijben <bert_at_qqmail.nl> wrote:
> >>> -----Original Message-----
> >>> From: ivan_at_apache.org [mailto:ivan_at_apache.org]
> >>> Sent: maandag 5 september 2016 13:33
> >>> To: commits_at_subversion.apache.org
> >>> Subject: svn commit: r1759233 -
> >>> /subversion/trunk/subversion/libsvn_wc/questions.c
> >>>
> >>> Author: ivan
> >>> Date: Mon Sep 5 11:32:54 2016
> >>> New Revision: 1759233
> >>>
> >>> URL: http://svn.apache.org/viewvc?rev=1759233&view=rev
> >>> Log:
> >>> Use SHA-1 checksum to find whether files are actually modified in
> working
> >>> copy if timestamps don't match.
> >>>
> >>> Before this change we were doing this:
> >>> 1. Compare file timestamps: if they match, assume that files didn't
> change.
> >>> 2. Open pristine file.
> >>> 3. Read properties from wc.db and find whether translation is required.
> >>> 4. Compare filesize with pristine filesize for files that do not
> >>> require translation. Assume that file is modified if the sizes
> differ.
> >>> 5. Compare detranslated contents of working file with pristine.
> >>>
> >>> Now behavior is the following:
> >>> 1. Compare file timestamps: if they match, assume that files didn't
> change.
> >>> 3. Read properties from wc.db and find whether translation is required.
> >>> 3. Compare filesize with pristine filesize for files that do not
> >>> require translation. Assume that file is modified if the sizes
> differ.
> >>> 4. Calculate SHA-1 checksum of detranslated contents of working file
> >>> and compare it with pristine's checksum stored in wc.db.
> >>
> > Hi Bert,
> >
> >> We looked at this before, and this change has pro-s and con-s,
> depending on specific use cases.
> >>
> > Thanks for bringing this to dev@ list, I was not aware that this topic
> > was discussed before.
> >
> [...]
>
> >> If the file happens to be a database file or something similar
> >> there is quite commonly a change in the first 'block', when
> >> there are changes somewhere later on. (Checksum, change
> >> counter, etc.). File formats like sqlite were explicitly designed
> >> for this (and other cheap checks), with a change counter at the start.
> >
> >> I don't think we should 'just change behavior' here, if we don't
> >> have actual usage numbers for our users. Perhaps we should make
> >> this feature configurable... or depending on filesize.
> >>
> >
> > Let me summarize all possible cases that I considered before my
> > change. First of all some definitions:
> > * Text file (T) -- text file that require translation, due to eol
> > style or keywords expansion
> > * Text file (N) -- text file that doesn't require translation
> > * Binary file -- some kind of binary file (database, pdf, zip, docx).
> > Let's assume that user doesn't configure svn:eol-style and
> > svn:keywords for them.
> > * WS -- size of working file
> > * PS -- size of pristine file
> >
> > * Old=xxx -- average required read size for old behavior in terms of
> > working and pristine file sizes
> > * New=xxx -- average required read size for new behavior in terms of
> > working and pristine file sizes
> >
> > 1. Text file (T), not modified: Old = WS + PS, New = WS
> > 2. Text file (N), not modified: Old = WS + PS, New = WS
> > 3. Binary file, not modified: Old = WS + PS, New = WS
> > 4. Text file (T), modified, same size: Old = 0.5 * WS + 0.5 * PS, New =
> WS
> > 5. Text file (N), modified, same size: Old = 0.5 * WS + 0.5 * PS, New =
> WS
> > 6. Binary file, modified, same size: Old = 0.5 * WS + 0.5 * PS, New = WS
> > 7. Text file (T), modified, different size: Old = 0.5 * WS + 0.5 * PS,
> New = WS
> > 8. Text file (N), modified, different size: Old = 0, New = 0
> > 9. Binary file, modified, different size: Old = 0, New = 0
> >
> Hi Bert,
>
> I tested several different binary file formats for no-op/minimal change:
> 1. Microsoft Word (docx): change single character at the end of document:
> - filesize changes (case 9)
> - first change at offset 2,295 of 233,323
> 2. Microsoft Word (doc): change single character at the end of document:
> - filesize didn't change (case 6)
> - first change at offset 540 of 479,232
> 3. sqlite database: insert one row to wc.db (2.5mb)
> - filesize didn't change (case 6)
> - first change at offset 27
> 4. zip archive: change single character in one of many text files (43
> mb uncompressed)
> - filesize changes (case 9)
> - first change at offset 7,182,933 of 10,352,080
> 5. pdf file: no-op change of 800kb file
> - filesize changes (case 9)
> - first change at offset 47 of 854,971
> 6. Photoshop image (psd): change one pixel in the middle
> - filesize changes (case 9)
> - first change at offset 32 of 69,615,507
>
> With this in mind, I think we can improve the current approach so that
> in would be better in all possible cases. We could do this:
> 1. Open pristine file
> 2. Read first 4-16kb from pristine and normalized working file
> 3. Compare them: if they are equal then close pristine file, calculate
> SHA1 of normalized working file and compare it with checksum in wc.db.
>
> This behavior would only apply for only for files larger than some
> threshold (e.g 1mb) to make performance penalty for opening pristine
> file negligible.
>
> What do you think?
>
> --
> Ivan Zhakov
>
Received on 2017-02-20 09:05:42 CET

This message: [ Message body ]
Next message: Julian Foad: "Glob syntax for svndumpfilter exclude --pattern"
Previous message: Stefan: "Re: [PATCH] add support for zlibstat.lib in install-layout for Windows"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]