[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: File modification detection?

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2004-04-18 19:22:08 CEST

Ben Collins-Sussman <sussman@collab.net> writes:

> Andy Whitcroft wrote:
>> It appears that the client file modification detection is
>> incorrectly checking modification time and assuming that if that is
>> unchanged the file is unchanged even in the face of the size/inode
>> number changing.
> That is exactly what's happening. SVN uses the same algorithm as CVS
> to detect if a file has changed:
> if (file's timestamp == one recorded in .svn/entries)
> return NOT_CHANGED
> else if (size of file != size of .svn/text-base/file)
> return CHANGED
> else
> compare file and .svn/text-base/file byte-by-byte.

That's not really correct, it would not handle keywords or eol-style.

> So what you're doing is 'mv'ing a different file with identical
> timestamp on top of the file, and the algorithm is falling down.
> Developers: is this a legitimate hole in our algorithm? Should we
> change it to cover this edge-case?

It's certainly not a new problem, it's been seen lots of times on
Windows with files have the wrong eol-style in the repository.
Initially the files show up as unmodified (while the timestamps
match), once the timestamps change the eol difference means the files
show up as modified.

> I'm worried that the only way to prevent this is to always-stat the
> text-base (which will slow down things even further, since we're doing
> two file stats instead of one), or to record the text-base's size in
> our entries file.

Ha! The current wc code makes very little attempt to minimise stat
calls (or any other system call). Try it: when a file is modified the
current status code will stat the text-base several times, not to
mention stating the working file several times as well. There is an
open issue about this.

In the past I have made some optimisations, but I don't think the
current code is ever going to be efficient as far as minimising system
calls is concerned. Too many of the interfaces use char* paths which
leads to repeated calls to svn_io_check_path and svn_io_stat.

I would guess that some of the existing "optimisations" may turn out
to be pessimisations. Take svn_wc_text_modified_p as an example: the
first thing it does is detect missing working files and that involves
a call to svn_io_check_path, i.e. a stat call. If the file is present
and the code will go on to compare timestamps, which involves another
stat call. Are missing files common? Is the missing file code an
optimisation or a pessimisation?

> Still, though, I could imagine somebody 'moving' a file with identical
> timestamp *and* identical size on top of the working file, and still
> fooling the algorithm. So is this road even worth pursuing? Is there
> some nice philosophical quip that allows us to brush off this whole
> scenario? Something like, "don't fool with timestamps, kid".

In the end a byte-by-byte comparison is the only way to guarantee that
changes are detected. Making that the default would make Subversion
very slow. In the past it has been suggested that this should be
under user control. It has also been suggested that the user be
allowed to select md5sums instead of byte-by-byte comparison, as that
would involve reading the working file but not the text-base, and so
might be an optimisation when the working copy is on a network disk.

Philip Martin
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sun Apr 18 19:22:31 2004

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.