[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Data corruption bug in WC, apparently due to race condition?

From: Karl Fogel <kfogel_at_red-bean.com>
Date: Thu, 27 Jul 2017 15:27:52 -0500

Philip Martin <philip_at_codematters.co.uk> writes:
>The post-commit processing on the client side is not checking for
>modifications before recording filesize/timestamp in the nodes table in
>.svn/wc.db.

My hat is off to you for tracking this down, Philip! Thanks. Comments/questions at the end, after your transcript.

>In first terminal:
>
> $ svnadmin create repo
> $ svnmucc -mm put <(echo foo) file://`pwd`/repo/f
> $ svn co file://`pwd`/repo wc
> $ echo bar > wc/f
> $ gdb --arg svn ci -mm wc
> (gdb) b svn_client__do_commit
> (gdb) r
> hit breakpoint
> (gdb) fin
> run to end of svn_client__do_commit
>
>Switch to second terminal:
>
> $ svn st wc
> L wc
> M wc/f
> $ cat wc/.svn/pristine/*/*
> foo
> bar
> $ echo zigzag > wc/f
>
>Switch back to first terminal:
>
> (gdb) c
> (gdb) q
> $
>
>I believe that reproduces the problem:
>
> $ svn cat -r1 wc/f
> foo
> $ svn cat -r2 wc/f
> bar
> $ cat wc/f
> zigzag
> $ sqlite3 wc/.svn/wc.db "select translated_size from nodes where local_relpath='f'"
> 7
> $ svn st wc
> $
> $ touch wc/f # to break timestamp
> $ svn st wc
> M wc/f
>
>To fix this we would need to have the client's post-commit processing do
>a full-text comparison to catch modifications before storing the
>timestamp/size in .svn/wc.db. Avoid a race is a bit tricky, perhaps:
>
> 1) stat() to get timestamp/filesize
> 2) full-text compare to ensure text is unchanged
> 3) stat() to ensure timestamp/filesize is unchanged
> 4) store timestamp/filesize

I'm not familiar these days with the current archicture of libsvn_client and libsvn_wc, but just in principle, there should be an easier way to do this than the above, without re-comparing full-texts [1] (or, equivalently, re-calculating the hash).

When the client sends the file for commit, have it remember the timestamp, file size, and hash of the working file (as of the exact version that was used for the commit -- and if the file is being streamily appended to during the commit, or something like that, well, then just remember the relevant values for what was sent in the commit). Then during commit finalization, just store that remembered metadata, *not* metadata derived from the possibly-now-changed working file.

In other words, why isn't the commit process just taking a data-and-metadata "snapshot" of each working file, and using that snapshot for both the commit and the post-commit bookkeeping on the client side?

If the client were to do that, then if a working file gets modified during the commit, the file would naturally show up as modified afterwards without any special checks like your step (3) above. (I guess yet another way to say it is: your steps 1-4 are fine, but they should all happen as part of the commit, and all be done by the time the post-commit stage arrives.)

(Also, I thought this was how we were always doing things! But my memory is fuzzy, and/or things might have changed.)

Am I missing some subtlety about this?

Best regards,
-Karl

[1] In any case, a true full-text comparison should rarely be necessary. First we can look at the file size from the directory entry and see if it's as expected; in most cases it will differ if the contents differ, so that's the first "early out". Then we could look at the first 1024 (or whatever) bytes; many files, if they have changed, will show some change near the beginning, so that's a second "early out". I guess a third early-out would be to do an lseek into the middle and just see if the byte there is as expected? :-) But yes, eventually, a full-text comparison, i.e., a hash recalculation, may be necessary.
Received on 2017-07-27 22:28:03 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.