Re: race condition on checkin immediately followed by a merge?

From: Dan Stromberg <dstromberglists_at_gmail.com>
Date: Thu, 20 May 2010 11:02:20 -0700

Hi.

On Wed, May 19, 2010 at 11:28 PM, Daniel Widenfalk
<Daniel.Widenfalk_at_iar.se> wrote:
> Dan Stromberg wrote:
>> On Tue, May 18, 2010 at 1:20 AM, Bert Huijben <bert_at_qqmail.nl> wrote:
>>
>>>> Subversion uses a simple heuristic to determine if a file has
>>>> changed. It is well documented in the source code but perhaps
>>>> not so well known:
>>>>
>>>> If the size and time stamp of a file are the same as when checking
>>>> out the file it is immediately considered as unchanged. We had a
>>>> similar issue where an automatic script failed to commit an updated
>>>> file.
>>>>
>>>> This is most easily resolved by adding a small delay between checking
>>>> out and performing the modification. We used 2 seconds. You need to
>>>> wait long enough for your operating system file time stamp to tick
>>>> one step and on Windows this means about 2 seconds to be sure.
>>> On Windows and most operating systems this depends on the filesystem being
>>> used.
>>>
>>> On NTFS you should never see issues like these as the timestamps have 100
>>> nanosecond precision. On FAT (including FAT32, but not including the newer
>>> VFAT) however you *do* see this issue.
>>> (On linux: Ext3 has 1 second precision and Ext4 has nanosecond precision.
>>> HFS+ on the mac has 1 second precision)
>>>
>>> Subversion tries to compensate for 1 second precision filesystems, but this
>>> is not guaranteed accurate for network filesystems and systems that use 2
>>> second precision. (It waits until the time has changed to the next second
>>> before returning from these commands)

>> Folks seem to be saying that this is an issue in the SVN client's
>> working copy filesystem, not the SVN server's repository filesystem.
>> Is that correct?
>
> I may have misinterpreted your original problem.
>
> What I explained results in files not being committed at all!
> Your problem seems that a successful commit is not immediately
> available. Correct?

Yes, that is correct. I make a change in a WC, check in the change
from this WC, merge it into a different branch's WC, and then attempt to
check in the change from that 2nd WC. The merge and second checkin give
back successful exit codes (0), but sometimes the 2nd WC doesn't have
the change. A subsequent checkout of the 2nd branch, however, sees the
change in both the source and destination branches.

All of this is done automatically, so it's about as fast as we can get
it to go - written in CPython. Steps that require talking to the
server take a while though.

>> What part of this process would be dependent on the filesystem's
>> timestamp resolution:
>>
>> 1) Check out file n from srcurl rev 1000 to working copy 1, wc1
>> 2) Pull the sole value out of n - an integer
>> 3) Increment the integer
>
> Here is where you'd need to wait a small amount of time before
> updating the file.

I'm starting to wonder if a sleep shouldn't go at 1.5 above - after
the checkout, before the changes are made - so that the changes
clearly cause a newer timestamp.
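
To make the failure mode concrete, here is a minimal sketch of a
size-plus-timestamp "unchanged" check of the kind described above. It is
not Subversion's actual code: `snapshot`, `looks_unchanged` and `demo` are
hypothetical names, and the mtime is truncated to whole seconds to imitate
a filesystem with 1 second resolution.

```python
import os
import tempfile


def snapshot(path):
    """Record (size, mtime truncated to whole seconds) for a file."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def looks_unchanged(path, recorded):
    """Hypothetical stand-in for a size-and-timestamp 'unchanged' heuristic."""
    return snapshot(path) == recorded


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "n")
        with open(path, "w") as f:
            f.write("7\n")
        # Pin the mtime so the demo does not depend on the wall clock.
        os.utime(path, (1000, 1000))
        recorded = snapshot(path)

        # Increment 7 -> 8: same file size, and (simulated) same second.
        with open(path, "w") as f:
            f.write("8\n")
        os.utime(path, (1000, 1000))
        missed = looks_unchanged(path, recorded)

        # The same edit landing one second later is detected.
        os.utime(path, (1001, 1001))
        detected = not looks_unchanged(path, recorded)
        return missed, detected
```

Note that the size half of the check only helps when the increment changes
the number of digits (9 -> 10, say); for most increments the timestamp is
the only thing that differs.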

>> 4) Check n back in to srcurl with the increased value to srcurl rev 1001
>> 5) Check out file n from dsturl rev 1000 to a new working directory,
>> wc2 (dsturl is a mirror of srcurl, but a little older)
>> 6) svn merge from srcurl at rev 1000 into wc2

Maybe the same freshness check applies here at step 7? So perhaps
we'd need to sleep after step 5 too?

>> 7) checkin modified wc2 to dsturl
>>
>> ?

I've modified my code for now to ensure that there is always a "new
second" starting after the end of any SVN command - so it sleeps up to
a second. I won't know if that's solved the problem for a while,
because I still cannot replicate it on demand; if I leave my failing
test running for a couple of days, runs 2..m apparently always
succeed. It's the 1st run that's relatively likely to fail, though it
often succeeds too.
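
A minimal sketch of that "wait for a new second" idea, assuming the local
clock and the filesystem's timestamps tick together (which may not hold on
network filesystems). The names `seconds_until_next_tick` and
`run_then_wait` are made up for illustration, and `resolution` is an
assumed parameter so that 2 second filesystems can be covered too.

```python
import math
import subprocess
import time


def seconds_until_next_tick(now, resolution=1.0):
    """Return how long to sleep so the clock reaches the next multiple
    of `resolution` seconds strictly after `now`."""
    next_tick = (math.floor(now / resolution) + 1) * resolution
    return next_tick - now


def run_then_wait(argv, resolution=1.0):
    """Run a command (e.g. an svn invocation), then sleep until a new
    timestamp tick has started. Returns the command's exit code."""
    result = subprocess.run(argv)
    time.sleep(seconds_until_next_tick(time.time(), resolution))
    return result.returncode
```

For example, `run_then_wait(["svn", "checkout", url, wc])` (with `url` and
`wc` supplied by the caller) would return only after the next whole second
has begun, so a following edit cannot share the checkout's timestamp.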

>> I'm guessing it's a matter of step 7 (near the beginning, when a
>> freshness check is applied to n) seeing the same mtime and file size
>> on file n as it had in step 5 (near the end, when n is written). Is
>> that correct?
>
> I think that it is only in step 3, e.g. when doing the original
> update on the file, that you can stumble onto the file-not-changed
> heuristics. The svn merge will most probably work as expected.

A merge is also a kind of file change... I'm thinking perhaps the
subsequent checkin would be subject to the same issue?

> What server setup are you using? svnserve or Apache? Are you using
> write-through proxy with replication to read-only mirrors? Re-reading
> your description it seems that it is step 6 that fails (no new
> revision to merge).

It's Apache (not svnserve) with https and a "digital wallet"
(public/private key) for authentication. The server is run by
Collabnet. I'm not sure about whether there's a proxy involved on
Collabnet's side, however, we're skipping our local web proxy. I'm
unaware of any (automatic) mirroring. I still don't know what kind of
MPM they have at Collabnet, but the fact that I get the same error
sometimes at home with apache+http+prefork+"no digital wallet" is
telling. At home, I have a pretty stock Ubuntu Subversion setup.

> Since both server solutions are multi-threaded it might be that the
> processing of the commit in step (4) above has not been fully
> completed (and mirrored?) in time for when you initiate step (6).
> What kind of timing do you have between the steps? I would guess
> that 1-2-3-4 and 5-6-7 are fairly fast and that there may be some
> time N between steps 4 and 5?

It's all as fast as possible (automatic, scripted in Python), but the
steps that require contacting the remote server take a bit longer,
because the server is far away from us in terms of network topology
(which presumably is increasing latency significantly, and the
protocol is probably pretty latency-sensitive).

I don't have root on the box I'm doing development on, but I'm going
to see if I can get someone with root to mount me a tmpfs on it, so I
can do my unit tests in it for a while. If I don't see the problem
there for a while... That would suggest that it is indeed timestamp
resolution that's the issue, as tmpfs appears to have finer resolution
than ext3.
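
One rough way to compare timestamp resolution between two mount points,
sketched under the assumption that repeatedly rewriting a small file and
reading back `st_mtime_ns` is representative. `observed_mtime_step_ns` is
a hypothetical helper, and the value it returns is only an upper bound on
the real granularity.

```python
import os
import tempfile
import time


def observed_mtime_step_ns(directory, samples=20, pause=0.005):
    """Rewrite a probe file `samples` times inside `directory` and return
    the smallest nonzero gap (in nanoseconds) between successive mtimes,
    or None if the mtime never advanced during the probe."""
    mtimes = []
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "probe")
        for i in range(samples):
            with open(path, "w") as f:
                f.write(str(i))
            mtimes.append(os.stat(path).st_mtime_ns)
            time.sleep(pause)
    gaps = [b - a for a, b in zip(mtimes, mtimes[1:]) if b > a]
    return min(gaps) if gaps else None
```

With the defaults the probe lasts about a tenth of a second, so a result
of None suggests a coarse (for example 1 second) resolution; raising
`samples` or `pause` until a gap appears gives an estimate of the step.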

It just occurred to me that if it does turn out to be a matter of
parallelism on the subversion server, we might be able to do commit
serialization with precommit and postcommit hooks by adding some
locking. It'd probably slow things down, but if it makes things more
reliable it might be worth it.
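
A minimal sketch of the locking half of that idea, assuming a lock file
that a pre-commit hook creates and a post-commit hook removes. The
function names and the lock path are hypothetical; how the hooks are wired
into the repository is not shown.

```python
import os
import time


def acquire(lock_path, timeout=30.0, poll=0.1):
    """Try to create `lock_path` exclusively. Return True once the lock
    is held, or False if it could not be taken within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can create the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
        else:
            # Record the owner's pid to help with manual cleanup.
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return True


def release(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

A pre-commit hook would call `acquire(...)` and exit nonzero on False; a
post-commit hook would call `release(...)`. One caveat with this design:
if a commit fails after the pre-commit hook has run, the post-commit hook
never fires, so a real setup would need some way to clear a stale lock.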

Thanks :)
Received on 2010-05-20 20:02:47 CEST
