Fixing the FSFS corruption bug

From: Malcolm Rowe <malcolm-svn-dev_at_farside.org.uk>
Date: 2006-09-05 22:29:27 CEST

I think I have an idea about what's causing the FSFS corruption problem.

I've managed to produce a reproduction program that successfully commits
a corrupt transaction, and contrary to what I was expecting, it doesn't
involve threaded races, disk write errors, or ignoring API error returns.

Instead, it simply avoids closing the representation write stream,
and continues to write a new representation, like so:

  open transaction
    get representation write stream
      write some of the representation
  open transaction
    get representation write stream
      write all of the representation
    close representation write stream
    commit transaction

(Reopening the transaction isn't necessary, but I'm trying to emulate
what something that operates using multiple requests - like mod_dav_svn
- might be doing. Also note that I use svn_fs_apply_text() to get a
representation stream for simplicity, but I have no reason to assume that
svn_fs_apply_textdelta() wouldn't work as well. Note that the latter
explicitly has a documented prohibition against doing anything with the
transaction until the stream is closed, though it won't currently error
if you violate this constraint).

Importantly, I _haven't_ yet proved that mod_dav_svn (or any other caller)
is behaving in this way, but the corruption I'm getting from my program
_exactly_ matches the type of corruption we've seen reported.

I hypothesise that something is causing mod_dav_svn to stop writing the
representation (perhaps the client connection is terminated?) and then
the representation is resent (somehow), causing the problem shown above.

(The specific corruption is fairly straightforward, incidentally: the
write buffer associated with the first file handle sticks around until
the pool containing it is cleared, whereupon it's flushed to disk and
overwrites the start of the following representation.)

I've attached my sample program, and an initial attempt to fix the
problem. The method I've used is to create a new 'rev-lock' file within
the transaction directory for the duration of the representation write.
Then, whenever we need to write to the proto-rev file, we check whether
the rev-lock file exists, and bail if so.

Using a file-based locking method doesn't appear to cause any performance
degradation (though I'd appreciate someone checking on Win32, since I
believe it tends to have different performance characteristics in this
area), and it also has the benefits of being very simple, and, as a
bonus, preventing a partially-written transaction from being committed
(after e.g. rebooting during a write operation).

Of course, we won't be protected against users of other clients that don't
respect this lock, but I believe that most of the problems seem to occur
while using mod_dav_svn, so that's not such an issue. (The alternative
would be to bump the fs format number, but that seems excessive).

Thoughts?

[[[
* subversion/include/svn_fs.h
  (svn_fs_apply_text): Document the requirement that the caller close
    the returned stream before further operating on the transaction,
    using the same wording as svn_fs_apply_textdelta() already provides.

* subversion/include/svn_error_codes.h
(SVN_ERR_FS_REP_BEING_WRITTEN): New.

* subversion/libsvn_fs_fs/fs_fs.c
  (PATH_REV_LOCK, path_txn_proto_rev_lock): New.
  (get_writable_proto_rev): New helper function, locks the proto-rev
    file and returns a handle to it.
  (unlock_proto_rev): New helper function, unlocks the proto-rev file.
  (rep_write_get_baton): Call get_writable_proto_rev() to open the
    proto-rev file.
  (rep_write_contents_close): Call unlock_proto_rev() to unlock the
    proto-rev file.
  (commit_body): Call get_writable_proto_rev() to open the proto-rev file.

* subversion/libsvn_fs_fs/structure
Add documentation for the new (temporary) rev-lock transaction file.
]]]

Regards,
Malcolm

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

text/plain attachment: break-fsfs.c

text/plain attachment: fsfs-corruption-fix.patch

Received on Tue Sep 5 22:33:01 2006

This message: [ Message body ]
Next message: Daniel Rall: "[proposal] Handling of contested changes near release finalization"
Previous message: Matthias Miller: "[PATCH] issue #850: remove references to internal APR headers"
Next in thread: Malcolm Rowe: "Re: Fixing the FSFS corruption bug"
Reply: Malcolm Rowe: "Re: Fixing the FSFS corruption bug"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]