Hi Stefan,
I recently stumbled upon this changeset and I had some time to review it. I
think that the things should not be done this way. As I see it, this change
breaks one fundamental aspect of the non-incremental hotcopy, adds an ugly
hack to the fs_hotcopy() layer and can lead to a potential deadlock between
'hotcopy' and 'pack' operations. See my comments inline.
> - hotcopy_setup_shared_fs_data(src_fs, dst_fs);
> - SVN_ERR(svn_fs_fs__initialize_caches(dst_fs, pool));
> + /* FS creation is complete. Stamp it with a format file. */
> + SVN_ERR(svn_fs_fs__write_format(dst_fs, TRUE, pool));
>
> return SVN_NO_ERROR;
> }
The whole idea of the non-incremental hotcopy is that it *does not* make the
destination visible until the hotcopy finishes. This is how the things worked
prior to this changeset and this is how the svn_repos layer currently works.
By writing the format file before actually doing something we do not only
expose our internal stub to the world (which really is a stub without even r0),
but will also leave this internal stub openable in case the hotcopy fails.
This stub does not even have to be a normal filesystem and we only need
it to bootstrap the hotcopy process, but we now behave as if it was!
Finally, the 'repos' and 'fs' layers now behave inconsistently in these terms,
i.e. svn_repos_hotcopy3() still writes the format file only when everything is
done.
> + SVN_ERR(svn_fs_fs__hotcopy_prepare_target(src_fs, dst_fs, dst_path,
> + incremental, pool));
> + dst_fs->fsap_data = NULL;
> +
> + /* Now, the destination repo should open just fine. */
> + SVN_ERR(fs_open(dst_fs, dst_path, common_pool_lock, pool, common_pool));
This is the hack I was talking about. Clearing FSAP_DATA here is only required
to pass the precondition check performed by svn_fs__check_fs() within the
fs_open() call. By the way, how should an arbitrary caller use the introduced
svn_fs_fs__hotcopy_prepare_target() function? Should he/she *always* clear
that field in order to be able to open the filesystem?
> +/* Wrapper around hotcopy_body taking out all necessary source repositories.
> + */
> +static svn_error_t *
> +hotcopy_locking_src_body(void *baton, apr_pool_t *pool)
> {
> - fs_fs_data_t *src_ffd = src_fs->fsap_data;
> - fs_fs_data_t *dst_ffd = dst_fs->fsap_data;
> + struct hotcopy_body_baton *hbb = baton;
> + fs_fs_data_t *src_ffd = hbb->src_fs->fsap_data;
>
> - /* The common pool and mutexes are shared between src and dst filesystems.
> - * During hotcopy we only grab the mutexes for the destination, so there
> - * is no risk of dead-lock. We don't write to the src filesystem. Shared
> - * data for the src_fs has already been initialised in fs_hotcopy(). */
> - dst_ffd->shared = src_ffd->shared;
> + return src_ffd->format >= SVN_FS_FS__MIN_PACK_LOCK_FORMAT
> + ? svn_error_trace(svn_fs_fs__with_pack_lock(hbb->src_fs, hotcopy_body,
> + baton, pool))
> + : hotcopy_body(baton, pool);
> }
Here is the potential deadlock. Say, someone runs 'svnadmin freeze svnadmin
hotcopy' and at the same moment 'svnadmin pack' is executed for the repository.
Packing takes the pack-lock for the whole operation, but sometimes also needs
to take a nested write-lock within the pack_shard() function. This is what
could happen:
1. freeze (takes the db/write-lock, inner 'hotcopy' still pending to execute)
2. pack (takes the db/pack-lock, blocks trying to acquire the db/write-lock)
3. hotcopy (blocks trying to acquire the db/pack-lock, **deadlock!**)
Personally, I also find this hunk cryptic due to the usage of the non-obvious
ternary operator and due to another added layer of the indirection (now
svn_fs_fs__hotcopy() calls hotcopy_locking_src_body() calling hotcopy_body()).
The comment is also misleading. What exactly are the "source repositories" in
this context?
Could you please fix these problems or entirely revert this change?
Regards,
Evgeny Kotkov
Received on 2014-07-28 02:28:21 CEST