
Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Fri, 1 May 2020 23:29:41 +0000

Denis Kovalchuk wrote on Sat, 02 May 2020 00:28 +0300:
> > So the failure mode is that rep-cache.db entries are not able to be
> > added while rep-cache.db is being verified. I am not convinced that
> > this is a problem.
>
> I think that this case leads to multiple problems:
> 1) Commits happening at the same time as verify will hang for the duration of
> sqlite busy timeout.

Yeah, they hang for 10 seconds:

[[[
subversion/libsvn_subr/sqlite.c|218 col 22| #define BUSY_TIMEOUT 10000
subversion/libsvn_subr/config_file.c|1552 col 50| "### returning an error. The default is 10000, i.e. 10 seconds." NL
subversion/libsvn_subr/config_file.c|1554 col 27| "# busy-timeout = 10000" NL
]]]

Would it help to make this configurable? The BUSY_TIMEOUT pragma is
per-process so the admin can't just set it out of band, and the above
"busy-timeout" config file knob is used by libsvn_wc only, so this is
not currently configurable.

For that matter, now that build-repcache is a thing, why should commits
populate rep-cache synchronously at all? Why shouldn't we provide
a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
you; you're expected to do it yourself via a post-commit hook"? That'll
sidestep the problem entirely. (Brane made a similar point a few days
ago.)
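As a sketch of that mode (a hypothetical hook helper, not an existing SVN API): the post-commit hook would simply invoke «svnadmin build-repcache» for the revision just committed, so the commit itself never touches rep-cache.db:

```python
import subprocess

def post_commit_repcache(repo_path, rev, dry_run=False):
    """Hypothetical post-commit hook body: populate rep-cache.db
    asynchronously instead of inside svn_fs_commit_txn().
    With dry_run=True, just return the command line that would run."""
    cmd = ["svnadmin", "build-repcache", repo_path, "-r", str(rev)]
    if dry_run:
        return cmd
    # In a real hook, a failure here would go to the server log,
    # not back to the committing client.
    return subprocess.run(cmd, check=True)

cmd = post_commit_repcache("/srv/repo", 1234, dry_run=True)
```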

> 2) These commits will complete with a post-commit error.

There are other ways to address this. For example, whenever the
post-commit insertion into rep-cache.db fails, we could record the error
in the server log rather than report it to the client. After all, the
client generally doesn't care about that error and can't do anything to
fix it.

Note: All the above suggestions, for both (1) and (2), would fix _all_
causes of (1) and (2), whereas your proposal would prevent (1) and (2)
only when they are caused by a concurrent «verify» operation, but not
otherwise.

(There are things other than «verify» that can cause the "post-commit FS
processing" rep-cache INSERT to time out and error out. For example,
that's known to happen to someone who makes a small commit that just
happens to occur right after a commit that has released the write lock
but hasn't finished inserting to rep-cache.db. I don't have the issue
number handy, sorry.)

> 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> may work without deduplication.

Sure, but why is it not sufficient to run build-repcache afterwards?

> Repository verification can be a regular procedure and can take a long time for
> large repositories. And in that case the mentioned problems can also happen
> regularly.
>
> > The obvious workaround, in case someone runs into that, is for the
> admin to run build-repcache afterwards. (Incidentally, we might want to
> have the "post-commit FS processing failed" error message namedrop
> > «svnadmin build-repcache» when the underlying error code indicates an
> > SQLite error.)
>
> I assume that the patch fixes the problem in such a way that it does not
> require any additional workarounds and does not worsen any other
> characteristics.

> At the same time, I would like to note that the workaround does not
> fix all problems. It fixes only 3) and only if it is called at the
> right time.

What do you mean, "at the right time"? Under what circumstances would
the following not suffice? —

    set -e
    old=`svnlook youngest r`
    svnadmin verify r
    svnadmin build-repcache r -r ${old}:HEAD

> > Yes, build-repcache will error out in this situation. Why is that a problem?
>
> I think this is a problem for similar reasons. It may be unexpected that the
> read-only verify operation can lead to an error when another maintenance
> operation such as 'svnadmin build-repcache' is called.

Fair point, but the semantics of SQLite are what they are, and we have
to work with that. (Unless you propose to replace rep-cache.db by some
non-SQLite storage? Maybe SQLite isn't the right tool for the job after
all, even if it's the right tool for wc.db? Or, alternatively, can we
achieve finer-grained locking within SQLite?)

However, the foremost consideration for «svnadmin verify» is correctness.
Unintuitive behaviour can be dealt with via documentation, informative
error messages, and so on.

> To completely avoid failures, it may be necessary to separate
> operations by time, and this is not always possible.

When is it not possible to run «build-repcache» and «verify» sequentially?

> > What effect does the patch have on _correctness_ of verification? What
> > kinds of corruptions can be detected by the incumbent code but not with
> > your patch?
> >
> > Isn't there an alternative approach that does not affect the correctness
> > of verification as much as this approach?
>
> The patch does not change the verification process, only the process of
> fetching entries. In the previous approach entries were fetched during one
> large query and now they are fetched in small batches.
>
> In this way:
> 1) If rep-cache.db is not changed during its verification, the new approach
> should be equivalent to the previous one.

Agreed.
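To make the comparison concrete, the batched approach can be sketched like this (hypothetical schema; keyset pagination on the primary key, with each batch read in its own short-lived query rather than one table-spanning transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rep_cache (hash TEXT PRIMARY KEY, rev INTEGER)")
conn.executemany("INSERT INTO rep_cache VALUES (?, ?)",
                 [("h%03d" % i, i) for i in range(10)])
conn.commit()

def fetch_in_batches(conn, batch_size=3):
    """Yield every row in primary-key order, one short SELECT per batch,
    so no single read transaction spans the whole table."""
    last = ""
    while True:
        rows = conn.execute(
            "SELECT hash, rev FROM rep_cache WHERE hash > ?"
            " ORDER BY hash LIMIT ?", (last, batch_size)).fetchall()
        if not rows:
            break
        yield from rows
        last = rows[-1][0]

hashes = [h for h, _ in fetch_in_batches(conn)]
```

Because each batch resumes strictly after the last key seen, every row that existed when the scan started is visited; rows inserted behind the scan's current position are what the lock no longer protects against.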

> 2) If new entries are added to the rep-cache.db during its verification, the
> new approach has a guarantee that all entries that existed at the time of the
> call of 'svnadmin verify' will be verified.

Yes, this too. However, why would you assume in this case that just
because every row was valid when it was accessed during verification,
every row that was present when verification began is valid when the
verification finishes?

In the current code, the answer to that question is "Because the
verification process took an SQLite lock, which made the table-reading
atomic".

I don't know what the answer to that question is with this patch. How
can you be sure that, at a particular moment in time, every row in the
table was valid simultaneously, if you don't read the table atomically?

> So I think that it should not affect the correctness of verification.

I hope it doesn't, but I also know that it's all too easy to overlook
bugs in concurrent code that uses non-atomic operations.

For example, back when 1.7.0 was in the pre-release stage we had to
remove an entire new SQLite-using feature because we all overlooked
a concurrency bug in it until it was pointed out to us. Nowadays our
concurrency correctness requirements are just as stringent as then, but
we have far fewer eyeballs around to find bugs with.

Cheers,

Daniel
Received on 2020-05-02 01:29:54 CEST

This is an archived mail posted to the Subversion Dev mailing list.
