Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

From: Denis Kovalchuk <denis.kovalchuk_at_visualsvn.com>
Date: Sun, 3 May 2020 21:04:31 +0300

> For that matter, now that build-repcache is a thing, why should commits
> populate rep-cache synchronously at all? Why shouldn't we provide
> a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
> you; you're expected to do it yourself via a post-commit hook"? That'll
> sidestep the problem entirely. (Brane made a similar point a few days
> ago.)
>
> > 2) These commits will complete with a post-commit error.
>
> There are other ways to address this. For example, whenever the
> post-commit insertion to rep-cache.db fails, instead of reporting the
> error to the client, we could record the error in the server log but not
> report it to the client. After all, generally the client doesn't care
> about that error and can't do anything to fix it.

I am not too sure the moment of time when the rep-cache is populated and way to
report errors are currently causing problems. I think the root cause of the
problem is that the entire rep-cache verification runs with holding a SQLite
lock. And I propose to fix it by slightly modified approach in the patch so
that errors do not happen at all.

I think that with asynchronous rep-cache population the same errors will
happen, but at another time. Even if errors are hidden from clients,
administrators may not have a way to fix them.

> (There are things other than Â«verifyÂ» that can cause the "post-commit FS
> processing" rep-cache INSERT to time out and error out. For example,
> that's known to happen to someone who makes a small commit that just
> happens to occur right after a commit that has released the write lock
> but hasn't finished inserting to rep-cache.db. I don't have the issue
> number handy, sorry.)

If I understand correctly, this is the case when a large number of entries are
added in the rep-cache during the post-commit processing stage that leads to a
starvation for other additions. It looks like a separate problem that can be
fixed. And I assume that a similar approach can be used for this: add entries
in small batches, instead of adding all the entries in one transaction.

Also I think that the problem should occur less often than the mentioned one
with verification, because:
- During a commit, the lock is held only for the duration of adding new entries
to the rep-cache.
- During a verification, the lock is held for the duration of all content
verification, that can take a lot of time.

In general, It seems to me that the case of committing a large number of
changes may be more rare than just verifying a repository with a large
rep-cache. So I propose to fix the problem with verification first.

> > 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> > may work without deduplication.
>
> Sure, but why is it not sufficient to run build-repcache afterwards?

For example, 'svnadmin verify' can be checking rep-cache for a long time and
'svnadmin build-repcache' will keep completing with an error in this case. All
commits during this time will work without deduplication for new data.

> What do you mean, "at the right time"? Under what circumstances would
> the following not suffice? â€”
>
> set -e
> old=`svnlook youngest r`
> svnadmin verify r
> svnadmin build-repcache r -r ${old}:HEAD

If I am not mistaken, this assumes that only one maintenance operation can
happen with repository at one time. But in general this approach requires a
guarantee that there are no other running verifications, because otherwise
'svnadmin build-repcache' may fail with an error.

> Fair point, but the semantics of SQLite are what they are, and we have
> to work with that. (Unless you propose to replace rep-cache.db by some
> non-SQLite storage? Maybe SQLite isn't the right tool for the job after
> all, even if it's the right tool for wc.db? Or, alternatively, can we
> achieve finer-grained locking within SQLite?)

I do not think that there is a problem with using SQLite for these purposes and
I do not propose to replace it. I think the problem is how SQLite is used
during verification where unbounded statement currently leads to the problem
with concurrent writes to the rep-cache.

> When is it not possible to run Â«build-repcacheÂ» and Â«verifyÂ» sequentially?

For example, if there is no way to guarantee that another verification does not
run at this time.

> > 2) If new entries are added to the rep-cache.db during its verification, the
> > new approach has a guarantee that all entries that existed at the time of the
> > call of 'svnadmin verify' will be verified.
>
> Yes, this too. However, why would you assume in this case that just
> because every row was valid when it was accessed during verification,
> every row that was present when verification began is valid when the
> verification finishes?
>
> In the current code, the answer to that question is "Because the
> verification process took an SQLite lock, which made the table-reading
> atomic".
>
> I don't know what the answer to that question is with this patch. How
> can you be sure that, at a particular moment in time, every row in the
> table was valid simultaneously, if you don't read the table atomically?

We maintain separate cursor (sha1_digest) to the entries and walk them ordered.
The row values are immutable after they are added. So in new approach we will
walk at least all rows that existed at the time of the call of
'svnadmin verify' and with the same values.

The verification of a single row depends only on the values in that row. These
values are immutable. The revision content on fsfs is also immutable. Based on
this, I assume that if we do not find errors in accessed rows, there are no
errors in rows that existed at the time of the call of 'svnadmin verify'.

Regards,
Denis Kovalchuk
Received on 2020-05-03 20:04:53 CEST

This message: [ Message body ]
Next message: Johan Corveleyn: "Re: Problems running testsuite on Windows with Python 3"
Previous message: Yasuhito FUTATSUKI: "Re: Problems running testsuite on Windows with Python 3"
In reply to: Daniel Shahaf: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"
Next in thread: Daniel Shahaf: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"
Reply: Daniel Shahaf: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]