Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Sun, 3 May 2020 23:00:04 +0000

Denis Kovalchuk wrote on Sun, 03 May 2020 21:04 +0300:
> > For that matter, now that build-repcache is a thing, why should commits
> > populate rep-cache synchronously at all? Why shouldn't we provide
> > a mode that says "svn_fs_commit_txn() won't populate the rep-cache for
> > you; you're expected to do it yourself via a post-commit hook"? That'll
> > sidestep the problem entirely. (Brane made a similar point a few days
> > ago.)
> >
> > > 2) These commits will complete with a post-commit error.
> >
> > There are other ways to address this. For example, whenever the
> > post-commit insertion to rep-cache.db fails, instead of reporting the
> > error to the client, we could record the error in the server log but not
> > report it to the client. After all, generally the client doesn't care
> > about that error and can't do anything to fix it.
>
> I am not too sure the moment of time when the rep-cache is populated and way to
> report errors are currently causing problems.

Hang on. In your previous reply you wrote that commits that "complete
with a post-commit error" was a problem the patch was designed to solve,
but now you say that's not an immediate concern. Which is it? We can't
design a solution if it's not clear what set of problems the solution is
supposed to solve.

Would you please state clearly the _complete_ set of problems which you
would like addressed (whether via the patch in the OP or otherwise)?
Once that is clarified, we can consider what options are available to
address them.

> I think the root cause of the problem is that the entire rep-cache
> verification runs with holding a SQLite lock. And I propose to fix it
> by slightly modified approach in the patch so that errors do not
> happen at all.
>
> I think that with asynchronous rep-cache population the same errors will
> happen, but at another time.

Yes, they will. Populating rep-cache asynchronously is not a silver
bullet: it will address root problems (1) and (2) [using the numbering
from your previous email], but not (3). To solve (3), additional steps
will have to be taken (whether ones I have proposed or others).

There is also a (4): populating rep-cache asynchronously will cause
commits to take less time, as measured by clients.

> Even if errors are hidden from clients, administrators may not have a
> way to fix them.

"Administrators may not have a way to fix them"? How might this happen?

You mentioned below that administrators may not be able to run
«build-repcache» because they won't be able to ascertain a window of
time during which «verify» won't run. If that's what you refer to here,
then see my response below.

> > (There are things other than «verify» that can cause the "post-commit FS
> > processing" rep-cache INSERT to time out and error out. For example,
> > that's known to happen to someone who makes a small commit that just
> > happens to occur right after a commit that has released the write lock
> > but hasn't finished inserting to rep-cache.db. I don't have the issue
> > number handy, sorry.)
>
> If I understand correctly, this is the case when a large number of entries are
> added in the rep-cache during the post-commit processing stage that leads to a
> starvation for other additions.

Yes.

> It looks like a separate problem that can be fixed. And I assume that
> a similar approach can be used for this: add entries in small batches,
> instead of adding all the entries in one transaction.

That's one option. Making all INSERTs asynchronous is another option
(and is not mutually exclusive of the first option). There may be yet
other options.

> Also I think that the problem should occur less often than the mentioned one
> with verification, because:
> - During a commit, the lock is held only for the duration of adding new entries
> to the rep-cache.
> - During a verification, the lock is held for the duration of all content
> verification, that can take a lot of time.
>

Sure.

> In general, It seems to me that the case of committing a large number of
> changes may be more rare than just verifying a repository with a large
> rep-cache. So I propose to fix the problem with verification first.

Sure.

(To be clear, I don't think that focussing on the issue of «verify»
starving INSERTs should rule out of consideration solutions that fix
_both_ that issue and the issue of INSERTs starving other INSERTs. It
would be fine to fix either issue without the other, certainly, but
solutions that fix both problems simultaneously should be considered
too.)

> > > 3) No entries will be added to rep-cache.db. Unless fixed, following commits
> > > may work without deduplication.
> >
> > Sure, but why is it not sufficient to run build-repcache afterwards?
>
> For example, 'svnadmin verify' can be checking rep-cache for a long time and
> 'svnadmin build-repcache' will keep completing with an error in this case. All
> commits during this time will work without deduplication for new data.

I take it that in your use-case it's important that a commit's rep-cache rows
should be added as soon after the commit as possible, then?

> > What do you mean, "at the right time"? Under what circumstances would
> > the following not suffice? —
> >
> > set -e
> > old=`svnlook youngest r`
> > svnadmin verify r
> > svnadmin build-repcache r -r ${old}:HEAD
>
> If I am not mistaken, this assumes that only one maintenance operation can
> happen with repository at one time. But in general this approach requires a
> guarantee that there are no other running verifications, because otherwise
> 'svnadmin build-repcache' may fail with an error.

Yes, and it's up to the repository administrator to provide such
guarantees when scheduling «verify» and «build-repcache» operations on
their repository.

On our part, we don't administer the repository; we merely provide tools
that allow the repository to be administered. We should make these
tools impose as few constraints as possible on the repository
administrator. At the same time, however, we should ensure that
«verify» detects any corruptions that may be present.

> > Fair point, but the semantics of SQLite are what they are, and we have
> > to work with that. (Unless you propose to replace rep-cache.db by some
> > non-SQLite storage? Maybe SQLite isn't the right tool for the job after
> > all, even if it's the right tool for wc.db? Or, alternatively, can we
> > achieve finer-grained locking within SQLite?)
>
> I do not think that there is a problem with using SQLite for these purposes and
> I do not propose to replace it. I think the problem is how SQLite is used
> during verification where unbounded statement currently leads to the problem
> with concurrent writes to the rep-cache.

I understand, and I stand by my questions. I would say that the root
problem here is that rep-cache.db is a monolithic design, which our
users outgrow as time passes. So, should we change to a more polylithic
design, in order to allow finer-grained locking?

For example, how about sharding rep-cache.db to multiple separate
databases, according to the most significant nibbles of the hash value?
This way, instead of having one big database, we'd have 2**(4N) smaller
databases.

> > When is it not possible to run «build-repcache» and «verify» sequentially?
>
> For example, if there is no way to guarantee that another verification does not
> run at this time.

See above.

> > > 2) If new entries are added to the rep-cache.db during its verification, the
> > > new approach has a guarantee that all entries that existed at the time of the
> > > call of 'svnadmin verify' will be verified.
> >
> > Yes, this too. However, why would you assume in this case that just
> > because every row was valid when it was accessed during verification,
> > every row that was present when verification began is valid when the
> > verification finishes?
> >
> > In the current code, the answer to that question is "Because the
> > verification process took an SQLite lock, which made the table-reading
> > atomic".
> >
> > I don't know what the answer to that question is with this patch. How
> > can you be sure that, at a particular moment in time, every row in the
> > table was valid simultaneously, if you don't read the table atomically?
>
> We maintain separate cursor (sha1_digest) to the entries and walk them ordered.
> The row values are immutable after they are added. So in new approach we will
> walk at least all rows that existed at the time of the call of
> 'svnadmin verify' and with the same values.
>
> The verification of a single row depends only on the values in that row. These
> values are immutable. The revision content on fsfs is also immutable. Based on
> this, I assume that if we do not find errors in accessed rows, there are no
> errors in rows that existed at the time of the call of 'svnadmin verify'.

I understand this argument. However, at the end of the day, the patch
makes «verify» read a database as it's being written to, which is more
likely to run into bugs (either in our code or in SQLite's), so I'm not
a fan of it.

Maybe I'm being overly careful.

Cheers,

Daniel

P.S. How about using creating a temporary copy of rep-cache.db, and then
have «verify» iterate that at its leisure and delete it when done with it?
Received on 2020-05-04 01:00:22 CEST

This message: [ Message body ]
Next message: Yasuhito FUTATSUKI: "Re: [PATCH] Fix differences of SWIG_PY_OPTS between Unix and Windows"
Previous message: Johan Corveleyn: "Re: Problems running testsuite on Windows with Python 3"
In reply to: Denis Kovalchuk: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"
Next in thread: Denis Kovalchuk: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"
Reply: Denis Kovalchuk: "Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]