Re: [PATCH] Fix "database is locked" error when committing during rep-cache.db verification

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 7 May 2020 21:58:40 +0000

Denis Kovalchuk wrote on Tue, 05 May 2020 14:35 +0300:
> > Hang on. In your previous reply you wrote that commits that "complete
> > with a post-commit error" was a problem the patch was designed to solve,
> > but now you say that's not an immediate concern. Which is it? We can't
> > design a solution if it's not clear what set of problems the solution is
> > supposed to solve.
>
> I don't think there is a contradiction somewhere. What I was trying to say is:
> - The problem is that rep-cache verification leads to errors for other writes
> to the rep-cache. For example, this happens during commit post-processing.
> - One visible consequence of this problem is that commits can hang.
> - Another visible consequence is that commits complete with post-commit errors.
> - Another visible consequence is that new data will not be deduplicated.
> - I think the root cause of problem is that entire rep-cache verification runs
> with holding a SQLite lock for a long time.
> - I don't think that there are issues with the existing way to report
> post-commit errors to clients. Also, I don't think that changing it would be
> relevant to the problem.

Thanks for clarifying this. However, given that you simultaneously say
that commits terminating with errors is a "consequence" of the problem,
and that you think fixing that "[wouldn't] be relevant to the problem",
I don't understand exactly which problems you had _set out_ to fix; i.e.,
which problems' fixings were _a priori_ design goals of the patch, as
opposed to other effects of the patch, which, nice-to-have though they
may be, are dispensable insofar as your use-case is concerned.

For example, my understanding that "don't spam clients with post-commit
FS processing errors when a commit is concurrent with a «verify»" falls
into the latter class.

What I'm trying to do here is to get a _problem statement_ to which your
patch is a solution, so we can consider other solutions. You wrote
above
.
> rep-cache verification leads to errors for other writes to the rep-cache
.
which is accurate, but doesn't explain _why_ that is a problem (other
than in terms of the three listed consequences, but you also said fixing
at least one of _those_ wouldn't address your use-case); and you also
wrote
.
> rep-cache verification runs with holding a SQLite lock for a long time
.
which is an accurate description of the implementation, but is not
a problem statement.

I hope I don't sound like a broken record here, but I do think getting
a clear problem statement is important. Importantly, the problem
statement should be made _without reference to the patch_. Explain
where the incumbent code falls short and in what way.

> > > For example, 'svnadmin verify' can be checking rep-cache for a long time and
> > > 'svnadmin build-repcache' will keep completing with an error in this case. All
> > > commits during this time will work without deduplication for new data.
> >
> > I take it that in your use-case it's important that a commit's rep-cache rows
> > should be added as soon after the commit as possible, then?
>
> I think it is important that entries don't get lost and do not make
> deduplication worse.

Nobody is proposing that entries be lost. The question is simply how
soon after the commit entries are required to be added to the rep-cache.
The question is _when_, not _whether_.

> The rep-cache verification can take a long time for real-world
> reportistories, for example an hour or more. And all commits during
> this time will work without deduplication for new data and I think
> that currently there is no way to fix it.

Are you being facetious? *You* wrote the build-repcache command, which
is the "way to fix it".

> In case of regular verifications, this can be a significant issue.
>
> > Yes, and it's up to the repository administrator to provide such
> > guarantees when scheduling «verify» and «build-repcache» operations on
> > their repository.
>
> I think that in general it may be difficult to have such guarantees. For
> example, verification can be performed with the API or by third-party tools.

I'm sorry, but I am not convinced that a repository administrator is
unable to control when their own repository gets verified — even if they
use third-party tools. Furthermore, were I so convinced, my next
question would be why locking can't be arranged (either by the
administrator, or in libsvn) to prevent «verify» and «build-repcache»
from running concurrently.

As to your implication that an administrator would be unable to control
access to their own repository because they use the API as opposed to
the CLI, I am having trouble imagining a scenario in which that would be
the case. Would you illuminate me?

> And based on the above it would mean that it is necessary to also prohibit
> commits during verification.

Commits need not be prohibited during verifications. (Besides, such
a requirement would constitute an incompatible API change unless
accompanied by a format bump.)

> The limitation seems too strong and so I think it would be better to
> fix the problem with holding a SQLite lock during entire verification.
>
> > For example, how about sharding rep-cache.db to multiple separate
> > databases, according to the most significant nibbles of the hash value?
> > This way, instead of having one big database, we'd have 2**(4N) smaller
> > databases.
>
> If I am not mistaken, the sizes of such shards will still be O(N), but not
> O(1), as in the patch. Therefore, this problem can still occur at some
> rep-cache sizes.

(Please don't overload variables; that makes conversation needlessly
difficult.)

With the proposal, the number of shards will be O(number of reps /
2**(number of significant bits used to identify a shard)). That's an
inverse exponential in the number of bits used to identify a shard, so
if the shards grow too large in the admin's opinion, the admin simply
has to double the number of bits used to identify a shard.

> > I understand this argument. However, at the end of the day, the patch
> > makes «verify» read a database as it's being written to, which is more
> > likely to run into bugs (either in our code or in SQLite's), so I'm not
> > a fan of it.
> >
> > Maybe I'm being overly careful.
>
> I think that concurrent reading and writing to rep-cache.db is already possible
> in the case of concurrent commits. It is a bit unclear to me why the (optional)
> verification has to be different from the commit in this aspect.

Because «verify» exists to verify invariants do in fact hold, and
it is not obvious that can be accomplished (without false negatives) by
reading the database as it's being modified.

> Especially considering that the current approach leads to the problem.

I'm not disputing the problem. I'm concerned about the proposed
solution and asking to consider alternatives.

> > P.S. How about using creating a temporary copy of rep-cache.db, and then
> > have «verify» iterate that at its leisure and delete it when done with it?
>
> I think that compared to the patch this approach has a problem that every
> verify will lead to a lot of write operations. For example, the size of
> rep-cache.db in the ASF repository is 1.1 GB.
>

Yes, but on the other hand:

- The writes can be done on a RAM disk
(or maybe an atomic filesystem snapshot can be taken?)

- The chances of «verify» false negativing will be lower

(Also, there's the option of rsync'ing the repository somewhere and
running «verify» on the copy. Or the shards solution mentioned above.)

> Just in case, I will be happy to work on any specific things in the patch, if
> there are some things you think can be improved.

I read the patch when you posted it. IIRC, it correctly implements the
approach you chose. However, I am sceptical about that approach. As
mentioned above, I don't understand why you would be certain that, once
«verify» ends, every single row in the database would be readable and
would emit correct values, even granting that every single row in the
database had that property at some point during the run.

Yes, SQLite promises us that that'd be the case, but why should
«verify», of all places, rely on that promise?

Cheers,

Daniel
Received on 2020-05-07 23:58:57 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]