Johan Corveleyn wrote on Fri, 22 May 2020 10:02 +0200:
> On Fri, May 22, 2020 at 12:16 AM Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> >
> > Johan Corveleyn wrote on Thu, 21 May 2020 17:53 +0200:
> > > Trying to summarize this thread a bit. I apologize in advance if I
> > > forgot something, or have misrepresented any of the points that were
> > > raised (feel free to correct / add).
> > >
> >
> > I have additions, below.
> >
> > > Denis summed up the following problems that might happen while
> > > 'verify' locks the repcache.db:
> > >
> > > > 1. a post-commit error "database is locked"
> > > > 2. new representations will not be added in the rep-cache.db
> > > > 3. deduplication does not work for new data committed at this time
> > > > 4. commits work with delays.
> > >
> > > We have also established that the new tool build-repcache is not
> > > suitable for post-factum fixing of 3). It does not reprocess already
> > > committed revisions.
> >
> > Note, however, that (3) is simply an observation that Denis made about
> > the semantics of the incumbent code. We do not know of a user who needs
> > rep-cache.db entries to be added so soon after the commit that running
> > «build-repcache» after «verify» finishes would be too late (or impractical).
> >
> > > We are currently considering two approaches to address these issues:
> > >
> >
> > Hold on. Those issues are what happens *whenever* the post-commmit FS
> > processing INSERTs are starved. They happen when a «verify» starves
> > INSERTs, but they are known to have at least one other case: when two
> > commits happen in quick succession, the INSERTs that happen during
> > post-commit FS processing of the first commit can starve the second
> > commit's INSERTs. (Don't we have a jira issue for this variant? I
> > thought we did, but my searches came up dry.)
> >
> > Which is to say, the presentation of the matter as "four issues to be
> > fixed" isn't accurate. There are _eight_ issues to consider: (1) to (4)
> > when they are caused by a «verify» starving INSERTs, and (1) to (4) when
> > they are caused by an «INSERT» starving INSERTs.
> >
> > Thus, your analysis of the pros and cons is incomplete: it glosses over
> > the case of an «INSERT» starving other INSERTs.
>
> Good point. So those "4 problems" are all consequences of "rep-cache
> insertion starvation". One possible cause of this might be 'verify',
> but there are other known causes, such as 'locked by another commit'.
> I'm not sure how likely that last form of starvation is in practice,
> though (but if you've seen a jira issue of it, hmmm, maybe).
>
It would happen on /repos/asf every time certain PMCs commit a new
release's javadocs. (I'm not sure whether it still does; those PMCs
may have switched to git since I last looked.)
> > Furthermore, over the course of the thread several other ideas have been
> > floated, intended to address various subsets of the eight issues. The
> > "Perform INSERTs asynchronously" idea, for example (raised by me, but
> > inspired by a remark of Brane's), can basically fix all eight issues by
> > itself. There was also the idea to stop marshalling post-commit FS
> > processing errors to the client, which would fix both variants of (1)
> > but none of the other six. I think there were other ideas as well, but
> > I've run out of time for this thread for tonight, sorry.
>
> Ack.
> So IIUC these are addressing the consequences of the starvation (which
> might help regardless of the cause), rather than taking away the
> locking-causes.
>
Some and some.
- The OP is preventative for the «verify» cause and a no-op for other
causes.
- Sharding rep-cache is preventative: it effectively makes the critical
section shorter in duration, for all existing use-cases.
- Performing INSERTs asynchronously (and polylithically, i.e., in
N O(1)-sized transactions) is preventative for the cases of INSERTs
starving either other INSERTs or readers, and a no-op for the case of
readers starving INSERTs.
(Speaking of which, here's another thought: add an option flag to
build-repcache that makes it wait indefinitely in case it's locked out
of rep-cache.db by a concurrent process. Denis, WDYT?)
- Not marshalling post-commit FS processing errors to the client is
purely a failure mode handling thing.
- ⋮
> I think both 'angles' are quite useful. We might even want to combine
> ... mitigating the consequences, as well as avoiding some of the
> causes for the starvation (because even with consequence-mitigation
> things might not turn out ideal, or only be mitigated at the cost of
> X, so fixing some known causes can still be valuable).
>
Yeah, for example, the "don't marshal" thing probably makes sense
regardless of anything else. There's a reason GitHub doesn't page its
users when one of the disks in its RAID arrays needs replacement…
Cheers,
Daniel
Received on 2020-05-23 03:45:58 CEST