Re: Queries about rep cache: get_shared_rep()

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Fri, 29 May 2015 09:26:42 +0200

On Wed, May 27, 2015 at 6:35 PM, Julian Foad <julianfoad_at_gmail.com> wrote:
> Stefan Fuhrmann wrote:
>> Alright. I gave it a bit more thought now.
>>
>> Whenever we encounter this mismatch, something pretty
>> bad likely happened to the repo - such as a failed restore
>> attempt. In turn, we can expect those situations to be
>> very rare - which means we can afford some disruption
>> for the user.
>>
>> I suggest that we do 3 things:
>>
>> * log the warning - for future reference, for being picked
>> up by monitoring tools etc.
>
> We already do that.

Oh, absolutely. I just didn't mention it.

>> * clear the rep-cache.db
>
> Clearing the cache and continuing operation may make subsequent
> commits much larger than they should be, and there is no easy way to
> undo that if it happens.

Rep-sharing typically reduces the repo size by 25% (e.g. Apache)
to 60% (WordPress, inexperienced users using plain ADD for tags).

Assuming that most rep-sharing is relatively local, i.e. over the
span of a "few" revisions, e.g. due to catch-up merges between
branches, most of the inefficiency will only be temporary.
In short: no major impact.

> Attempting to clear the rep cache might itself fail in some way,
> depending on what kind of corruption has happened to it. It would also
> destroy the evidence of what went wrong.

That is a good point. Two good points, actually.

>> * fail the current commit
>>
>> That way, we can be quite sure that only valid data gets
>> committed.
>
> Failing the current commit will ensure that no potentially bad (but
> undiagnosed) response from the rep cache has already been used in an
> earlier part of the transaction. I suppose that's what you're thinking
> of. That makes sense to me.

Yes, that, and the rep cache is also being used to validate the
incoming data - even if it is very unlikely that we mess up the
server-side SHA1 calculation of the fulltext stream.

>> Alternatively, we could block any commit
>> (inventing some new repo state) until the admin resolves
>> the situation manually. Not sure which one I would prefer.
>
> I suggest this is the best option, unless we specifically design and
> the administrator specifically chooses an option to have higher
> availability at the expense of disk space, fault diagnosis, and so on.

We could add a "continue-upon-failure" option to the
[rep-cache] section in fsfs.conf. Default would be "false".
If set to true, commits would not be held off by rep-cache
failures but the rep-cache would be disabled. If set to
false, the repo goes into a r/o state.
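
To make the proposal concrete, here is a minimal Python sketch of how
such an option could behave. Neither the "continue-upon-failure" key
nor its handling exists today; load_policy, handle_rep_cache_failure
and RepoState are names invented for illustration, not Subversion APIs.
The commit that hit the mismatch is assumed to fail in both cases; the
option only decides what happens to later commits.

```python
import configparser
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoState:
    commits_allowed: bool      # False means the repo is read-only
    rep_cache_enabled: bool


def load_policy(conf_text: str) -> bool:
    """Return the hypothetical continue-upon-failure flag (default: False)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean("rep-cache", "continue-upon-failure", fallback=False)


def handle_rep_cache_failure(continue_upon_failure: bool) -> RepoState:
    """Repo state after a rep-cache mismatch has been detected.

    The commit that ran into the mismatch is assumed to have failed already.
    """
    if continue_upon_failure:
        # Higher availability: keep accepting commits, but without rep-sharing.
        return RepoState(commits_allowed=True, rep_cache_enabled=False)
    # Default: go read-only until the admin resolves the situation.
    return RepoState(commits_allowed=False, rep_cache_enabled=False)


# Example: an admin who explicitly opted in to higher availability.
conf = "[rep-cache]\ncontinue-upon-failure = true\n"
print(handle_rep_cache_failure(load_policy(conf)))
```

Defaulting to "false" when the key or the whole section is missing
keeps the safe behaviour for repositories whose fsfs.conf predates the
option.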

>> On top of that, we should handle the other rep-cache.db
>> consistency checks (e.g. head vs. rev of latest entry)
>> the same way.
>
> That makes sense.
>
> I suggest all of this should be treated as a possible future
> enhancement, not anything urgent.

I agree. In particular because it will require a format bump
for putting the "r/o" or "corruption" indicator somewhere.
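
For the "head vs. rev of latest entry" check mentioned above, a rough
Python sketch follows. It assumes rep-cache.db is an SQLite file with a
rep_cache table that has a revision column, and it uses a simplified
schema for the demo; latest_cached_revision and rep_cache_is_consistent
are made-up names, and reading the head revision from the repository is
left to the caller.

```python
import sqlite3
from typing import Optional


def latest_cached_revision(db: sqlite3.Connection) -> Optional[int]:
    """Highest revision referenced by any rep-cache entry, or None if empty."""
    row = db.execute("SELECT MAX(revision) FROM rep_cache").fetchone()
    return row[0]


def rep_cache_is_consistent(db: sqlite3.Connection, head: int) -> bool:
    """The cache must not reference revisions newer than the repo's head."""
    latest = latest_cached_revision(db)
    return latest is None or latest <= head


# Demo on an in-memory database with an assumed, simplified schema.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE rep_cache (hash TEXT PRIMARY KEY, revision INTEGER NOT NULL)"
)
db.execute("INSERT INTO rep_cache VALUES ('aa', 41)")
db.execute("INSERT INTO rep_cache VALUES ('bb', 42)")

print(rep_cache_is_consistent(db, head=42))
print(rep_cache_is_consistent(db, head=40))
```

A cache entry pointing past head is the sort of thing one might see
after a restore that brought back older revision files but kept a newer
rep-cache.db.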

-- Stefan^2.
Received on 2015-05-29 09:26:58 CEST
