On Tue, Dec 30, 2014 at 5:01 PM, Branko Čibej <brane@wandisco.com> wrote:
> On 30.12.2014 16:35, Ivan Zhakov wrote:
> > Just to make things clear. The thorough review was already provided.
> > In short, subpools are not a solution for the root cause of the
> > problem being discussed.
>
The last bit is where we disagree. I see more than one
issue here:
1. There is an L1 DAG node cache.
Problem: Added complexity.
2. It uses (now: used) ref counting to allow for copy-less access.
Problems: Lock management adds complexity. Binding lock
lifetime to pools triggered the memory consumption problem.
3. Poor isolation of API users from implementation details.
This is what allowed problem (2) to manifest outside FSFS.
4. Having single-pool API functions.
Problems: No clear distinction between data that needs to be
preserved (results) and temporaries. Attempts to tighten pool
usage will likely introduce bugs or break implicit / undocumented
assumptions. It is also unclear - without reading the docstrings
carefully - whether a function returns a reference to e.g. something
inside svn_fs_t or an object with an independent lifetime (see the
sketch below).
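For contrast, here is what a dual-pool variant could look like - a
hypothetical signature, merely following the result_pool /
scratch_pool convention we already use elsewhere:

/* Set *NODE_P to the node at PATH in FS.  Allocate it in
   RESULT_POOL with a lifetime independent of FS; use
   SCRATCH_POOL for temporary allocations only. */
svn_error_t *
get_node(dag_node_t **node_p,
         svn_fs_t *fs,
         const char *path,
         apr_pool_t *result_pool,
         apr_pool_t *scratch_pool);

With a single pool parameter, none of these guarantees can be read
off the signature.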
Complexity in (1) is mitigated by having a narrow interface (few
callers) and is justified by various advantages over the main
membuffer cache.
Problems from (2) should be solved as of r1648538. While DAG
nodes are still heavy objects, they have shrunk and flattened
since the 1.7 era when this code was written. Copying them
no longer causes much overhead.
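To illustrate the copy-based access (hypothetical names, not the
literal r1648538 code): instead of handing out a locked reference,
the cache returns a copy allocated in the caller's pool:

static dag_node_t *
get_node_copy(fs_fs_dag_cache_t *cache,
              const char *path,
              apr_pool_t *pool)
{
  /* The cached object stays private to the cache, so no lock or
     ref count is needed; the caller owns the copy. */
  dag_node_t *cached = cache_lookup(cache, path);
  return cached ? copy_dag_node(cached, pool) : NULL;
}

Here, cache_lookup() and copy_dag_node() stand in for whatever
lookup and dup functions the back-end provides.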
Issue (3) still stands. A possible solution was reverted in
r1648538.
Number (4) is a temporary problem and possibly caused the
crash reported in the original post.
Tightening pool usage is worthwhile because there is almost
no FS API function that is inherently trivial. They have to walk
path hierarchies, read files, retry on alternative code paths, etc.
This ties back into issue (3), where traces of that work leak to
the API caller in the form of used pool memory. All the caching
makes memory consumption feel constant, but it is probably
possible to allocate a few MB with a seemingly harmless sequence
of FS calls.
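A minimal sketch of what tightened pool usage inside such a
function could look like (hypothetical helper names):

static svn_error_t *
get_node_at_path(dag_node_t **result_p,
                 svn_fs_t *fs,
                 const char *path,
                 apr_pool_t *pool)
{
  /* Everything allocated while walking the path hierarchy goes
     into a local subpool ... */
  apr_pool_t *scratch_pool = svn_pool_create(pool);

  SVN_ERR(walk_hierarchy(result_p, fs, path, pool, scratch_pool));

  /* ... and is released before we return.  Only *RESULT_P,
     allocated in the caller's POOL, survives. */
  svn_pool_destroy(scratch_pool);
  return SVN_NO_ERROR;
}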
> It strikes me that the root problem is amazingly similar to the issue we
> had with pool lifetimes and locking in the implementation of DB_REGISTER
> support in the BDB back-end (that one was my bug). At the time, I
> believe we found that no possible pool usage pattern could avoid bad
> interaction with reference-counted objects in pools, given the way pools
> currently work in APR.
>
> This is going to be a problem with any long-lived cache that depends on
> pools that are created outside our control, as is the case with
> HTTPd/mod_dav_svn, where pool lifetimes in general are completely
> unpredictable. Combined with unpredictable request patterns and even
> less predictable data patterns (since repository content is completely
> outside our control), this makes pools amazingly unsuitable for
> cross-request caches. Even the lifetime of global pools is unpredictable
> because we don't control global pool destruction order.
>
> Given that our API is what it is, I can see no alternative to fixing the
> way the caches work. Reference counting only works when you have
> complete control over the lifetime of both the cache and the referrer.
>
I'm not familiar with the DB_REGISTER issue, but
I assume that it somehow bound itself to, or extended, the
effective lifetime of an svn_fs_t. The DAG node cache
is different in that references to it never need to survive
past the boundary of any FS API call. They are merely
there to allow the equivalent of
fs_api_func(...)
{
  /* Careful: requesting one node must not invalidate the
     respective other. */
  do_func(get_node("A", lock_it), get_node("B", lock_it));

  /* Nobody cares about A and B any more at this point. */
}
Local sub-pools would safely limit the lock lifetime and
solve the problem of leaking implementation details.
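Concretely, something like this (the same hypothetical functions as
above, with get_node() binding the lock to the pool it is given)
would hold both locks for the duration of the call and release them,
together with all other per-call temporaries, before returning:

fs_api_func(...)
{
  /* POOL is the single pool passed in by the caller. */
  apr_pool_t *subpool = svn_pool_create(pool);

  /* Both nodes stay valid / locked exactly as long as
     SUBPOOL lives. */
  do_func(get_node("A", subpool), get_node("B", subpool));

  /* Releases the locks without leaking anything into the
     caller's POOL. */
  svn_pool_destroy(subpool);
}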
-- Stefan^2.