
Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Mon, 14 Apr 2014 18:26:41 +0200

On Sat, Apr 12, 2014 at 8:35 PM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:

> Stefan Fuhrmann wrote on Fri, Apr 11, 2014 at 22:01:53 +0200:
> > Now, if there were 50 requests for the *same* reconstructed data:
> >
> > file_t repo_files[50][20]
> > for k in 0..49 parallel_do
> >     for i in 0..19 : repo_files[k][i].open("revs/$i")
> >     result[k] = ""
> >     for i in 0..19 : result[k].combine(repo_files[k][i].read())
> >
> > Caches don't help if they don't contain the data:
> >
> > for k in 0..49 parallel_do
> >     result[k] = cache.lookup() // fails for all at the same time
> >     if result[k].missing then
> >         // moved sub-loops to sub-function for clarity
> >         result[k] = reconstruct(repo_files[k])
> >         cache.add(result[k])
> >
> > There are two major problems with that:
> >
> > (1) We process the same data 50 times while once would suffice.
> > SVN-internal caches did not help; however the OS may only
> > have read the data once and then fed us from disc cache.
>
> In other words: the problem is not eviction out of the disk cache but
> having to recreate the same fulltext many times.
>

Yes. It's about what has to happen before anything gets
*into* the cache(s), and about not wasting effort doing it.
The same principle applies to other data as well (revprop
packs, manifest / index files, etc.).

> > (2) We keep 1000 files open. On some systems, this may cause
> > resource shortages.
>
> It sounds like (2) is an independent problem: we open files in advance
> of actually needing them. (i.e., something the branch doesn't/won't
> change)
>

Yes and no. It is quite hard to change our file usage
pattern because we have to keep files open in case a
background 'svnadmin pack' runs, and the branch won't
change that.

However, having only one request open a bunch of files,
instead of many requests doing the same, avoids the
resource issue in many scenarios.

> > The ideal solution / control flow would look like this:
> >
> > for k = 0 do
> >     result[k] = reconstruct(repo_files[k])
> >     cache.add(result[k])
> >
> > for k in 1..49 parallel_do
> >     result[k] = cache.lookup()
>
> Just to be clear, result[k] is some fulltext, right? (of a file or of a
> directory)
>

Yes. Or a similar cacheable object.

> > Since we don't (can't?) coordinate requests on a global level, this
> > is what we do on the thunder branch:
>
> I'm not sure why you say we can't do global coordination. We can use
> memcached for the fulltext cache, can't we?

What I was trying to say is that we can't / won't order the
requests in the sense that one becomes the dedicated
"scout thread" and everyone else would then be explicitly
fed from cache only.

Instead, any random one may request a given piece of data
first, and all others wanting the same data will wait for a
short period. Then they check whether they can get the data
from cache, and read it from disk if they can't.

So, all requests are still very much independent of one
another, our caching does not need to be perfect, and there
are no timing constraints (cached data might eventually get
evicted, etc.).
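
To illustrate the wait-and-recheck idea, here is a minimal
Python sketch (the class and method names below are made up
for illustration, and timeout handling is omitted; the
branch's actual code is C):

    import threading
    import time

    class Thunder:
        """First thread to announce a location proceeds at once;
        followers sleep briefly, then fall back to the cache."""

        def __init__(self, sleep_length=0.25):
            self._lock = threading.Lock()
            self._in_flight = set()   # data locations being read right now
            self._sleep_length = sleep_length

        def coordinate(self, location):
            """Return True if the caller is the first for LOCATION."""
            with self._lock:
                if location not in self._in_flight:
                    self._in_flight.add(location)
                    return True             # caller should read and populate
            time.sleep(self._sleep_length)  # follower: give the first a chance
            return False                    # caller should retry the cache

        def completed(self, location):
            with self._lock:
                self._in_flight.discard(location)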

> The cache is a singleton,
> so the cache's getter is a chokepoint that all threads will go through,
> so couldn't we put the entire thunder logic there?
>
> For example:
>
> svn_error_t *
> svn_cache__get_and_if_absent_populate(void **value,
>                                       /* svn_boolean_t *found, */
>                                       svn_cache__t *cache,
>                                       const void *key,
>                                       svn_error_t *(*factory)(void **value,
>                                                               void *baton,
>                                                               apr_pool_t *scratch_pool),
>                                       void *factory_baton,
>                                       apr_pool_t *result_pool,
>                                       apr_pool_t *scratch_pool);
>
> where FACTORY is called if KEY is not in cache and either (a) the
> current thread is the first in the thunder, or (b) the current thread is
> not first in the thunder, but it has slept for X seconds and the cache
> is unpopulated.
>
> Basically, keep in libsvn_fs only the determination of _which_ cache
> keys are going to be hot paths, and the implementation of the "thunder"
> considerations (let's maybe sleep to give another thread a chance) in
> svn_cache__*().
>

Well, that would be one option. I feel +/-0 on the
cache interface change. I'll think about it.

Right now, the branch uses two wrapper functions around
the cache getters that achieve the same functionality
(e.g. svn_fs_fs__thundered_cache_get).
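
In rough Python terms, such a wrapper behaves like this
(illustrative only, reusing the Thunder sketch from above;
the real wrappers are C code in libsvn_fs_fs):

    def thundered_cache_get(cache, thunder, key, reconstruct):
        value = cache.lookup(key)
        if value is not None:
            return value                 # hot path: cache hit, no coordination

        first = thunder.coordinate(key)  # followers sleep inside coordinate()
        if not first:
            value = cache.lookup(key)    # retry after the short sleep
            if value is not None:
                return value             # >90% of followers end here

        try:
            value = reconstruct(key)     # first thread, or an unlucky follower
            cache.add(key, value)
        finally:
            if first:
                thunder.completed(key)
        return value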

> > for k in 0..49 parallel_do
> >     result[k] = cache.lookup()
> >     if result[k].missing then
> >
> >         token = thunder.coordinate(data_location)
> >         if token.another_got_to_read then // all but the first
> >             result[k] = cache.lookup()
> >             if result[k].ok : jump done // >90% hit rate
> >
> >         result[k] = reconstruct(repo_files[k])
> >         cache.add(result[k])
> >
> >         thunder.completed(token)
> >     done
> >
> > So, there is no penalty on the hot path, i.e. when data can be found
> > in the respective cache. The coordinating instance is also conceptually
> > simple (keep a list of all accesses in flight) and the delay for the
> > first thread is negligible. Concurrent threads reading the same location
> > will be blocked until the initial thread has completed its access. That
> > minimizes the code churn on the calling side.
> >
> > A timeout prevents rogue threads from blocking the whole system. Also,
> > entries that timed out will be removed from the access list. The rogue
> > thread would have to start another relevant data access (and be the
> > first) to block other threads for another time.
>
> Hmm. The way I imagined it, the non-first threads just do:
>
> if (I_am_first):
>     value = compute_value_of(key)
>     cache[key] = value
>     return value
> else:
>     time.sleep(SLEEP_LENGTH)
>     try:
>         return cache[key]
>     except KeyError:
>         value = compute_value_of(key)
>         cache[key] = value
>         return value
>
> That way, no thread ever calls sleep() more than once, so threads will
> never be delayed by more than SLEEP_LENGTH.
>
> (This should address Ivan's second concern.)
>

Well, this is roughly equivalent to what I implemented
initially. The difference was that I used a timed wait
instead of sleep() so that the waiting threads would resume
as soon as possible.

Testing showed that all this, combined with the various
mutex ops around it, created a 20% overhead and some
hard-to-interpret CPU load stats. So, in r1586964, I changed
it to simply use sleep() and removed all the bits that were
no longer necessary. Now things seem to scale nicely in many
scenarios, with extra clients increasing the processing time
only slightly.
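
For comparison, the two wait strategies look roughly like
this in Python (illustrative only, not the branch's code):

    import threading
    import time

    cond = threading.Condition()

    def wait_timed(timeout):
        # initial implementation: every waiter takes the shared lock
        # and is woken early by notify_all() when the first reader
        # finishes -- lowest latency, but mutex traffic per waiter
        with cond:
            cond.wait(timeout)

    def wait_simple(timeout):
        # r1586964: no shared state on the wait path; always sleeps
        # the full period, but avoids the mutex overhead entirely
        time.sleep(timeout)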

> > My plan is to test the code on my server setup at home to get a more
> > nuanced picture of what the performance and scalability impact is.
> > On my SSD macbook, I get 70% less total CPU cost, 50% higher throughput
> > and smooth handling of 200 concurrent c/o (vs running out of file handles
> > at 100 clients) over ra_svn.
> >
>
> FWIW, what I wonder about is how well the change scales down (to small
> setups) and how the code is going to look with the feature added.
>

It's all implemented and ready for review now.

So far, I haven't seen any measurable impact on
single-request scenarios. It is clear that small, concurrent
requests may see an extra delay due to the sleep. But since
we sleep only if I/O (or at least OS interaction) is
necessary, and only for 250ms max, the added user-visible
delay will be short and, in most scenarios, infrequent.

> > Should these trends be confirmed for "real" HW, networks and ra_serf,
> > I'd love to see this code in 1.9 - after due review and feedback.
>
> Having it in 1.9 if it's ready in time would be great. That said, we're
> releasing an alpha tomorrow, and I'm honestly not sure when it would be
> too late to add this. (That is, I'm not sure when we enter feature
> freeze. We don't allow any new features after an RC; I'm not sure how
> long before an RC a feature such as this one would have to be merged.)
>

Well, I think the code is relatively simple and robust.
The interaction with the existing code is also clear.

I've been testing the code in the scenarios that prompted
me to attempt this solution, and things look promising.
But it still needs broader testing, and I want to be able to
fully explain all the data that I see. In the end, I need to
be able to justify releasing it in 1.9. Otherwise, it would
probably go into 1.10. I'll continue work on that once I
come back after Easter.

Until then, review and feedback would be very much appreciated.

-- Stefan^2.
Received on 2014-04-14 18:28:05 CEST
