Stefan Fuhrmann wrote on Fri, Apr 11, 2014 at 22:01:53 +0200:
> Now, if there were 50 requests for the *same* reconstructed data:
>
> file_t repo_files[50][20]
> for k in 0..49 parallel_do
>   for i in 0..19 : repo_files[k][i].open("revs/$i")
>   result[k] = ""
>   for i in 0..19 : result[k].combine(repo_files[k][i].read())
>
> Caches don't help if they don't contain the data:
>
> for k in 0..49 parallel_do
>   result[k] = cache.lookup()   // fails for all at the same time
>   if result[k].missing then
>     // moved sub-loops to sub-function for clarity
>     result[k] = reconstruct(repo_files[k])
>     cache.add(result[k])
>
> There are two major problems with that:
>
> (1) We process the same data 50 times while once would suffice.
> SVN-internal caches did not help; however the OS may only
> have read the data once and then fed us from disc cache.
In other words: the problem is not eviction out of the disk cache but
having to recreate the same fulltext many times.
> (2) We keep 1000 files open. On some systems, this may cause
> resource shortages.
It sounds like (2) is an independent problem: we open files in advance
of actually needing them. (i.e., something the branch doesn't/won't change)
> The ideal solution / control flow would look like this:
>
> for k = 0 do
>   result[k] = reconstruct(repo_files[k])
>   cache.add(result[k])
>
> for k in 1..49 parallel_do
>   result[k] = cache.lookup()
Just to be clear, result[k] is some fulltext, right? (of a file or of a
directory)
> Since we don't (can't?) coordinate requests on a global level, this
> is what we do on the thunder branch:
I'm not sure why you say we can't do global coordination. We can use
memcached for the fulltext cache, can't we? The cache is a singleton, so
its getter is a chokepoint that every thread goes through; couldn't we put
the entire thunder logic there?
For example:
svn_error_t *
svn_cache__get_and_if_absent_populate(void **value,
                                      /* svn_boolean_t *found, */
                                      svn_cache__t *cache,
                                      const void *key,
                                      svn_error_t *(*factory)(void **value,
                                                              void *baton,
                                                              apr_pool_t *scratch_pool),
                                      void *factory_baton,
                                      apr_pool_t *result_pool,
                                      apr_pool_t *scratch_pool);
where FACTORY is called if KEY is not in cache and either (a) the current
thread is the first in the thunder, or (b) the current thread is not first
in the thunder, but it has slept for X seconds and the cache is still
unpopulated.
Basically, keep in libsvn_fs only the determination of _which_ cache keys
are going to be hot paths, and put the implementation of the "thunder"
considerations (let's maybe sleep to give another thread a chance) in
svn_cache__*().
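For the sake of discussion, here is a very rough sketch of what that getter
could look like.  It uses the existing svn_cache__get()/svn_cache__set()
private API; the we_are_first_in_thunder() helper and SLEEP_LENGTH are
placeholders of my own invention for whatever coordination mechanism we'd
actually pick, and pool lifetimes are glossed over:

#include <apr_time.h>

#include "svn_error.h"
#include "svn_types.h"
#include "private/svn_cache.h"

#define SLEEP_LENGTH apr_time_from_sec(1)   /* arbitrary; the "X seconds" above */

/* Placeholder for the open question: is this thread the first in the
   thunder for KEY?  (Always answering "yes" degenerates into today's
   behaviour, where every thread recomputes the value.) */
static svn_boolean_t
we_are_first_in_thunder(svn_cache__t *cache, const void *key)
{
  return TRUE;
}

svn_error_t *
svn_cache__get_and_if_absent_populate(void **value,
                                      svn_cache__t *cache,
                                      const void *key,
                                      svn_error_t *(*factory)(void **value,
                                                              void *baton,
                                                              apr_pool_t *scratch_pool),
                                      void *factory_baton,
                                      apr_pool_t *result_pool,
                                      apr_pool_t *scratch_pool)
{
  svn_boolean_t found;

  /* Hot path: the value is already cached; no extra cost. */
  SVN_ERR(svn_cache__get(value, &found, cache, key, result_pool));
  if (found)
    return SVN_NO_ERROR;

  if (! we_are_first_in_thunder(cache, key))
    {
      /* Case (b): give the first thread a chance to populate the cache,
         then look again.  No thread ever sleeps more than once. */
      apr_sleep(SLEEP_LENGTH);
      SVN_ERR(svn_cache__get(value, &found, cache, key, result_pool));
      if (found)
        return SVN_NO_ERROR;
    }

  /* Case (a), or the first thread didn't finish in time: compute the
     value ourselves and publish it for everybody else.  (FACTORY is given
     RESULT_POOL here so the value survives for our caller.) */
  SVN_ERR(factory(value, factory_baton, result_pool));
  SVN_ERR(svn_cache__set(cache, key, *value, scratch_pool));

  return SVN_NO_ERROR;
}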
> for k in 0..49 parallel_do
>   result[k] = cache.lookup()
>   if result[k].missing then
>
>     token = thunder.coordinate(data_location)
>     if token.another_got_to_read then   // all but the first
>       result[k] = cache.lookup()
>       if result[k].ok : jump done       // >90% hit rate
>
>     result[k] = reconstruct(repo_files[k])
>     cache.add(result[k])
>
>     thunder.completed(token)
>   done
>
> So, there is no penalty on the hot path, i.e. when data can be found
> in the respective cache. The coordinating instance is also conceptually
> simple (keep a list of all accesses in flight) and the delay for the
> first thread is negligible. Concurrent threads reading the same location
> will be blocked until the initial thread has completed its access. That
> minimizes the code churn on the calling side.
>
> A timeout prevents rogue threads from blocking the whole system. Also,
> entries that timed out will be removed from the access list. The rogue
> thread would have to start another relevant data access (and be the first)
> to block other threads for another time.
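(For concreteness, here is my guess at what thunder.coordinate() and
thunder.completed() amount to; this is only a sketch based on the
description above, not the branch's actual code, and all names as well as
the choice of APR primitives are invented by me:)

#include <apr_hash.h>
#include <apr_pools.h>
#include <apr_strings.h>
#include <apr_thread_cond.h>
#include <apr_thread_mutex.h>
#include <apr_time.h>

typedef struct thunder_t
{
  apr_thread_mutex_t *mutex;
  apr_thread_cond_t *cond;
  apr_hash_t *in_flight;        /* data_location -> (apr_time_t *)start time */
  apr_interval_time_t timeout;  /* evict in-flight entries older than this */
  apr_pool_t *pool;
} thunder_t;

/* Return TRUE if another thread was already reading DATA_LOCATION (so the
   caller should re-check the cache first), FALSE if the caller is the
   first reader and must call thunder_completed() when it is done. */
static int
thunder_coordinate(thunder_t *t, const char *data_location)
{
  int another_got_to_read = 0;

  apr_thread_mutex_lock(t->mutex);
  while (1)
    {
      apr_time_t *start = apr_hash_get(t->in_flight, data_location,
                                       APR_HASH_KEY_STRING);
      if (start == NULL)
        break;                  /* nobody in flight: we become the first */

      if (apr_time_now() - *start > t->timeout)
        {
          /* The first reader looks stuck (rogue); evict its entry. */
          apr_hash_set(t->in_flight, data_location, APR_HASH_KEY_STRING, NULL);
          break;
        }

      /* Block until the first reader finishes or the timeout expires. */
      another_got_to_read = 1;
      apr_thread_cond_timedwait(t->cond, t->mutex, t->timeout);
    }

  if (! another_got_to_read)
    {
      /* Register ourselves as the in-flight reader for this location. */
      apr_time_t *start = apr_palloc(t->pool, sizeof(*start));
      *start = apr_time_now();
      apr_hash_set(t->in_flight, apr_pstrdup(t->pool, data_location),
                   APR_HASH_KEY_STRING, start);
    }
  apr_thread_mutex_unlock(t->mutex);

  return another_got_to_read;
}

static void
thunder_completed(thunder_t *t, const char *data_location)
{
  apr_thread_mutex_lock(t->mutex);
  apr_hash_set(t->in_flight, data_location, APR_HASH_KEY_STRING, NULL);
  apr_thread_cond_broadcast(t->cond);
  apr_thread_mutex_unlock(t->mutex);
}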
Hmm. The way I imagined it, the non-first threads just do:
if (I_am_first):
    value = compute_value_of(key)
    cache[key] = value
    return value
else:
    time.sleep(SLEEP_LENGTH)
    try:
        return cache[key]
    except KeyError:
        value = compute_value_of(key)
        cache[key] = value
        return value
That way, no thread ever calls sleep() more than once, so threads will
never be delayed by more than SLEEP_LENGTH.
(This should address Ivan's second concern.)
> My plan is to test the code on my server setup at home to get a more
> nuanced picture of what the performance and scalability impact is.
> On my SSD macbook, I get 70% less total CPU cost, 50% higher throughput
> and smooth handling of 200 concurrent c/o (vs running out of file handles
> at 100 clients) over ra_svn.
>
FWIW, what I wonder about is how well the change scales down (to small
setups) and how the code is going to look with the feature added.
> Should these trends be confirmed for "real" HW, networks and ra_serf,
> I'd love to see this code in 1.9 - after due review and feedback.
Having it in 1.9 if it's ready in time would be great. That said, we're
releasing an alpha tomorrow, and I'm honestly not sure when it would be
too late to add this. (That is, I'm not sure when we enter feature
freeze. We don't allow any new features after an RC; I'm not sure how
long before an RC a feature such as this one would have to be merged.)
Thanks for the explanation,
Daniel