It seems that the document in notes/ did not make it clear what
the actual problem is and how it applies to Subversion servers.
Let me try to illustrate it.
Assume we want to reconstruct a single file from the repository
as part of a single request; this is what we effectively do
(omitting minor details):
file_t repo_files[20]
for i in 0..19 : repo_files[i].open("revs/$i")
result = ""
for i in 0..19 : result.combine(repo_files[i].read())
Now, if there were 50 requests for the *same* reconstructed data:
file_t repo_files[50][20]
for k in 0..49 parallel_do
    for i in 0..19 : repo_files[k][i].open("revs/$i")
    result[k] = ""
    for i in 0..19 : result[k].combine(repo_files[k][i].read())
Caches don't help if they don't contain the data:
for k in 0..49 parallel_do
    result[k] = cache.lookup()    // fails for all at the same time
    if result[k].missing then
        // moved sub-loops to a sub-function for clarity
        result[k] = reconstruct(repo_files[k])
        cache.add(result[k])
There are two major problems with that:
(1) We process the same data 50 times while once would suffice.
    SVN-internal caches did not help; however, the OS may only
    have read the data once and then fed us from the disk cache.
(2) We keep 1000 files open (50 requests x 20 files each). On
    some systems, this may cause resource shortages.
How likely is the above scenario in SVN? An operation like
checkout may take many minutes to complete. The first client to
do the c/o will read data from disk and populate the server caches.
Any other client coming in later will be much faster since it gets
fed from cache.
If new c/o requests keep coming in before the first one completes,
those extra requests have a good chance of "catching up" with the
first one. In cases like ra_svn, which has a fully deterministic
reporting order, all requests have a chance to gang up into the "50
requests" scenario above. And they will do it over and over for
many files to come.
With ra_serf, things are slightly more subtle, if the clients
randomize their requests (I am not sure they do). For them, it is
metadata (revprop packs, indexes) and data layout (temporal locality
being correlated with spatial locality) that will see the issue -
albeit in a more distributed fashion (e.g. 10 locations with 5
readers each instead of 1 with 50).
The ideal solution / control flow would look like this:
for k = 0 do
    result[k] = reconstruct(repo_files[k])
    cache.add(result[k])
for k in 1..49 parallel_do
    result[k] = cache.lookup()
Since we don't (can't?) coordinate requests on a global level, this
is what we do on the thunder branch:
for k in 0..49 parallel_do
    result[k] = cache.lookup()
    if result[k].missing then
        token = thunder.coordinate(data_location)
        if token.another_got_to_read then   // all but the first
            result[k] = cache.lookup()
            if result[k].ok : jump done     // >90% hit rate
        result[k] = reconstruct(repo_files[k])
        cache.add(result[k])
        thunder.completed(token)
        done
So, there is no penalty on the hot path, i.e. when the data can be
found in the respective cache. The coordinating instance (sketched
below) is also conceptually simple - keep a list of all accesses in
flight - and the delay for the first thread is negligible. Concurrent
threads reading the same location will be blocked until the initial
thread has completed its access. Because the blocking happens inside
the coordinator, code churn on the calling side stays minimal.
A timeout prevents rogue threads from blocking the whole system.
Also, entries that timed out will be removed from the access list.
A rogue thread would have to start another relevant data access (and
be the first) to block other threads a second time.
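
To make that concrete, here is a minimal sketch of such a coordinator
in C using plain pthreads. This is not the actual thunder-branch code
(Subversion builds on APR rather than raw pthreads), and all names,
the table size and the timeout value are made up for illustration:

    /* Sketch only: a fixed-size list of data accesses in flight.
     * The first thread to ask for a location gets to read it;
     * later threads block until completion or a timeout. */
    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <string.h>
    #include <time.h>

    #define MAX_IN_FLIGHT 64
    #define TIMEOUT_SECS 10       /* guards against rogue threads */

    typedef struct access_t {
      char location[256];         /* identifies the data being read */
      bool active;
      pthread_cond_t done;        /* signaled when the read completes */
    } access_t;

    typedef struct thunder_t {
      pthread_mutex_t lock;
      access_t slots[MAX_IN_FLIGHT];
    } thunder_t;

    typedef struct token_t {
      access_t *slot;             /* NULL for all but the first thread */
      bool another_got_to_read;
    } token_t;

    static void
    thunder_init(thunder_t *t)
    {
      pthread_mutex_init(&t->lock, NULL);
      for (int i = 0; i < MAX_IN_FLIGHT; i++)
        {
          t->slots[i].active = false;
          pthread_cond_init(&t->slots[i].done, NULL);
        }
    }

    /* The first caller for LOCATION becomes the reader; every later
     * caller blocks (bounded by TIMEOUT_SECS) and is then told to
     * re-check the cache. */
    static token_t
    thunder_coordinate(thunder_t *t, const char *location)
    {
      token_t token = { NULL, false };

      pthread_mutex_lock(&t->lock);
      for (int i = 0; i < MAX_IN_FLIGHT; i++)
        if (t->slots[i].active
            && strcmp(t->slots[i].location, location) == 0)
          {
            /* Another thread is already reading this location: wait. */
            struct timespec deadline;
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += TIMEOUT_SECS;
            if (pthread_cond_timedwait(&t->slots[i].done, &t->lock,
                                       &deadline) == ETIMEDOUT)
              t->slots[i].active = false;   /* drop the stale entry */
            token.another_got_to_read = true;
            pthread_mutex_unlock(&t->lock);
            return token;
          }

      for (int i = 0; i < MAX_IN_FLIGHT; i++)
        if (!t->slots[i].active)
          {
            /* We are first: register this access as in flight. */
            t->slots[i].active = true;
            strncpy(t->slots[i].location, location,
                    sizeof(t->slots[i].location) - 1);
            t->slots[i].location[sizeof(t->slots[i].location) - 1] = '\0';
            token.slot = &t->slots[i];
            break;
          }
      pthread_mutex_unlock(&t->lock);
      return token;
    }

    /* The reader is done, i.e. the data is in the cache now: remove
     * the entry from the list and wake all threads blocked on it. */
    static void
    thunder_completed(thunder_t *t, token_t *token)
    {
      if (token->slot == NULL)
        return;                   /* we were not the registered reader */
      pthread_mutex_lock(&t->lock);
      token->slot->active = false;
      pthread_cond_broadcast(&token->slot->done);
      pthread_mutex_unlock(&t->lock);
    }

On the calling side, this maps one-to-one onto the pseudocode above:
call thunder_coordinate() after a cache miss, re-check the cache if
another_got_to_read is set, and call thunder_completed() once the
reconstructed data has been added to the cache.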
My plan is to test the code on my server setup at home to get a more
nuanced picture of the performance and scalability impact. On my SSD
MacBook, I get 70% less total CPU cost, 50% higher throughput and
smooth handling of 200 concurrent c/o over ra_svn (vs. running out of
file handles at 100 clients).
Should these trends be confirmed for "real" HW, networks and ra_serf,
I'd love to see this code in 1.9 - after due review and feedback.
-- Stefan^2.