One of WANdisco's customers brought the performance of "svn lock *" to
my attention. There is a network issue over HTTP and a filesystem issue
in the repository.
When locking multiple files over HTTP the client sends a separate LOCK
request for each file and the round trip delays add up. The bandwidth
overhead is also high: each request carries a full set of HTTP headers,
so one request per path is an inefficient way to transport paths.
On the repository side creating or removing a lock involves writing an
index file for each parent directory in addition to handling the lock
file itself. Locking N separate files in a directory '/A/B/C' involves
writing the index files for '/', '/A', '/A/B' and '/A/B/C' N times each,
as well as handling the N lock files. To lock N files at depth D we do
O(N*D) writes but only modify O(N+D) distinct files, so it doesn't scale
very well.
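
To make the O(N*D) versus O(N+D) gap concrete, here is a rough Python
sketch that only counts writes. It is a model of the arithmetic above,
not of the real FS code; the helper names and the example paths are
made up for illustration.

```python
from posixpath import dirname


def ancestor_dirs(path):
    """Return every directory above `path`, from its parent up to '/'."""
    dirs = []
    d = dirname(path)
    while True:
        dirs.append(d)
        if d == "/":
            break
        d = dirname(d)
    return dirs


def index_writes_per_path(paths):
    """Index-file writes when every path is locked by a separate call."""
    return sum(len(ancestor_dirs(p)) for p in paths)


def distinct_index_files(paths):
    """Number of different index files those writes actually touch."""
    return len({d for p in paths for d in ancestor_dirs(p)})


# Hypothetical workload: N = 100 files in '/A/B/C', so D = 4 index files.
paths = [f"/A/B/C/file{i}" for i in range(100)]
print(index_writes_per_path(paths), distinct_index_files(paths))
```

With these assumed inputs the per-path approach rewrites the same four
index files one hundred times each, on top of handling the N lock files.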
We already pass multiple paths into svn_ra_lock so we could address part
of the network problem by rewriting some serf code to make it pipeline
the LOCK requests. That would have the advantage of working with older
servers but to solve all the problems we need to make HTTP more like the
svn protocol: send a single request (perhaps POST instead of LOCK?) for
the repository root and pass all the paths in the body of the request.
Once we have all the paths arriving at the server in one request we can
add new FS APIs to lock/unlock multiple paths, then sort the paths and
write each index file only once.
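
A minimal sketch of that idea in Python, assuming a hypothetical
multi-path entry point. `lock_many`, `write_lock_file` and
`write_index_file` are invented names, not existing FS APIs, and what an
index file actually records is simplified here to "the locked paths
below this directory".

```python
from posixpath import dirname


def lock_many(paths, write_lock_file, write_index_file):
    """Lock all paths, writing each ancestor index file only once.

    Both callbacks are hypothetical stand-ins for the real FS operations.
    """
    pending = {}  # directory -> locked paths to record in its index
    for path in sorted(set(paths)):
        write_lock_file(path)
        d = dirname(path)
        while True:
            pending.setdefault(d, []).append(path)
            if d == "/":
                break
            d = dirname(d)
    # One write per distinct directory, in sorted order.
    for directory in sorted(pending):
        write_index_file(directory, pending[directory])
```

A real implementation would also have to merge the new entries with
whatever each index file already contains; the sketch ignores that.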
I'm not quite sure how hooks would behave. We would need to run all the
pre-lock hooks first and some could fail. We could drop the paths that
fail and pass the rest to the FS layer, or perhaps fail the whole
operation if any pre-lock fails. Either way the FS layer may fail to
lock some of the paths for various reasons (non-existent, already
locked, etc.) and so the final set of locks could be smaller than the
set of paths. Finally we run post-lock for the subset of paths that are
locked.
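
The two hook policies could be sketched roughly as follows. Everything
here is hypothetical: `pre_lock`, `fs_lock_many` and `post_lock` are
plain callables standing in for the hooks and the FS layer, and the
`fail_on_any_pre_lock` switch just selects between the two options
described above.

```python
def lock_with_hooks(paths, pre_lock, fs_lock_many, post_lock,
                    fail_on_any_pre_lock=False):
    """Run pre-lock for every path, lock what is allowed, then post-lock.

    pre_lock(path) -> bool, fs_lock_many(paths) -> list of locked paths
    and post_lock(paths) are assumed interfaces, not real Subversion APIs.
    """
    # Run all the pre-lock hooks first; some may reject their path.
    allowed = [p for p in paths if pre_lock(p)]
    if fail_on_any_pre_lock and len(allowed) != len(paths):
        return []  # strict policy: one rejection fails the whole operation
    # The FS layer may still refuse some paths (missing, already locked).
    locked = fs_lock_many(allowed)
    # post-lock only sees the paths that really ended up locked.
    if locked:
        post_lock(locked)
    return locked
```

Either way the caller has to compare the returned list with the paths it
asked for to find out which locks were not taken.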
There is also an FS atomicity issue to consider. The current single path
API can be interrupted between writing index files and handling the lock
file, but the result is that the single path is either locked or unlocked,
with any "broken" index files being invisible to the user (although the
"broken" index files may cause more work for the server). A multiple path
API, if interrupted, could leave user-visible changes on only a subset of
the paths. I think
that would be OK but it needs a bit more thought.
I've also noticed that we don't fsync any files when writing locks into
the repository. I'm not sure if this is deliberate or not but if we
were to start calling fsync then the filesystem issue would become more
important.
--
Philip Martin | Subversion Committer
WANdisco // *Non-Stop Data*
Received on 2013-12-17 13:49:21 CET