On 9/26/13 2:34 AM, Stefan Sperling wrote:
> This sounds as if the performance issue was on the server side.
I ended up helping answer some of the questions behind this particular issue
and you are entirely correct.
> The client requests a list of changed paths if 'svn log' is run
> with the --verbose option (some GUI clients do this by default).
> Adding an option to the client that filters out some data before
> displaying it won't help at all because the server will still perform
> work to obtain and send the changed paths list anyway.
Really, this isn't what's driving the problem. Coming up with the list of paths
is not particularly difficult work.
> The real fix for this issue would be scalability improvements on the server.
> That implies finding the actual bottleneck. If they have authz enabled
> and the bottleneck is checking authz rules against 300k paths, then
> there is not much we can do about this because every path needs to
> be matched to enforce access rules. Some performance penalty is expected
> and documented in this case.
> See http://svnbook.red-bean.com/en/1.7/svn.serverconfig.pathbasedauthz.html
> ("in certain situations, there's very noticeable performance loss")
> They can measure the authz performance penalty by temporarily disabling
> authz and running the log request without it.
>
> I know of a case where someone imported close to a million paths in
> revision 1, and when they run 'svn log -v' against the authz-enabled
> server it takes forever to gather log information for revision 1.
> All other revisions which changed much fewer paths are fine, and
> the problem is mitigated if authz is disabled.
Like I said above, -v isn't particularly problematic. The real problem, as you
suspected, was the authz process.
There are basically 3 possible states of access when processing a revision for log:
1) Client has full access to all the paths changed in the revision. We grant
access to all revision properties, which for the most part means you can get
svn:log in addition to other revision properties like svn:date and svn:author.
We also provide all the changed paths (in the case of the -v option).
2) Client has access to some but not all of the paths that have been changed in
the revision. We only provide svn:date and svn:author revision properties. We
elide all other revision properties because they may contain information, such
as the identity of changed files, that the user does not have access to. Our
own format for log messages was part of the driver behind this behavior.
Changed paths (again -v) are provided only for the paths they have access to.
3) Client has no access to any paths changed by the revision. We do not
provide any information about the revision.
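To make the three states concrete, here is a minimal sketch in Python. This is
not the actual server code; classify_revision, log_entry and the can_read
callback are hypothetical names used only for illustration, and treating a
revision with no changed paths as fully readable is an assumption of the sketch.

```python
from enum import Enum


class Access(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


def classify_revision(changed_paths, can_read):
    """Return (state, readable_paths) for one revision.

    Every changed path is checked, which is where the cost comes from.
    """
    readable = [p for p in changed_paths if can_read(p)]
    if len(readable) == len(changed_paths):
        return Access.FULL, readable
    if readable:
        return Access.PARTIAL, readable
    return Access.NONE, readable


def log_entry(revprops, changed_paths, can_read, verbose=False):
    """Build what the client gets to see for one revision, or None."""
    state, readable = classify_revision(changed_paths, can_read)
    if state is Access.NONE:
        return None  # state 3: nothing about the revision is revealed
    if state is Access.FULL:
        props = dict(revprops)  # state 1: all revision properties
    else:
        # state 2: only svn:date and svn:author survive
        props = {k: v for k, v in revprops.items()
                 if k in ("svn:date", "svn:author")}
    entry = {"revprops": props}
    if verbose:
        entry["changed_paths"] = readable
    return entry
```

Note that classify_revision runs whether or not verbose is set.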
Due to the above behavior, if authz is enabled we always have to calculate the
changed paths for the log. So the only difference in performance between a -v
request and one without it is serializing the paths and sending them across the
wire. Obviously, if authz is not enabled, -v determines whether we need to come
up with changed paths at all, and thus log without -v can be more efficient.
Additionally, in your million path import case, even doing the log against a
subset of the tree that has a much smaller number of files won't help, since
the authz process has to use all the changed paths, not just the subset the user
wants to see, to determine if the user has access to all the revision properties.
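A rough way to see why neither dropping -v nor narrowing the log target helps
is to count the checks. The following Python sketch is only a model of the
argument above, with authz enabled; log_revision_cost and can_read are
hypothetical names, not anything in our code base.

```python
def log_revision_cost(changed_paths, can_read, target, verbose):
    """Model the work done for one revision of a log request with authz on."""
    # The target only decides whether the revision is of interest at all.
    prefix = target.rstrip("/") + "/"
    if not any(p == target or p.startswith(prefix) for p in changed_paths):
        return {"authz_checks": 0, "paths_sent": 0}

    # Once the revision is of interest, every changed path is checked,
    # because access to the revision properties depends on all of them.
    checks = 0
    readable = 0
    for path in changed_paths:
        checks += 1
        if can_read(path):
            readable += 1

    # -v only changes how many paths are serialized and sent.
    return {"authz_checks": checks,
            "paths_sent": readable if verbose else 0}
```

For a revision that changed 1000 paths, a log against a directory containing
only 10 of them still performs 1000 checks in this model, with or without -v.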
All of this comes down to the desire to have some paths in the repository
completely hidden, not just the contents, but even the names of the paths. As
Stefan Fuhrmann pointed out in our WANdisco discussion on this, there is good
reason for desiring to do this. Say you're developing a presentation about a
new client that you want to keep quiet. You may name your presentation
"name_of_client.pptx", and exposing that file name would reveal some very
important and confidential information.
Right now we provide absolutely no way to configure this behavior. You can't
decide if you care about hiding paths. If you don't care about hiding path
names then you're paying a rather hefty performance cost to do so with no real
benefit.
Consider typical Unix permissions of read, write, and execute. Execute on a
directory determines if you can access the children of the directory. We have
no equivalent in authz. Instead we only have read and write permissions, with
read on the file or directory providing the ability to see that path. Having
something along these lines would help some, because we could walk down to
directories and not have to check every file path, at least insofar as most
repositories are probably going to have more files than they have directories.
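As a rough illustration of the saving, here is a Python sketch of what checking
at the directory level could look like. No such permission exists in authz
today; can_enter_dir and visible_paths are hypothetical, and the rule that a
path is visible when every ancestor directory can be entered is an assumption
borrowed from the Unix model.

```python
import posixpath


def visible_paths(changed_paths, can_enter_dir):
    """Return (visible_paths, number_of_directory_checks).

    Each distinct directory is checked at most once, so the number of
    checks scales with directories rather than with files.
    """
    decided = {}  # directory -> bool, remembered for the whole revision
    checks = 0

    def dir_visible(directory):
        nonlocal checks
        if directory in decided:
            return decided[directory]
        checks += 1
        ok = can_enter_dir(directory)
        if ok and directory != "/":
            # A directory is only reachable if its parent is too.
            ok = dir_visible(posixpath.dirname(directory))
        decided[directory] = ok
        return ok

    visible = [p for p in changed_paths
               if dir_visible(posixpath.dirname(p))]
    return visible, checks
```

With 1000 changed files spread over 10 directories under /proj, this does 12
directory checks (the 10 directories, /proj and /) instead of 1000 path checks.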
However, I don't think adding a new permission is something we should undertake
with the current authz system. I'd imagine we'd have to implement a new authz
format in order to determine when to apply this sort of behavior so that
existing authz files are not broken. We also have other projects like ViewVC
that know how to read our authz files, so making a change bubbles out beyond
just our code base. I'd much rather see something like this dealt with by some
future ACL in the repo system that replaces our hacked-on authz system.
In the meantime perhaps we should add a configuration option to our servers to
let administrators decide if they want this level of restrictive behavior. In
this particular case Medtronic didn't care about protecting path names, only
the content of certain files. However, I'm concerned that adding such a flag is
very dangerous from a security perspective, because then preventing unintended
leaks is on every user, who must ensure confidential information is not placed
in their directory names, file names or commit logs. I think that's an awful lot
of responsibility to push off onto users. But maybe we can put enough
documentation and warnings around such an option to mitigate those concerns.
> Perhaps your customers are running into another bottleneck, but I think
> that's unlikely. But in this case tuning server-side caching (with
> Subversion 1.7 and up on the server) might mitigate the issue somewhat.
Sadly, caching as currently implemented won't really help much, since the
problem is driven by the authz process as described above.
Implementing some caching in the authz process would help. We've had some
discussions about caching authz data. Unfortunately, the approaches we've
discussed wouldn't really help here, because the log authz processing is not
slow due to many duplicate paths being checked in the same connection, but
rather because of the quantity of paths that need to be checked. So any
caching solution would have to be more along the lines of the per-process
caching we have implemented for filesystem data.
Received on 2013-09-27 00:12:41 CEST