On 10.02.2012 12:21, Branko Čibej wrote:
> On 07.02.2012 22:24, Stefan Fuhrmann wrote:
>> On 07.02.2012 00:41, Greg Stein wrote:
>>> In most data storage mechanisms for the repository, inheritable
>>> properties are a performance killer.
>> I'm not sure that this is actually applicable to SVN
>> for two reasons:
>> (1) we use deltification and
> I have absolutely no idea how deltification helps with inheritable
> properties.
Obviously. There are two important points to make here.
First, a system like Subversion *must* use some sort of
fragmented / multi-step data access. Indirect access to
properties is not something extraordinary here. Second,
access to large databases is dominated by their physical
organization. Details on both points below.
>> (2) we often handle whole file trees
> Neither here nor there.
On the contrary. This is an essential difference from e.g. NTFS.
Subversion reads individual nodes only very rarely while
most OSes can open single files only. Checking for props,
reading and finally evaluating them must be as fast as possible.
Inherited properties eliminate the need to read props on
most nodes (only checking that there is no local override).
Even the evaluation of e.g. inherited ACLs may be skipped
if the semantics have been chosen appropriately. This is a
perfect example of how elimination of redundancy (e.g. by
"deltification") improves performance rather than incurring
a penalty.
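
To illustrate the point about whole-tree operations, here is a minimal
sketch in Python. It is not Subversion code; the Node class, the walk()
function and the "acl" property name are all hypothetical. It only shows
that a top-down traversal can carry the inherited values along, so nodes
without a local override need a cheap check and nothing more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; props holds local overrides only."""
    name: str
    props: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(node, inherited=None, prefix=""):
    """Yield (path, effective_props) for every node, top-down."""
    inherited = inherited or {}
    path = f"{prefix}/{node.name}" if node.name else ""
    # Only nodes with local props pay for a merge; all others simply
    # reuse the dict object handed down by their parent.
    effective = {**inherited, **node.props} if node.props else inherited
    yield (path or "/", effective)
    for child in node.children:
        yield from walk(child, effective, path)
```

In this sketch, a subtree without overrides shares one props object with
its parent, so the per-node cost is the emptiness check on node.props.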
> Inheritable properties would be /relatively/ less of a killer in SVN
> backends because we're already doing lookups the silly way, i.e., a
> lookup for /a/b/c will resolve and read /a and a/b while searching for
> .../c, so it's not much extra work to keep the current values of
> inheritable properties in the lookup context.
The silly part of FSFS is that it does not yet optimize access
paths but stores changes individually. The challenge
is our two-dimensional key space (path and revision) and the
fact that different operations traverse the data along
different dimensions (e.g. log vs. checkout).
With my latest commit, the caching code allows for more
or less O(1) access / O(n) traversal along these dimensions.
> A proper lookup would jump straight to /a/b/c without examining the
> intermediate directories, and /then/ it would have to climb back up the
> tree to find inheritable props (or ACLs, same difference in this case).
> For a real filesystem, that's definitely a performance killer, and the
> reason why NTFS fakes ACL inheritance. The assumption is that you'll be
> changing inheritable ACLs a lot less often than you will be reading
> them, so the storage/performance tradeoff is definitely worth it.
Question: how many entries would a direct lookup structure
need to have (i.e. path_at_rev -> data pointer)? Keep in
mind that many valid paths like /branches/foo/bar will never
be mentioned anywhere in an SVN repository because they
never got touched under that name. A rough estimate for
a fully expanded list of entries is
#nodesInTrunk * #openBranches * #revisions
This yields 10^9 entries for small repositories and >10^14
for KDE-sized ones. Clearly impractical.
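
The arithmetic behind that estimate, as a small Python sketch. The
function name and the input figures are assumptions chosen only to
reproduce the 10^9 order of magnitude; they are not measurements of any
real repository.

```python
def expanded_entries(nodes_in_trunk, open_branches, revisions):
    """Entries in a fully expanded path_at_rev -> data pointer table."""
    return nodes_in_trunk * open_branches * revisions


# Hypothetical "small" repository: 1,000 nodes in trunk,
# 10 open branches, 100,000 revisions.
small = expanded_entries(1_000, 10, 100_000)
print(f"{small:.0e}")  # 1e+09
```

Since the three factors multiply, growing each of them by a modest
amount pushes the total up by several orders of magnitude.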
Even NTFS does not attempt a direct mapping but uses
a tree structure and simply hopes to cache enough nodes
to make access performance acceptable. The differences
from FSFS are details of the tree representation.
> I suspect the situation in SVN FS is quite similar, and if we
> restructured the way the directory tree is represented to something
> similar to how WC-NG (or Mercurial) does it, these issues would suddenly
> become more important.
For the working copy, things are different because we
are more likely to access single items and we need
to support data changes. The latter calls for more flexible /
generic data structures than the r/o data backend where
small size can be made to equal high performance.
Storage / performance tradeoffs on the *client side* are
plausible, though.
-- Stefan^2.
Received on 2012-02-12 02:53:18 CET