On 12.02.2012 06:27, Branko Čibej wrote:
> On 12.02.2012 02:52, Stefan Fuhrmann wrote:
>
>> The silly part of FSFS is that it does not optimize access
>> paths, yet, but stores changes individually. The challenge
>> is our two-dimensional key space and the fact that different
>> operations traverse the data along different dimensions
>> (e.g. log ./. checkout).
> Interestingly enough, the 2D keyspace isn't that big a problem.
On the physical level, it becomes a problem as files
model a linear data stream, i.e. you have to find a
mapping onto a 1-dimensional structure that keeps
related / dependent information closely together.
> The real
> issue is that we don't even represent all the relevant keys, because of
> the lazy copying of subtrees. That's what actually prevents us from
> doing one-shot lookups of arbitrary path_at_rev; and even then, we'd only
> really have to do a step-by-step top-down resolve if the initial fast
> lookup failed.
How do you verify that path_at_rev even exists?
For instance, you could maintain an index of (only)
every path ever touched. As long as we want to keep
our O(1) delete operations for paths, entries in that
index cannot be updated / removed. So, we need
to always verify that there is actually a path from /@rev
to path_at_rev.
>> Question: how many entries would a direct lookup structure
>> need to have (i.e. path_at_rev -> data pointer)? Keep in
>> mind that may valid paths like /branches/foo/bar will never
>> be mentioned anywhere in a SVN repository because they
>> never got touched under that name. A rough estimation for
>> a fully expanded list of entries is
>>
>> #nodesInTrunk * #openBranches * #revisions
>>
>> This yields 10^9 entries for small repositories and>10^14
>> for KDE-sized ones. Clearly impractical.
> So that's not how you do it. :)
>
> You'd only need one reference per actual node, not per possible node
> lookup paths including revisions. Obviously you have to resolve path_at_rev
> to a concrete node before you can do anything with its attributes. With
> the caveat that there's a nasty edge case involving not-yet-lazy-copied
> child nodes; but given the way we currently crawl the tree, that's not
> really an issue.
We simply follow the tree, i.e. one lookup per parent.
With any luck most of the intermediate nodes have
already been cached, which is exactly what makes our
differential trees so efficient: as long as a sub-tree is
not being modified, all copies and future revisions
share the *same* representation.
But maybe we are simply talking past each other here.
So, could you give a short summary of what you consider
wrong / inefficient about SVN / FSFS and how you would
like to see this addressed.
-- Stefan^2.
Received on 2012-02-12 13:58:00 CET