On 12.02.2012 17:58, Daniel Shahaf wrote:
> Stefan Fuhrmann wrote on Sun, Feb 12, 2012 at 13:57:23 +0100:
>> On 12.02.2012 06:27, Branko Čibej wrote:
>>> On 12.02.2012 02:52, Stefan Fuhrmann wrote:
>>>
>>>> The silly part of FSFS is that it does not optimize access
>>>> paths, yet, but stores changes individually. The challenge
>>>> is our two-dimensional key space and the fact that different
>>>> operations traverse the data along different dimensions
>>>> (e.g. log ./. checkout).
>>> Interestingly enough, the 2D keyspace isn't that big a problem.
>> On the physical level, it becomes a problem as files
>> model a linear data stream, i.e. you have to find a
>> mapping onto a 1-dimensional structure that keeps
>> related / dependent information closely together.
>>
> So perhaps solve _that_ problem?
Mercurial has solved that problem ages ago. Their revlog files are
append-only, and their versioned tree info is stored in a revlog file.
Mercurial also happens to /not/ do lazy copying of the tree, however,
which simplifies the index enormously. Instead, the structure of the
directory index is designed to make even whole-tree change
representations reasonably compact.
(This is where one could write an essay titled "lazy copying considered
harmful" :)
The trouble about Subversion's representation of the versioned tree is
that it considers lazy copying to apply to whole nodes, and on the
logical (though not physical) level treats the node's name as an
inherent propery of said node, at least as far as lazy copying is
concerned, so we don't have a complete path_at_rev index for relevant
revisions.
The (invalid) assumption was that if we did have concrete index entries
for each path_at_rev, the index would grow too quickly and become
unmanageable. And the result is that, instead, you have to jump through
hoops every time you resolve a path_at_rev, and even bigger hoops when you
modify a child of a copied node, since you have to figure out how to
populate the directory index every time you do that instead of doing it
once during the initial copy.
I guess what I'm proposing is that we decouple the path_at_rev index from
the actual file contents data; currently directory information is stored
in the directory node's contents. Instead, we should keep the
path_at_rev->node index completely separate. That way we would probably end
up not having directory nodes at all, since the only interesting bit of
information that a directory node would contain are the properties, and
if we want to implement any kind of efficient inheritable property
scheme, those props would have to become indexed separately, if not
independently, of the node contents.
To illustrate: Currently our tree structure looks like this:
+------------+
| node (dir) |
+--+---------+--+ +-------------+
| contents-> | ---> | file->, ... |
| props | +-------------+
+------------+ |
v
+-------------+
| node (file) |
+--+----------+-+ +----------------+
| contents-> | ---> | representation |
| props | +----------------+
+------------+
What I'm proposing would essentially look like this:
Index Contents
+--------------+------------+ +---------------+
| dir_at_rev | props-> | -+-> | property, ... |
+--------------+------------+ | +---------------+
| dir/file_at_rev | props-> | / +----------------+
| | contents-> | ---> | representation |
+--------------+------------+ +----------------+
(That this example shows one possible representation of inheritable
properties, although it's a bit too simplistic to be useful in an actual
implementation.)
The idea is that we'd always maintain the complete index, i.e., in order
to determine if path_at_15 exists, one only needs to search the index for
(path, rev <= 15, !deleted) -- which is trivial in a properly ordered
index. (Yes, this assumes that we record deletions in the index as well
as insertions.)
All of this can be done with an append-only index representation. What
you can't do with append-only is represent forward history links, but --
we're not representing them very well right now, are we. :)
The other issue with an append-only index is that you essentially have
to reconstruct the whole index before you can do a lookup. A much better
structure for representing such (compressed, ordered) indexes has been
known for a while now, it's called a B-tree. ...
-- Brane
Received on 2012-02-12 21:52:58 CET