[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: FS Node API

From: Stefan Fuhrmann <stefan2_at_apache.org>
Date: Mon, 11 Jan 2016 16:31:54 +0100

On 09.01.2016 14:39, Ivan Zhakov wrote:
> On 9 January 2016 at 03:48, Stefan Fuhrmann <stefan2_at_apache.org> wrote:
>>
>> Hi Ivan & interested parties,
>>
> Hi Stefan!
>
> See my response inline.
>
>>
>> * BRANCH-README: "The branch is NOT going to be merged to
>> trunk; I'm going to re-implement everything in trunk ..."
>>
>> - Why implementing it a second time (assuming the new
>> API turns out to be useful)?
>> - Do you expect the code to be "too prototypic" to be
>> used for production? If so, any specific indication
>> for that?
>> - I think having the feature developed on a branch makes
>> sense because the switch-over to the new API will be
>> very noisy. Otherwise, it could be done on /trunk
>> directly - and be reverted again if necessary.
>>
> I don't like developing new features in branches, so I consider this
> branch as experiment to explore the FS node API idea and make
> important decisions. It's not a big problem to re-implement all this
> on trunk using multiple incremental changes with detailed log messages
> and proper docstrings.

Fair enough, I'll leave you and Evgeny to it until you
deem the new API ready for review (to decide whether
to go on /trunk with it or not).

Make sure, though, that you actually get data for the
interesting points like ease of use, effort to update
callers, efficiency with legacy backends and callers
as well as the implementation effort. We already know
that a node API is possible (backends use DAG nodes
internally, which are effectively svn_fs_node_t) and
that it should be more efficient than the current code.

>
>> * svn_fs.h
>>
>> - Maybe, move the new API functions to new section.
>> They will not always be 1:1 replacements.
>>
> Sure, but for simplicity I just revving existing functions to accept
> svn_fs_node_t instead ROOT+PATH.

Not against revving the API, I simply suggest putting
the new API functions into common section within svn_fs_t.
That way, they will be easier to review and check for
further improvements.

>> * "svn_fs_node_t does not depends on svn_fs_root object created from"
>>
>> - I'm fine with that. Any specific reason behind this
>> other than that node_t should be be self-contained
>> within fs_t and result_pool?
>>
> libsvn_repos/reporter.c open and closes svn_fs_root_t dynamically and
> it will be very challenging to rewrite to maintain dependency between
> svn_fs_root_t and svn_fs_node_t.

That is a major problem. Please make sure that
opening and closing root objects for each and
every FS API access does not render BDB unusable.

While root_t is a cheap and static object in FSFS
and FSX (even for txns), this is not the case for
BDB.

 From a "what does this mean to future backends?"
perspective, I'd be willing to say that a backend
needs to do any expensive and resource constrained
construction work in the fs_t constructor. root_t
must never be resource constrained within reason
and must be easy to construct.

>> * "svn_fs_node_t always reference existing node in tree (?)"
>>
>> - Within a transaction, nodes of type svn_node_none
>> make sense, e.g. after a deletion. Footnote: in-txn
>> nodes do have a number of update / invalidation issues,
>> which makes having separate vtables instances for
>> in-txn and in-rev nodes a good idea.
>> - I'm not too sure we should allow creating nodes for
>> any random path but I guess we must at least have
>> the option to do so b/c of symmetry.
>>
>
> There are two possible design approaches:
> 1. svn_fs_node_t is just combination of ROOT+PATH and may point to
> some never existent node.
> 2. svn_fs_node_t is always point to some existing node in tree. Yes,
> we may end up with dangling node in case of TXN root nodes, but it's
> out of scope.
>
> I decided to go with (2) at this time.

The problem with (2) that txn nodes have a much
more dynamic behavior than the static nodes in
revisions. For instance, node.set_prop will
invalidate the ID that you might store internally.

IOW, you will always have to deal with dynamic
state for txn nodes and simply invalidating upon
deletion does not simplify anything (no better
guarantees, no simpler implementation, no simpler
usage). It also make it hard to implement the
equivalent of svn_fs_check_path.

>> * svn_fs_open_node
>>
>> - Could we name this svn_fs_node or svn_fs_get_node?
>> "open" suggests that there are resources involved
>> which require an implicit and sometimes explicit
>> "close".
>>
> This is intentional name: svn_fs_open_node() actually do some work to
> check whether node exists.

If your only point here is "use a heavy name for
a heavy operation", please reconsider. I'm mainly
opposed to the "open" part because it implies
resource allocation that requires explicit management.

I'd prefer svn_fs_node (similar to svn_fs_txn_root)
for shortness but something like svn_fs_get_node
would be fine, too.

>> * svn_fs_file_length
>>
>> - I was thinking of naming that svn_fs_node_length.
>> It could error out on non-files as it does today
>> or return some informative value like the representation
>> size. Caller logic needs to handle the "not a file"
>> case anyway, so it could use the new info on its own
>> discretion.
>>
> Makes sense, but for the branch I wanted to keep API like similar as
> possible for easy switching callers.

I fully agree that switching existing callers should
be made easy.

Since we have to change the function name and signature,
all we can keep is semantics. What I am proposing is
to extend API semantics where feasible such that the
old is a strict sub-set of the new one. So, old callers
can simply use the new API but can also be simplified
to no longer care about the distinction between file
and directory nodes.

>> * svn_fs_dir_entries2
>>
>> - Again, this could simply be svn_fs_sub_nodes.
> I kept existing semantic just to making porting to new API easier. We
> can discuss this improvement later.
>
>> - The result should be an array of node_t *, sorted
>> by their names. This is what the storage layer
>> delivers in FS* and it is what the repos layer
>> would like to scan during reporting.
> I think it's completely ortogonal to FS node API, so decided to move
> this task out of scope of the branch. Btw current update reporter
> implementation uses hashes very intensively.

It is o.k. to say that it's outside the scope of the
branch but it is clearly an important part of an
improved API. After all, the whole point of the new
API is efficient tree traversal.

Constructing the hash is not cheap, neither in time
nor space, and it has additional overhead on the
user side. The only place where the update reporter
even uses O(1) hash lookups is to determine the
diff between two directories. Doing this with two
sorted arrays is just as simple - but faster.

>> Additional Goals
>> ----------------
>>
>> * Use 2-pool paradigm.
>>
> Yes, I already using 2-pool paradigm in all revv functions.

r1723900 does not.

My point here is that using the 2-pool paradigm should
be an expressed goal of the new API. Don't care too
much about what the branch does ATM.

>> * Where appropriate, make the API more orthogonal and
>> consistent (e.g. svn_fs_node_make instead of
>> svn_fs_node_make_file and svn_fs_node_make_dir).
>>
> I think it's of scope at least for experimental branch. Personally I'm
> fine with current naming and would like to avoid just flavoring
> changes.

What are "flavoring changes"?

>> * Txn nodes and revision nodes should use 2 different
>> instances of the vtable. Error handling for operations
>> that either apply to txn nodes or revision nodes only
>> can then be moved to lib_fs by checking for NULL pointers
>> in the vtable.
>>
> What is purpose of different vtables for txn and revision nodes?

First of all, I assume that the new node API needs to
provide modifying functions for txn nodes. Otherwise,
transaction code needs to switch back and forth between
new style and old-style API.

The current implementation often has the form of

   if (!is-txn-root(root))
     error;
   do_something();

A separate vtable avoids the overhead and allows for
the error handling to be done generically in lib_fs.
Moreover, txn code will need to account for the dynamic
ID mapping making it a wrapper around the actual getter:

   node_get_x(node):
     return node.noderev.x;

   node_get_x_dynamic(node):
     update_noderev(node) // a no-op once node has been
                              touched in this transaction
     return node_get_x(node)

Making this distinctions statically in structure instead
of dynamically in code is the whole point of vtables.

>> * Ensure semantic equivalence to old API and high test
>> coverage from the start
>> - Implement node_compat in terms of the old backend API
>> - Implement the old FS API in terms of the new API,
>> i.e. route all calls through node_compat.*
>> - Have #define that controls whether the old API is run
>> natively or gets diverted through node_compat.*
>> - If the backend vtable entry is NULL for an old API
>> function, divert it through node_compat.*
> I've already added '#if 0' code for that. See svn_fs_open_node().

Do you also plan to do the full double redirection
old API -> new API(compat) -> old API vtable?

That will give you full test coverage even before
touching any API callers.

>> * Implement the node API in FSFS
>> - Take node_compat and only replace the vtable calls
>> with direct function calls.
>> - Add direct addressing info (noderev IDs) to the structs.
> I don't understand this point. Why we cannot use pointer to dag_node_t
> for svn_fs_node_t implemention in FSFS?

Sorry, I might have been terse here. "direct addressing"
was meant in contrast to simply using root,path internally
as proposed for the very first step.

svn_fs_node_t should supersede dag_node_t basically
everywhere because it *is* the DAG node but with a
much more space-efficient representation - if done
right. Remember, DAG nodes a very heavy structures.

The more interesting question is what you are going
to do with operations that need DAG paths (parent_path_t).

>> * Final review and merger into /trunk
>> - Report processing as in "diff --summary" is a good
>> performance indicator.
> My first goal is to switch 'svn ls -v' to new API and see how it works.

You won't learn much there because that operation is
constrained by revprop and change list reporting
performance.

To simply see whether your code works, follow the
double redirection approach. It requires much less
code and gives better coverage.

-- Stefan^2.
Received on 2016-01-11 16:31:58 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.