Re: [PROPOSAL] WC-NG: merge NODE_DATA, WORKING_NODE and BASE_NODE into a single table (NODES)

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Thu, 02 Sep 2010 23:39:19 +0100

Erik Huelsmann wrote:
> As described by Julian earlier this month, Julian, Philip and I observed
> that the BASE_NODE, WORKING_NODE and NODE_DATA tables have many fields in
> common. Notably, by introducing the NODE_DATA table, most fields from
> BASE_NODE and WORKING_NODE already moved to a common table.
>
> The remaining fields (after switching to NODE_DATA *and* SINGLE-DB) on the
> side of WORKING_NODE are the 2 cache fields 'translated_size' and
> 'last_mod_time'. Apart from those two, there are the indexing fields wc_id,
> local_relpath and parent_relpath.
>
> In the end we're storing *lots* of bytes (wc_id, local_relpath and
> parent_relpath) to store 2 64-bit values.
>
> On the side of BASE_NODE, we end up storing dav_cache, repos_id, repos_path
> and revision. The NODE_DATA table already has the fields original_repos_id,
> original_repos_path and original_revision. When op_depth == 0, these are
> guaranteed to be empty (null), since they are for working nodes with
> copy/move source information. Renaming the three fields in NODE_DATA to
> repos_id, repos_path and revision, generalizing their use to include
> op_depth == 0 [of course nicely documented in the table docs], BASE_NODE
> would be reduced to a store of the dav_cache, translated_size and
> last_mod_time fields.
>
> By subsuming translated_size and last_mod_time into NODE_DATA, neither
> WORKING_NODE nor BASE_NODE will need to store these values anymore. This
> eliminates the entire reason for WORKING_NODE's existence. BASE_NODE then
> only stores dav_cache. Here too, it's probably more efficient (in size) to
> store dav_cache in NODE_DATA to prevent repeated storage of wc_id,
> local_relpath and parent_relpath in BASE_NODE.
>
> In addition to the eliminated storage overhead, we'd be making things a
> little less complex for ourselves: UPDATE, INSERT and DELETE queries would
> be operating only on a single table, removing the need to split updates
> across multiple statements.
>
>
> This week, I was discussing this change with Greg on IRC. We both have the
> feeling this should work out well. The proposal here is to switch
> (WORKING_NODE, NODE_DATA, BASE_NODE) into a single table --> NODES.
>
>
> Comments? Fears? Enhancements?

+1.

It would be useful if you could post the latest version of the
description of the new format. Here's a bit of introductory text I
wrote, starting with a paragraph of yours from wc-metadata.sql:

/* The NODE_DATA table describes the way working nodes are layered on top of
base nodes and on top of other working nodes, due to nested tree structure
changes. The layers are modelled using the "op_depth" column.

   An "operation depth" refers to the number of directory levels down from
   the WC root at which a tree-change operation (delete, add, copy, move,
   replace) was performed. It does NOT refer to the number of path
   components in a node's own 'local_relpath', but rather to the depth of
   one of the tree changes that affects that node.

   The tree checked out of the repository, as modified by "update", "switch"
   and "commit" post-processing, is represented by rows with op_depth=0.
   That "layer" of the NODE_DATA table corresponds to what was called the
   BASE_NODE table.

   If the WC root path is "." and already contains "./A/", and a directory
   tree "^/B" in the repository is copied to "./A/B", then rows are created
   for "./A/B" and for all the children beneath it, all with op_depth=2.

   Each path in the WC has one or more rows, each at a different "op_depth",
   depending on how many nested tree changes affect it. Rows also exist for
   paths that are not a currently visible part of the WC but were a part of
   one of the tree changes.
*/
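
To make the layering concrete, here is a rough sketch of the example in
the comment, using Python's sqlite3 module. The table layout is
deliberately minimal and hypothetical (it is not the real
wc-metadata.sql schema), the helper names are made up for illustration,
and the second copy onto "A/B/f" is an extra case I've assumed in order
to show one path with two layers.

```python
import sqlite3

def relpath_depth(relpath):
    """Number of path components; the WC root ("") has depth 0."""
    return 0 if relpath == "" else relpath.count("/") + 1

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down NODES table: only the indexing columns.
conn.execute("""
    CREATE TABLE nodes (
        wc_id          INTEGER NOT NULL,
        local_relpath  TEXT NOT NULL,
        op_depth       INTEGER NOT NULL,
        parent_relpath TEXT,
        PRIMARY KEY (wc_id, local_relpath, op_depth)
    )""")

def add_layer(paths, op_root=None):
    """Insert rows for 'paths'; op_depth is the depth of the operation
    root, or 0 for the base (checked-out) tree."""
    depth = 0 if op_root is None else relpath_depth(op_root)
    for p in paths:
        parent = None if p == "" else p.rpartition("/")[0]
        conn.execute("INSERT INTO nodes VALUES (1, ?, ?, ?)",
                     (p, depth, parent))

# Base tree: WC root "." and "./A", both at op_depth 0.
add_layer(["", "A"])
# Copy ^/B to ./A/B: the root of the copy and its children get op_depth 2.
add_layer(["A/B", "A/B/f"], op_root="A/B")
# Assumed extra step: another copy lands on ./A/B/f, adding a layer at 3.
add_layer(["A/B/f"], op_root="A/B/f")

def layers(relpath):
    """All op_depths at which 'relpath' has a row, lowest first."""
    cur = conn.execute(
        "SELECT op_depth FROM nodes"
        " WHERE wc_id = 1 AND local_relpath = ? ORDER BY op_depth",
        (relpath,))
    return [row[0] for row in cur]

def top_layer(relpath):
    """The highest op_depth recorded for 'relpath'."""
    return conn.execute(
        "SELECT MAX(op_depth) FROM nodes"
        " WHERE wc_id = 1 AND local_relpath = ?",
        (relpath,)).fetchone()[0]
```

Note that "A/B/f" has three path components but its lower row sits at
op_depth 2, because that row belongs to the copy rooted at "A/B".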

Unifying the node-rev reference columns (repos_id, repos_relpath,
revision) across all layers makes perfect sense to me. Instead of
thinking of "based on this node-rev" in BASE being different from
"copied from this node-rev" in WORKING, we will treat both of them as
"the node in this layer reflects (is a copy of) this repository
node-rev".
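
As a rough illustration of that unified reading, the sketch below (again
Python's sqlite3, with a hypothetical cut-down table and made-up sample
rows, repository paths and revisions) uses one query to fetch the
node-rev for any layer. Nothing in the query depends on whether the row
is a base row or a copy; the plain-add row with NULLs is my assumption
about how a node with no repository source would look.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down NODES table with the unified node-rev columns.
conn.execute("""
    CREATE TABLE nodes (
        wc_id         INTEGER NOT NULL,
        local_relpath TEXT NOT NULL,
        op_depth      INTEGER NOT NULL,
        repos_id      INTEGER,
        repos_relpath TEXT,
        revision      INTEGER,
        PRIMARY KEY (wc_id, local_relpath, op_depth)
    )""")

conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "A",     0, 1,    "A",  10),    # base row: based on ^/A@10
    (1, "A/B",   2, 1,    "B",  7),     # copy row: copied from ^/B@7
    (1, "A/new", 2, None, None, None),  # assumed plain add: no node-rev
])

def node_rev(relpath, op_depth):
    """Return (repos_id, repos_relpath, revision) for one layer of a
    path, or None if that layer does not reflect any repository node."""
    row = conn.execute(
        "SELECT repos_id, repos_relpath, revision FROM nodes"
        " WHERE wc_id = 1 AND local_relpath = ? AND op_depth = ?",
        (relpath, op_depth)).fetchone()
    if row is None or row[0] is None:
        return None
    return row
```

The same call, node_rev("A", 0) or node_rev("A/B", 2), serves both the
old "based on" and "copied from" cases.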

As for the three columns holding cached values, it seems a bit impure
but pragmatically OK to move them into NODES, and certainly it would be
unhelpful to keep the existing BASE_NODE and WORKING_NODE tables with
their present names and only those columns in them.

We need to describe how the layering works for copies, deletes, and
adds. In particular I'm recalling something about how local adds aren't
recursive, unlike copies, so an additional change within an added dir
doesn't work the same way with regard to op_depth as it would inside a
copied dir.

- Julian
Received on 2010-09-03 00:40:03 CEST
