[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Thu, 13 Sep 2012 23:27:20 +0100 (BST)

Bert Huijben wrote:

>> From: Julian Foad [mailto:julianfoad_at_btopenworld.com]
>>
>> This branch caches properties of certain repository nodes, in a new place
>> in the WC DB.
>>
>> I haven't examined it at all yet.  The Wiki page says: "The cache will be
>> stored in a new BLOB column in the NODES table called 'inherited_props'. 
>> If the node is the base node of a WC root then this column is populated
>> with a serialized skel of the node's inherited properties.  In all other
>> cases it is NULL."
>>
>> Another new special column in the NODES table... the twenty-third
>> column.  Do we have to?  Is it not complex enough already?
>>
>> I want to make a radical suggestion.
>>
>> What would it take to be able to share some data instead of creating a
>> special-purpose column just for this cache?  Well, it would take more work
>> initially, but hear me out, as I have a hunch it may make less work in the
>> end.
>>
>> I think we would do well to factor out the storage of all data about a
>> repository node-rev, moving that into a separate table that is
>> *referenced* from the NODES table.  That new table would be indexed by
>> node-rev (that is: repos_id, rev,
>> relpath) and have a column for the properties and perhaps other columns
>> for other attributes of the node-rev.
>
> That would be two additional table lookups per retrieval inside Sqlite3,
> [...] Have you investigated the performance costs of this proposal?

No.

> I heard similar suggestions about properties before, but additional table
> lookups are certainly not free in Sqlite. They are not expensive by itself,
> but every extra db transaction is certainly a performance hit.
>
>> What we need to know here (for inherited props) is the properties set on
>> each ancestor of ever WC "root node" (which is defined elsewhere and
>> includes switched nodes etc.)  That's not a complex requirement.  Instead
>> of creating a new column dedicated to storing the union of all the props
>> inherited from every ancestor node, how about we fetch and concatenate,
>> on request, the node-rev table rows for all the pathwise ancestors of the
>> desired node.  The cache maintenance then becomes ensuring that a set of
>> node-rev table rows are populated (populated in the standard way for such
>> rows), and that those rows are removed when no longer referenced.
>>
>> But this particular use is not the only reason for factoring out this data
>> into a separate table.  There is:
>>
>>   - For inherited props, this use.
>>
>>   - For inherited props, if the current model of BASE node inheritance
>> remains, we're also going to need to cache the inherited props for every
>> mixed-rev node (that is, every node whose rev != its parent-in-BASE
>> rev).  (I'll write separately about why we need this for off-line
>> operation.)
>
> I think this case was explicitly ignored.
>
> This would require that operations like commit start updating parts of the
> working copy with information that is not available locally.

Yes.  That's not a problem, is it?

> Normally the commit shouldn't be able to introduce invalid inherited state
> (or the commit would fail).

Pardon?  I don't understand what you're saying there.

> And an explicit update of a subtree already adds the necessary information
> to the root (if the current state of the code still matches what I read
> before)
>
>>   - We already have (I think) the case where a local WC-to-WC copy or move
>> of a tree results in poulating another set of tree nodes in the NODES table
>> with an identical copy of all the data.  Whenever we can reduce the amount
>> of such copying, if the additional complexity is not too great, that can
>> make it easier to achieve both correctness and efficiency.
>
> When we copy from BASE to another layer we don't have to copy the inherited
> properties. The inheritance in the new location will be based on the new
> url, not the original location. So the current queries already handle this
> case by not copying the data from the BASE layer.

I'm talking there about regular, explicit properties, not inherited properties.

>>   - For conflicts, it's useful to store the "theirs" and "old base"
>> versions of the conflicted node.  Its kind, its properties, the reference
>> to its (pristine) text, and perhaps other attributes.
>
> On trunk everything is prepared to store conflicts in a skel, what has this
> to do with the inherited property storage?

It has nothing to do with inherited properties except that inherited properties and conflicts could both refer to this table of cached node-revs.

> In format 30 we already have the urls and properties of the conflicted nodes
> (not the pristine text yet, but with the url we can retrieve that or we can
> implement storing the sha1 references in the separate columns as initially
> planned)

If we go this way, then I would suggest we migrate the conflict storage skel to use (repos_id, rev, relpath) references to node-rev rows instead of storing the props and other bits of metadata explicitly.

>>   - I think it would make the WC *much* more maintainable.  At the moment,
>> there doesn't appear to be any codified connection between those eight
>> columns in NODES.  There doesn't even seem to be any comment in the doc
>> strings to indicate that.  It's almost impossible to analyze the huge
>> query statements to check that they're all updated together.
>>
>> Therefore I propose a WC development that adds a new table (call it
>> NODEREV?), indexed by repos/rev/relpath, with eight further columns that
>> will be moved out of the current NODES table:
>>
>>   repos_id, revision, repos_path -- the index columns
>>   kind
>>   properties
>>   checksum
>>   symlink_target
>>   changed_revision
>>   changed_date
>>   changed_author
>>   translated_size
>
> (translated_size or really recorded_size shouldn't be in this list as it can
> have a different value for the same file if it is checked out multiple
> times)

Oh, how can that be?  Oh, you mean the documentation is wrong?  Well, fancy that.

>> Then we add an indirection from the NODES table via the index columns
>> whenever we access those columns.  And some means of controlling
>> creation and deletion of those rows, perhaps using a ref-count.
>
> Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries and
> even more for cascading updates to provider better maintainability, while a
> typical checkout working copy only contains nodes once?
>
> I did a lot of performance work to get at the overall 15% speedup we have
> since 1.7. Do you have any ideas on how much of that we will lose after this
> suggested refactoring?

No, I have no idea.  Guess... guess... 0.1%?

Seriously, please let me put out a design refactoring proposal without beating me down with the "slowness" argument.  I know speed is important, but at the moment I'm concerned WC maintenance is soon going to die.  To me it looks just as bad (in terms of undocumented complexity) as WC-1.0.  It's a good effort, don't get me wrong -- I applaud the success of the WC rewrite and the work you've put in to make it fast as well as correct.  That's all great, but I occasionally look at the code and the (non-)documentation and close the file again and leave it to someone else.

If the idea is rubbish because it has unfeasible performance, then fine, we won't do it.  I'm sure you know much more than I do about WC DB performance, so maybe you're right.

I *know* you and others are concerned about "performance" (I put it in
quotes because it's referring to just one kind of performance: that is,
speed of performing a given operation).  Sure that kind of performance is important, I don't deny it, and I don't want to needlessly reduce it either.

But how about recognizing that there's some value to my suggestion and seeking a way to keep the code-organization value without hitting the performance penalty, instead of just hitting me with "doing that will add some run-time cost" implying that's so wrong it doesn'y bear thinking about.

Maybe we could make a tiny but  constructive start by adding a note in the documentation that those columns all belong "together" in the sense of being attributes of a node-rev, and maybe use some C preprocessor macros to start handling them all together where it makes sense, instead of leaving developers to try to remember which eight of the twenty-three arguments to a SELECT statement or a 'read_info' call should all be null or all be non-null together.

> I'm no fan of adding yet another layer in the working copy to just
> theoretically improve maintainability. That layer will also require
> maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
> we are slowly moving to a decentralized DVCS and maybe the storage should
> then also move to such a model.

Sorry, I'm not following.

>> I acknowledge that such a development is far from trivial, and I don't
>> expect Paul to do it as part of this work.  I just want to get the
>> proposal out there and see if there is any consensus for working toward
>> something like it.
>>
>> In terms of the inheritable props branch, I wonder if there's any way
>> we can move a bit closer to this proposal without going the whole hog,
>> but I can't think of anything.
>
> I don't think this proposal, the inherited properties and/or the
> conflicts-skels proposals are directly related.
>
>
> And before we look at this or similar refactorings I think we should measure
> the impact on some 'simple' case like 'svn status', that would
> have at least
> an additional lookup for every layer of every node for the foreign key
> handling.
>
>     Bert

- Julian
Received on 2012-09-14 00:27:56 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.