Bert: Sorry for getting emotionally worked up in my last response. I know you're only expressing your concern about maintaining speed, and that's fine.
- Julian
----- Original Message -----
> From: Julian Foad <julianfoad_at_btopenworld.com>
> To: Bert Huijben <bert_at_qqmail.nl>
> Cc: 'Paul Burba' <ptburba_at_gmail.com>; 'Subversion Development' <dev_at_subversion.apache.org>
> Sent: Thursday, 13 September 2012, 18:27
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
>
> Bert Huijben wrote:
>
>>> From: Julian Foad [mailto:julianfoad_at_btopenworld.com]
>>>
>>> This branch caches properties of certain repository nodes, in a new
> place
>>> in the WC DB.
>>>
>>> I haven't examined it at all yet. The Wiki page says: "The
> cache will be
>>> stored in a new BLOB column in the NODES table called
> 'inherited_props'.
>>> If the node is the base node of a WC root then this column is
> populated
>>> with a serialized skel of the node's inherited properties. In all
> other
>>> cases it is NULL."
>>>
>>> Another new special column in the NODES table... the twenty-third
>>> column. Do we have to? Is it not complex enough already?
>>>
>>> I want to make a radical suggestion.
>>>
>>> What would it take to be able to share some data instead of creating a
>>> special-purpose column just for this cache? Well, it would take more
> work
>>> initially, but hear me out, as I have a hunch it may make less work in
> the
>>> end.
>>>
>>> I think we would do well to factor out the storage of all data about a
>>> repository node-rev, moving that into a separate table that is
>>> *referenced* from the NODES table. That new table would be indexed by
>>> node-rev (that is: repos_id, rev,
>>> relpath) and have a column for the properties and perhaps other
> columns
>>> for other attributes of the node-rev.
>>
>> That would be two additional table lookups per retrieval inside Sqlite3,
>> [...] Have you investigated the performance costs of this proposal?
>
> No.
>
>> I heard similar suggestions about properties before, but additional table
>> lookups are certainly not free in Sqlite. They are not expensive by itself,
>> but every extra db transaction is certainly a performance hit.
>>
>>> What we need to know here (for inherited props) is the properties set
> on
>>> each ancestor of ever WC "root node" (which is defined
> elsewhere and
>>> includes switched nodes etc.) That's not a complex requirement.
> Instead
>>> of creating a new column dedicated to storing the union of all the
> props
>>> inherited from every ancestor node, how about we fetch and
> concatenate,
>>> on request, the node-rev table rows for all the pathwise ancestors of
> the
>>> desired node. The cache maintenance then becomes ensuring that a set
> of
>>> node-rev table rows are populated (populated in the standard way for
> such
>>> rows), and that those rows are removed when no longer referenced.
>>>
>>> But this particular use is not the only reason for factoring out this
> data
>>> into a separate table. There is:
>>>
>>> - For inherited props, this use.
>>>
>>> - For inherited props, if the current model of BASE node inheritance
>>> remains, we're also going to need to cache the inherited props for
> every
>>> mixed-rev node (that is, every node whose rev != its parent-in-BASE
>>> rev). (I'll write separately about why we need this for off-line
>>> operation.)
>>
>> I think this case was explicitly ignored.
>>
>> This would require that operations like commit start updating parts of the
>> working copy with information that is not available locally.
>
> Yes. That's not a problem, is it?
>
>> Normally the commit shouldn't be able to introduce invalid inherited
> state
>> (or the commit would fail).
>
> Pardon? I don't understand what you're saying there.
>
>> And an explicit update of a subtree already adds the necessary information
>> to the root (if the current state of the code still matches what I read
>> before)
>>
>>> - We already have (I think) the case where a local WC-to-WC copy or
> move
>>> of a tree results in poulating another set of tree nodes in the NODES
> table
>>> with an identical copy of all the data. Whenever we can reduce the
> amount
>>> of such copying, if the additional complexity is not too great, that
> can
>>> make it easier to achieve both correctness and efficiency.
>>
>> When we copy from BASE to another layer we don't have to copy the
> inherited
>> properties. The inheritance in the new location will be based on the new
>> url, not the original location. So the current queries already handle this
>> case by not copying the data from the BASE layer.
>
> I'm talking there about regular, explicit properties, not inherited
> properties.
>
>
>>> - For conflicts, it's useful to store the "theirs" and
> "old base"
>>> versions of the conflicted node. Its kind, its properties, the
> reference
>>> to its (pristine) text, and perhaps other attributes.
>>
>> On trunk everything is prepared to store conflicts in a skel, what has this
>> to do with the inherited property storage?
>
> It has nothing to do with inherited properties except that inherited properties
> and conflicts could both refer to this table of cached node-revs.
>
>> In format 30 we already have the urls and properties of the conflicted
> nodes
>> (not the pristine text yet, but with the url we can retrieve that or we can
>> implement storing the sha1 references in the separate columns as initially
>> planned)
>
> If we go this way, then I would suggest we migrate the conflict storage skel to
> use (repos_id, rev, relpath) references to node-rev rows instead of storing the
> props and other bits of metadata explicitly.
>
>
>>> - I think it would make the WC *much* more maintainable. At the
> moment,
>>> there doesn't appear to be any codified connection between those
> eight
>>> columns in NODES. There doesn't even seem to be any comment in
> the doc
>>> strings to indicate that. It's almost impossible to analyze the
> huge
>>> query statements to check that they're all updated together.
>>>
>>> Therefore I propose a WC development that adds a new table (call it
>>> NODEREV?), indexed by repos/rev/relpath, with eight further columns
> that
>>> will be moved out of the current NODES table:
>>>
>>> repos_id, revision, repos_path -- the index columns
>>> kind
>>> properties
>>> checksum
>>> symlink_target
>>> changed_revision
>>> changed_date
>>> changed_author
>>> translated_size
>>
>> (translated_size or really recorded_size shouldn't be in this list as
> it can
>> have a different value for the same file if it is checked out multiple
>> times)
>
> Oh, how can that be? Oh, you mean the documentation is wrong? Well, fancy
> that.
>
>
>>> Then we add an indirection from the NODES table via the index columns
>>> whenever we access those columns. And some means of controlling
>>> creation and deletion of those rows, perhaps using a ref-count.
>>
>> Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries
> and
>> even more for cascading updates to provider better maintainability, while a
>> typical checkout working copy only contains nodes once?
>>
>> I did a lot of performance work to get at the overall 15% speedup we have
>> since 1.7. Do you have any ideas on how much of that we will lose after
> this
>> suggested refactoring?
>
> No, I have no idea. Guess... guess... 0.1%?
>
> Seriously, please let me put out a design refactoring proposal without beating
> me down with the "slowness" argument. I know speed is important, but
> at the moment I'm concerned WC maintenance is soon going to die. To me it
> looks just as bad (in terms of undocumented complexity) as WC-1.0. It's a
> good effort, don't get me wrong -- I applaud the success of the WC rewrite
> and the work you've put in to make it fast as well as correct. That's
> all great, but I occasionally look at the code and the (non-)documentation and
> close the file again and leave it to someone else.
>
> If the idea is rubbish because it has unfeasible performance, then fine, we
> won't do it. I'm sure you know much more than I do about WC DB
> performance, so maybe you're right.
>
> I *know* you and others are concerned about "performance" (I put it in
>
> quotes because it's referring to just one kind of performance: that is,
> speed of performing a given operation). Sure that kind of performance is
> important, I don't deny it, and I don't want to needlessly reduce it
> either.
>
> But how about recognizing that there's some value to my suggestion and
> seeking a way to keep the code-organization value without hitting the
> performance penalty, instead of just hitting me with "doing that will add
> some run-time cost" implying that's so wrong it doesn'y bear
> thinking about.
>
> Maybe we could make a tiny but constructive start by adding a note in the
> documentation that those columns all belong "together" in the sense of
> being attributes of a node-rev, and maybe use some C preprocessor macros to
> start handling them all together where it makes sense, instead of leaving
> developers to try to remember which eight of the twenty-three arguments to a
> SELECT statement or a 'read_info' call should all be null or all be
> non-null together.
>
>
>> I'm no fan of adding yet another layer in the working copy to just
>> theoretically improve maintainability. That layer will also require
>> maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
>> we are slowly moving to a decentralized DVCS and maybe the storage should
>> then also move to such a model.
>
> Sorry, I'm not following.
>
>>> I acknowledge that such a development is far from trivial, and I
> don't
>>> expect Paul to do it as part of this work. I just want to get the
>>> proposal out there and see if there is any consensus for working
> toward
>>> something like it.
>>>
>>> In terms of the inheritable props branch, I wonder if there's any
> way
>>> we can move a bit closer to this proposal without going the whole hog,
>>> but I can't think of anything.
>>
>> I don't think this proposal, the inherited properties and/or the
>> conflicts-skels proposals are directly related.
>>
>>
>> And before we look at this or similar refactorings I think we should
> measure
>> the impact on some 'simple' case like 'svn status', that
> would
>> have at least
>> an additional lookup for every layer of every node for the foreign key
>> handling.
>>
>> Bert
>
> - Julian
>
Received on 2012-09-14 00:50:30 CEST