[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Thu, 13 Sep 2012 23:49:55 +0100 (BST)

Bert: Sorry for getting emotionally worked up in my last response.  I know you're only expressing your concern about maintaining speed, and that's fine.

- Julian

----- Original Message -----
> From: Julian Foad <julianfoad_at_btopenworld.com>
> To: Bert Huijben <bert_at_qqmail.nl>
> Cc: 'Paul Burba' <ptburba_at_gmail.com>; 'Subversion Development' <dev_at_subversion.apache.org>
> Sent: Thursday, 13 September 2012, 18:27
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
>
> Bert Huijben wrote:
>
>>>   From: Julian Foad [mailto:julianfoad_at_btopenworld.com]
>>>
>>>   This branch caches properties of certain repository nodes, in a new
> place
>>> in  the WC DB.
>>>
>>>   I haven't examined it at all yet.  The Wiki page says: "The
> cache will be
>>> stored  in a new BLOB column in the NODES table called
> 'inherited_props'. 
>>> If the  node is the base node of a WC root then this column is
> populated
>>> with a  serialized skel of the node's inherited properties.  In all
> other
>>> cases  it is  NULL."
>>>
>>>   Another new special column in the NODES table... the twenty-third
>>>   column.  Do we have to?  Is it not complex enough already?
>>>
>>>   I want to make a radical suggestion.
>>>
>>>   What would it take to be able to share some data instead of creating a
>>>   special-purpose column just for this cache?  Well, it would take more
> work
>>>   initially, but hear me out, as I have a hunch it may make less work in
> the
>>> end.
>>>
>>>   I think we would do well to factor out the storage of all data about a
>>>   repository node-rev, moving that into a separate table that is
>>> *referenced*  from the NODES table.  That new table would be indexed by
>>> node-rev (that  is: repos_id, rev,
>>>   relpath) and have a column for the properties and perhaps other
> columns
>>>   for other attributes of the node-rev.
>>
>> That would be two additional table lookups per retrieval inside Sqlite3,
>> [...] Have you investigated the performance costs of this proposal?
>
> No.
>
>> I heard similar suggestions about properties before, but additional table
>> lookups are certainly not free in Sqlite. They are not expensive by itself,
>> but every extra db transaction is certainly a performance hit.
>>
>>>   What we need to know here (for inherited props) is the properties set
> on
>>>   each ancestor of ever WC "root node" (which is defined
> elsewhere and
>>>   includes switched nodes etc.)  That's not a complex requirement. 
> Instead
>>> of  creating a new column dedicated to storing the union of all the
> props
>>>   inherited from every ancestor node, how about we fetch and
> concatenate,
>>>   on request, the node-rev table rows for all the pathwise ancestors of
> the
>>>   desired node.  The cache maintenance then becomes ensuring that a set
> of
>>>   node-rev table rows are populated (populated in the standard way for
> such
>>>   rows), and that those rows are removed when no longer referenced.
>>>
>>>   But this particular use is not the only reason for factoring out this
> data
>>> into a  separate table.  There is:
>>>
>>>     - For inherited props, this use.
>>>
>>>     - For inherited props, if the current model of BASE node inheritance
>>>   remains, we're also going to need to cache the inherited props for
> every
>>>   mixed-rev node (that is, every node whose rev != its parent-in-BASE
>>>   rev).  (I'll write separately about why we need this for off-line
>>> operation.)
>>
>> I think this case was explicitly ignored.
>>
>> This would require that operations like commit start updating parts of the
>> working copy with information that is not available locally.
>
> Yes.  That's not a problem, is it?
>
>> Normally the commit shouldn't be able to introduce invalid inherited
> state
>> (or the commit would fail).
>
> Pardon?  I don't understand what you're saying there.
>
>> And an explicit update of a subtree already adds the necessary information
>> to the root (if the current state of the code still matches what I read
>> before)
>>
>>>     - We already have (I think) the case where a local WC-to-WC copy or
> move
>>>   of a tree results in poulating another set of tree nodes in the NODES
> table
>>>   with an identical copy of all the data.  Whenever we can reduce the
> amount
>>>   of such copying, if the additional complexity is not too great, that
> can
>>> make it  easier to achieve both correctness and efficiency.
>>
>> When we copy from BASE to another layer we don't have to copy the
> inherited
>> properties. The inheritance in the new location will be based on the new
>> url, not the original location. So the current queries already handle this
>> case by not copying the data from the BASE layer.
>
> I'm talking there about regular, explicit properties, not inherited
> properties.
>
>
>>>     - For conflicts, it's useful to store the "theirs" and
> "old base"
>>>   versions of the conflicted node.  Its kind, its properties, the
> reference
>>> to its  (pristine) text, and perhaps other attributes.
>>
>> On trunk everything is prepared to store conflicts in a skel, what has this
>> to do with the inherited property storage?
>
> It has nothing to do with inherited properties except that inherited properties
> and conflicts could both refer to this table of cached node-revs.
>
>> In format 30 we already have the urls and properties of the conflicted
> nodes
>> (not the pristine text yet, but with the url we can retrieve that or we can
>> implement storing the sha1 references in the separate columns as initially
>> planned)
>
> If we go this way, then I would suggest we migrate the conflict storage skel to
> use (repos_id, rev, relpath) references to node-rev rows instead of storing the
> props and other bits of metadata explicitly.
>
>
>>>     - I think it would make the WC *much* more maintainable.  At the
> moment,
>>>   there doesn't appear to be any codified connection between those
> eight
>>>   columns in NODES.  There doesn't even seem to be any comment in
> the doc
>>>   strings to indicate that.  It's almost impossible to analyze the
> huge
>>> query  statements to check that they're all updated together.
>>>
>>>   Therefore I propose a WC development that adds a new table (call it
>>>   NODEREV?), indexed by repos/rev/relpath, with eight further columns
> that
>>>   will be moved out of the current NODES table:
>>>
>>>     repos_id, revision, repos_path -- the index columns
>>>     kind
>>>     properties
>>>     checksum
>>>     symlink_target
>>>     changed_revision
>>>     changed_date
>>>     changed_author
>>>     translated_size
>>
>> (translated_size or really recorded_size shouldn't be in this list as
> it can
>> have a different value for the same file if it is checked out multiple
>> times)
>
> Oh, how can that be?  Oh, you mean the documentation is wrong?  Well, fancy
> that.
>
>
>>>   Then we add an indirection from the NODES table via the index columns
>>>   whenever we access those columns.  And some means of controlling
>>>   creation and deletion of those rows, perhaps using a ref-count.
>>
>> Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries
> and
>> even more for cascading updates to provider better maintainability, while a
>> typical checkout working copy only contains nodes once?
>>
>> I did a lot of performance work to get at the overall 15% speedup we have
>> since 1.7. Do you have any ideas on how much of that we will lose after
> this
>> suggested refactoring?
>
> No, I have no idea.  Guess... guess... 0.1%?
>
> Seriously, please let me put out a design refactoring proposal without beating
> me down with the "slowness" argument.  I know speed is important, but
> at the moment I'm concerned WC maintenance is soon going to die.  To me it
> looks just as bad (in terms of undocumented complexity) as WC-1.0.  It's a
> good effort, don't get me wrong -- I applaud the success of the WC rewrite
> and the work you've put in to make it fast as well as correct.  That's
> all great, but I occasionally look at the code and the (non-)documentation and
> close the file again and leave it to someone else.
>
> If the idea is rubbish because it has unfeasible performance, then fine, we
> won't do it.  I'm sure you know much more than I do about WC DB
> performance, so maybe you're right.
>
> I *know* you and others are concerned about "performance" (I put it in
>
> quotes because it's referring to just one kind of performance: that is,
> speed of performing a given operation).  Sure that kind of performance is
> important, I don't deny it, and I don't want to needlessly reduce it
> either.
>
> But how about recognizing that there's some value to my suggestion and
> seeking a way to keep the code-organization value without hitting the
> performance penalty, instead of just hitting me with "doing that will add
> some run-time cost" implying that's so wrong it doesn'y bear
> thinking about.
>
> Maybe we could make a tiny but  constructive start by adding a note in the
> documentation that those columns all belong "together" in the sense of
> being attributes of a node-rev, and maybe use some C preprocessor macros to
> start handling them all together where it makes sense, instead of leaving
> developers to try to remember which eight of the twenty-three arguments to a
> SELECT statement or a 'read_info' call should all be null or all be
> non-null together.
>
>
>> I'm no fan of adding yet another layer in the working copy to just
>> theoretically improve maintainability. That layer will also require
>> maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
>> we are slowly moving to a decentralized DVCS and maybe the storage should
>> then also move to such a model.
>
> Sorry, I'm not following.
>
>>>   I acknowledge that such a development is far from trivial, and I
> don't
>>> expect  Paul to do it as part of this work.  I just want to get the
>>> proposal out there  and see if there is any consensus for working
> toward
>>> something like it.
>>>
>>>   In terms of the inheritable props branch, I wonder if there's any
> way
>>> we can  move a bit closer to this proposal without going the whole hog,
>>> but I can't  think of anything.
>>
>> I don't think this proposal, the inherited properties and/or the
>> conflicts-skels proposals are directly related.
>>
>>
>> And before we look at this or similar refactorings I think we should
> measure
>> the impact on some 'simple' case like 'svn status', that
> would
>> have at least
>> an additional lookup for every layer of every node for the foreign key
>> handling.
>>
>>     Bert
>
> - Julian
>
Received on 2012-09-14 00:50:30 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.