On Tue, Sep 28, 2010 at 08:46, Philip Martin <philip.martin_at_wandisco.com> wrote:
> Julian Foad <julian.foad_at_wandisco.com> writes:
>
>>> I believe the answer is "often". A simple 'svn status' will need to
>>> distinguish between 'A' and 'R', and that is done by checking
>>> prior_deleted.
>>>
>>> And no... this amount of denormalization will not hurt us.
>>
>> OK. I know that "svn status" speed is a big deal.
>
> I'm not sure prior_deleted is an optimisation. When I last looked at
> the performance of SQLite I concluded that status would be too slow if
> it ran per-node queries, it has to run per-dir queries. So
>
> SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2
>
> is too slow, we need to run
>
> SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2
>
> iterate over the rows caching the data and then generate the status
> for each child.

Great analysis below. I'll take your word for it :-)

prior_deleted is effectively something like base_shadowed, and yeah...
if we're iterating over all nodes within a parent, then it can be
generated "simply".
To do this, it seems that we're going to need to expose (from wc_db.h)
a structure containing "all" the row data. Thankfully, this big ol'
structure is *private* to libsvn_wc, and we can change it at will
(unlike svn_wc_status_t). I would suggest that we use a callback --
wc_db can step through each row of the result, fill in the structure,
and execute a callback (clearing an iterpool on each "step" with the
cursor).
The one tricky part of the callback approach is eliding all but the
highest op_depth. Does an ORDER BY in the query affect its runtime at all?
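One way to answer that question empirically is SQLite's EXPLAIN QUERY PLAN, which shows whether an ORDER BY is satisfied by index order or forces a temp B-tree sort. A minimal sketch, using an illustrative NODES-like table (not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE NODES (
                  wc_id INTEGER,
                  local_relpath TEXT,
                  op_depth INTEGER,
                  parent_relpath TEXT,
                  PRIMARY KEY (wc_id, local_relpath, op_depth))""")

# EXPLAIN QUERY PLAN rows describe the chosen strategy; look for
# "USE TEMP B-TREE FOR ORDER BY" in the detail column to see whether
# the sort costs anything beyond the index walk.
plan = conn.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT * FROM NODES WHERE wc_id = 1 AND parent_relpath = ?"
    " ORDER BY local_relpath, op_depth",
    ("trunk",)).fetchall()
for row in plan:
    print(row)
```

The exact plan text varies by SQLite version, but whether a sort step appears is the thing to check.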
>...
> but that nested SELECT is expensive. It's not as bad as doing
> per-node queries but it is still too slow, the database query alone is
> slower than 1.6 status. I don't think this is something that can be
> fixed using an index because on my test data the runtime generally
> goes up by orders of magnitude when the indexing is wrong.
Yeah... the more indexes present, the more that need to be maintained
when rows are modified.
> I can get round this by querying for all op_depth and using the cache
> to select the highest. The cache is a hash, keyed on local_relpath,
> that stores the data associated with the highest op_depth seen and it
> simply discards/overwrites lower op_depth. I've prototyped this and
> it's as fast as the per-dir BASE_NODE query on my test data. This is
> not surprising since my test data consists of mostly single op_depth
> with a few double op_depth. We have to have good performance on such
> working copies because they are the most common, I think it will be
> unusual to have a large working copy where lots of nodes have a large
> number of op_depth.

If we order by local_relpath, then the "cache" is simply one row
(hanging out in a structure, waiting to be passed to that callback).

Agreed.
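The one-row "cache" idea can be sketched as follows. Assuming the query orders by (local_relpath, op_depth), all rows for a path arrive consecutively with the highest op_depth last, so a single pending row suffices; the tuples here are illustrative data, not real wc_db rows:

```python
# Rows as they would arrive from a cursor ordered by
# (local_relpath, op_depth) -- illustrative data only.
rows = [
    ("trunk/a.c", 0, "normal"),
    ("trunk/a.c", 2, "added"),   # higher op_depth shadows the row above
    ("trunk/b.c", 0, "normal"),
]

def walk_status(rows, callback):
    """Invoke callback once per path with its highest-op_depth row."""
    pending = None
    for path, op_depth, presence in rows:
        if pending is not None and pending[0] != path:
            callback(*pending)   # path changed: emit the previous one
        # Same path with higher op_depth simply overwrites the cache.
        pending = (path, op_depth, presence)
    if pending is not None:
        callback(*pending)       # flush the final path

out = []
walk_status(rows, lambda *r: out.append(r))
```

In the C version the "pending" slot would be the big row structure, with an iterpool cleared on each callback step.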
> Now to prior_deleted. The fundamental status query looks like
>
> SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2
>
> and all op_depth are seen. It's quite simple to cache the two highest
> op_depth so prior_deleted only provides information that is already
> available. It's not an optimisation for status, if anything it will
> make the database bigger and slower.
Again, thanks for the analysis. Okay... let's skip it!
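To make the "cache the two highest op_depth" point concrete: if the scan already sees every op_depth layer for a path, whether anything lies beneath the topmost add is enough to tell a replacement from a plain add, so a stored prior_deleted flag adds nothing. A hypothetical sketch (the presence strings and status letters are illustrative, not Subversion's actual values):

```python
def status_letter(layers):
    """Derive an 'A' vs 'R' status from a path's op_depth layers.

    layers: list of (op_depth, presence) tuples, in any order.
    Keeping only the two highest op_depth rows is sufficient.
    """
    top2 = sorted(layers, reverse=True)[:2]
    (_, top_presence), *rest = top2
    if top_presence == "added":
        # An add shadowing an existing lower layer is a replacement.
        return "R" if rest else "A"
    return " "  # unmodified (illustrative)
```

Usage: `status_letter([(0, "normal"), (2, "added")])` yields "R", while `status_letter([(2, "added")])` yields "A", without any denormalized prior_deleted column.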
>...
Cheers,
-g
Received on 2010-09-28 17:31:06 CEST