Re: add NODES.prior_deleted boolean column

From: Greg Stein <gstein_at_gmail.com>
Date: Tue, 28 Sep 2010 11:30:27 -0400

On Tue, Sep 28, 2010 at 08:46, Philip Martin <philip.martin_at_wandisco.com> wrote:
> Julian Foad <julian.foad_at_wandisco.com> writes:
>
>>> I believe the answer is "often". A simple 'svn status' will need to
>>> distinguish between 'A' and 'R', and that is done by checking
>>> prior_deleted.
>>>
>>> And no... this amount of denormalization will not hurt us.
>>
>> OK. I know that "svn status" speed is a big deal.
>
> I'm not sure prior_deleted is an optimisation. When I last looked at

Great analysis below. I'll take your word for it :-)

prior_deleted is effectively something like base_shadowed, and yeah...
if we're iterating over all nodes within a parent, then it can be
generated "simply".

> the performance of SQLite I concluded that status would be too slow if
> it ran per-node queries, it has to run per-dir queries. So
>
> SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2
>
> is too slow, we need to run
>
> SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2
>
> iterate over the rows caching the data and then generate the status
> for each child.

To do this, it seems that we're going to need to expose (from wc_db.h)
a structure containing "all" the row data. Thankfully, this big ol'
structure is *private* to libsvn_wc, and we can change it at will
(unlike svn_wc_status_t). I would suggest that we use a callback --
wc_db can step through each row of the result, fill in the structure,
and execute a callback (clearing an iterpool on each "step" with the
cursor).

The one tricky part to a callback is eliding all but the highest
op_depth. Does an ORDER BY in the query affect its runtime at all?

>...
> but that nested SELECT is expensive. It's not as bad as doing
> per-node queries but it is still too slow, the database query alone is
> slower than 1.6 status. I don't think this is something that can be
> fixed using an index because on my test data the runtime generally
> goes up by orders of magnitude when the indexing is wrong.

Yeah... the more indexes present, the more that need to be maintained
when rows are modified.

> I can get round this by querying for all op_depth and using the cache
> to select the highest. The cache is a hash, keyed on local_relpath,
> that stores the data associated with the highest op_depth seen and it
> simply discards/overwrites lower op_depth. I've prototyped this and

If we order by local_relpath, then the "cache" is simply one row
(hanging out in a structure, waiting to be passed to that callback).

> it's as fast as the per-dir BASE_NODE query on my test data. This is
> not surprising since my test data consists of mostly single op_depth
> with a few double op_depth. We have to have good performance on such
> working copies because they are the most common, I think it will be
> unusual to have a large working copy where lots of nodes have a large
> number of op_depth.

Agreed.

> Now to prior_deleted. The fundamental status query looks like
>
> SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2
>
> and all op_depth are seen. It's quite simple to cache the two highest
> op_depth so prior_deleted only provides information that is already
> available. It's not an optimisation for status, if anything it will
> make the database bigger and slower.

Again, thanks for the analysis. Okay... let's skip it!

>...

Cheers,
-g
Received on 2010-09-28 17:31:06 CEST

This message: [ Message body ]
Next message: Ramkumar Ramachandra: "Re: Some tips on profiling"
Previous message: Ramkumar Ramachandra: "Re: svnrdump tool"
In reply to: Philip Martin: "Re: add NODES.prior_deleted boolean column"
Next in thread: Philip Martin: "Re: add NODES.prior_deleted boolean column"
Reply: Philip Martin: "Re: add NODES.prior_deleted boolean column"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]