Can anyone spare a bit of time to take this forward? I'll tell you what
I'm planning to do next. Maybe someone can get to this before I can.
The first thing I want to do is add a DB index on the 'refcount' column
of the 'pristine' table. I'm convinced that will be necessary for
adequate performance. To do so involves:
* Adding the 'CREATE INDEX' SQL statement. That's simple - just
copy/modify/paste one line in wc-metadata.sql.
* Adding a WC format bump step that adds the index to existing dev
WCs. Also quite simple.
* I'd be happier if I ran a timing comparison, before and after, with
searching for unreferenced pristines (those whose refcount is 0). That
would give me confidence that adding the index has actually happened and
that it has the desired effect. It would need a test with a large
number of pristines, a scattered selection of them having refcount=0,
and then time the running of svn_wc__db_pristine_cleanup(). With an
index, this should find them all very quickly and the deletion time
should dominate the whole cleanup operation, whereas without an index
the time taken to scan the whole table should dominate when the table is
very large.
On Thu, 2011-01-13, Branko Čibej wrote:
> On 13.01.2011 20:01, Julian Foad wrote:
> > I have committed the ref counting for pristine texts (r1058523) and have
> > had a bit more insight into when to perform the deletion.
> >
> > Deleting unreferenced texts on closing a "wcroot" is too late - too
> > large a granularity - for the way the WC code is currently structured
> > because this doesn't happen until a (long-lived) pool is cleaned up, as
> > Bert pointed out.
> >
> > Deleting unreferenced texts after running the Work Queue is too soon -
> > too fine a granularity - for the way "commit" is currently structured.
> > A simple test of this approach showed that by the time the post-commit
> > processing tries to install a reference to a pristine text whose
> > checksum was noted earlier, in some cases that pristine row has already
> > been purged.
>
> This would indicate that the reference counting happens too soon ... in
> other words, that a pristine can be dereferenced whilst some part of the
> code (or database) still refers to it. That breaks database consistency
> -- what happens if the user aborts a commit and then calls 'svn
> cleanup', for example?
If what I said about 'commit' is correct, then yes, that's bad and we
should look for a better way. But I haven't tested it properly; I
noticed that the commit failed saying that the DB failed a 'REFERENCES'
clause, and what I said here is a hypothesis about how that happened.
Investingating this is the second thing I would want to do next.
> > At the moment I think the practical solution is to insert calls to the
> > pristine cleanup at the end of each main libsvn_wc public API that may
> > have freed up one or more pristines.
>
> Mmm. Too many points of failure, don't you think? Of course, it's no
> "worse" than the cancellation handlers, however, and I suppose missing a
> chance to delete is better than having one too many.
I'm thinking this is nevertheless the best option. The consequence of
missing some places is not too bad at all. If we miss places where we
should clean up, then all it means is that some operations will leak
unused pristines, but they won't be leaked permanently, only until the
user runs one of the operations that do have the cleanup call in them.
That's a consequence of calling a global "clean up all unreferenced
pristines" function rather than trying to keep track of which pristines
have been freed up by *this* operation.
> > Wherever we do it, if it's at all frequent, the search for unreferenced
> > pristines needs to be efficient. At present I use "SELECT ... FROM
> > PRISTINE WHERE refcount = 0". I believe that can most easily be made
> > efficient by adding an index on the 'refcount' column, so that's what I
> > intend to do.
>
> That's the only thing you can do, unless you want to enhance the
> triggers to populate a queue of to-be-deleted pristines.
I agree that adding an index on the table is a better solution than
trying to maintain a separate queue of the pristines that are waiting to
be cleaned up. The index makes it efficient for a simple query
('refcount = 0') to provide the same information as the separate queue
would provide.
- Julian
Received on 2011-01-18 16:59:26 CET