Re: Do we need to keep mergeinfo in the mergeinfo database?

From: Michael Sinz <Michael.Sinz_at_sinz.org>
Date: 2007-11-29 20:29:32 CET

On Nov 29, 2007 12:47 AM, David Glasser <glasser@davidglasser.net> wrote:
>
> On Nov 28, 2007 2:25 PM, David Glasser <glasser@davidglasser.net> wrote:
> > So I noticed (http://svn.haxx.se/dev/archive-2007-11/1168.shtml) that
> > we can currently make the mergeinfo db be out of sync with the
> > svn:mergeinfo properties. What this means is that the indexing code
> > needs to be fixed. Now, the indexing code is pretty complicated; it
> > has all sorts of "parse and unparse mergeinfo in and out of the table"
> > stuff. We certainly can fix it. But this set me off to figuring out
> > what we're actually using the mergeinfo code for.
> >
> >
> > CURRENT USE
> > ===========
> >
> > Currently (and yes, this is ignoring the issue-2897 branch, though
> > whether or not that branch is doing the right thing is still under
> > debate) mergeinfo-sqlite-index.c only contains three SELECT
> > statements (not counting an optimization done during commit). There
> > are also only two APIs that the FS modules use to read from the DB:
> > svn_fs_mergeinfo__get_mergeinfo and
> > svn_fs_mergeinfo__get_mergeinfo_for_tree.
> >
> > SELECT mergedfrom, mergedrevstart, mergedrevend, inheritable FROM
> > mergeinfo WHERE mergedto = ? AND revision = ? ORDER BY mergedfrom,
> > mergedrevstart;
> >
> > I.e., "get the mergeinfo on MERGEDTO@REVISION". Semantically
> > equivalent to just doing svn_fs_node_prop(&value, rev_root(REVISION),
> > MERGEDTO, "svn:mergeinfo"). (And in fact one of the two APIs
> > converts it back into a string anyway.) Note that the code guarantees
> > that REVISION is a rev where mergeinfo changed on the path.
> >
> > SELECT MAX(revision) FROM mergeinfo_changed WHERE path = ? AND
> > revision <= ?;
> >
> > Ie, "find the last revision before REVISION where mergeinfo changed on
> > PATH".
> >
> > SELECT MAX(revision), path FROM mergeinfo_changed WHERE path LIKE ?
> > AND revision <= ? GROUP BY path;
> >
> > (PATH is always "some/path/%" here.)
> > Ie, "find the paths under PATH which have mergeinfo at REVISION (or
> > have ever had
> > mergeinfo before REVISION, I think?) and return them with the last
> > changed revision".
> >
> >
> >
> > OBSERVATIONS
> > ============
> >
> > Note a couple things here:
> >
> > * The first SELECT statement is completely redundant with
> > svn_fs_node_prop.
> >
> > * The first SELECT statement is the only thing that uses any column
> > other than path name (mergedto/path) and revision.
> >
> > * The only reason for the second SELECT statement is to make sure to
> > pass the right revision to the first SELECT statement. In fact, if
> > we just looked on the DB itself, the second statement wouldn't be
> > necessary at all.
> >
> > * The third SELECT statement is actually doing something interesting:
> > it's doing a prefix-match. That is, it's doing the equivalent of a
> > recursive tree-walk in the FS, without actually needing to walk the
> > tree. (And as of r28077, this actually gets to use an index.)
> >
> >
> >
> > PROPOSAL
> > ========
> >
> > Let's move the mergeinfo out of the mergeinfo database. Let's just
> > make the mergeinfo database have one table:
> >
> > CREATE TABLE mergeinfo
> > ( path TEXT NOT NULL,
> > revision INTEGER NOT NULL,
> > action INTEGER NOT NULL );
> >
> > where action is an enum representing 'added', 'modified', or
> > 'deleted'.
> >
> > svn_fs_mergeinfo__get_mergeinfo should not touch the database at all.
> >
> > svn_fs_mergeinfo__get_mergeinfo_for_tree will use the database to
> > figure out which paths under P actually have mergeinfo, and use the FS
> > to find out the actual mergeinfo data.
> >
> > (One side effect of this is that there will be a lot less work done to
> > update the tables, which means less work done with the FSFS write lock
> > held, which is good. My guess is that this change, if feasible, will
> > improve the efficiency of commits. I have no idea what the
> > performance impact on read ops will be; it'll keep the index table
> > smaller, but require more FS access.)
>
> Actually, if you accept that the only operation we need to speed up is
> "find all nodes with mergeinfo under path P", then even this
> single-table solution is overkill.
>
> I think we can ditch sqlite entirely and still keep that operation efficient.

+1 - I generally like that idea - mainly due to the complexity of
having the same information in two different forms.

> Just add a new field to the node-rev hash: "number of nodes with
> mergeinfo below me (including myself)". This number is pretty much
> trivial to keep accurate, and then we only need to descend down paths
> with it >0 in a recursive FS crawl.

Making the FS operations be able to support the behaviors needed at
reasonable performance levels seems to me to be a better way than to
try to keep two disjoint datasets in sync with each other. (And would
make the whole dump/load process a cleaner, more manageable one).

As we grow our needs for this metadata we can grow the access
methods/optimizations. (or even move to a database if all else fails -
but then maybe a different storage backend would be even more useful)

-- 
Michael Sinz               Technology and Engineering Director/Consultant
"Starting Startups"                          mailto:Michael.Sinz@sinz.org
My place on the web                      http://www.sinz.org/Michael.Sinz
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Thu Nov 29 20:29:55 2007

This message: [ Message body ]
Next message: Michael Sinz: "Re: Issue 2897 revisited. Really."
Previous message: David Glasser: "Re: Do we need to keep mergeinfo in the mergeinfo database?"
In reply to: David Glasser: "Re: Do we need to keep mergeinfo in the mergeinfo database?"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]