server-side log cache (was: Re: FSFS successor ID design draft)

From: Stefan Sperling <stsp_at_elego.de>
Date: Mon, 29 Aug 2011 18:35:46 +0200

On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >for what this is all about.
> But the assumptions in that file are actually not valid.

Which ones are invalid? Can you explain in detail?

> * "Where was path P at rev N to M directly or indirectly
> copied to, e.g. which releases contain a certain faulty
> code segment; optionally list changes to targets?"
> -> needs to scan parent paths for copies, too
> (reversed "log", revision graph)

Yes, the successor-id cache only gives us operation roots.
Information for child nodes needs to be derived -- it is not
within the scope of the cache itself.
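
To make "derived" a bit more concrete, here is a rough sketch in Python.
It is not code from the branch: the Copy record, the function names and
the flat list of copies are all made up for illustration. It assumes we
already have the copy operation roots, and it answers "where did this
child path end up?" by scanning for copies of the path or of any parent
path. It handles direct copies only; indirect ones would need the same
step repeated on each result.

```python
from typing import List, NamedTuple, Tuple


class Copy(NamedTuple):
    """One copy operation root (hypothetical record layout)."""
    src_path: str  # path that was copied
    src_rev: int   # revision the copy was taken from
    dst_path: str  # path the copy was created at
    dst_rev: int   # revision in which the copy was committed


def is_ancestor_or_self(parent: str, path: str) -> bool:
    """True if `parent` equals `path` or is one of its parent paths."""
    return path == parent or path.startswith(parent.rstrip("/") + "/")


def derived_copies(path: str, first_rev: int, last_rev: int,
                   copies: List[Copy]) -> List[Tuple[str, int]]:
    """Return (dst_path, dst_rev) for each direct copy of `path`.

    A copy counts if its source is `path` itself or a parent of it and
    its source revision lies within [first_rev, last_rev].
    """
    result = []
    for c in copies:
        if not (first_rev <= c.src_rev <= last_rev):
            continue
        if is_ancestor_or_self(c.src_path, path):
            # Re-root the part of `path` below the copy source
            # onto the copy destination.
            suffix = path[len(c.src_path.rstrip("/")):]
            result.append((c.dst_path.rstrip("/") + suffix, c.dst_rev))
    return result


copies = [
    Copy("/trunk", 10, "/tags/1.0", 11),
    Copy("/trunk/lib", 5, "/branches/lib-x", 6),
    Copy("/other", 7, "/tags/o", 8),
]
targets = derived_copies("/trunk/lib/a.c", 1, 20, copies)
```

Here the file was never copied itself, but it shows up under both the
tag and the branch because its parents were copied.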

> It turns out that we can produce even humongous
> reverse logs (50+ k nodes, thousands of branches
> and tags) within a second by simply performing
> a full history scan.
>
> An example of how the whole process can be
> implemented efficiently can be found here:
>
> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp

I'll take a look at that, thanks!

> >Storage of successor IDs is essentially a cache of the result of the
> >inversion of the predecessor relation between all node-revisions.
> >This inversion is a *very* expensive operation (follow every node
> >that ever existed in the repository from its most recent revision
> >back to the revision it was created in).
> Not true. At least the "*very* expensive" part.
> Log -v takes about 1min for AFS over ra_local.
> Building any of the index data described below
> can be done in < 10s.

Any of it (if so, which parts?), or all of it?
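
For reference, the inversion the BRANCH-README text talks about looks
roughly like this once the predecessor links are in memory. This is a
sketch only: the ids are plain strings I made up, not real FSFS
node-revision ids, and the dict input is an assumption. It says nothing
about the cost of reading every node-revision from disk, which is the
part under discussion.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def invert_predecessors(
        predecessor_of: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each node-revision to the node-revisions that succeed it.

    `predecessor_of` maps a node-revision id to its predecessor id,
    or to None if the node-revision has no predecessor.
    """
    successors = defaultdict(list)
    for node, pred in predecessor_of.items():
        if pred is not None:
            successors[pred].append(node)
    return dict(successors)


# Hypothetical ids: "a.r2" is a later revision of "a.r1",
# "b.r3" was copied from "a.r1".
successors = invert_predecessors({
    "a.r1": None,
    "a.r2": "a.r1",
    "b.r3": "a.r1",
})
```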

> I propose a modified version of TSVN's log cache
> data structure. The adaptations:
>
> * split into multiple files to reduce modification overhead
> * remove rev-prop-dependent information (log comment,
> author, timestamp)
> * add the reverse copy info
> * simplify the data structures

This looks very interesting.

What about FSFS-specific requirements?
It sounds like you avoid those by storing data in terms of repos
layer semantics (path@revision) instead of FS layer semantics
(node-revision-id)?
In that case separate implementations for FSFS and BDB aren't needed.
This could be an advantage (e.g. third party FS implementations
wouldn't need to change to support this).

I'll think about this some more, thanks.
Received on 2011-08-29 18:36:20 CEST

In reply to: Stefan Fuhrmann: "Re: FSFS successor ID design draft"