Re: server-side log cache

From: Stefan Fuhrmann <eqfox_at_web.de>
Date: Thu, 22 Sep 2011 20:43:14 +0200

On 20.09.2011 23:40, Daniel Shahaf wrote:
> Re-reading this to see how it affects the fs-successor-ids work (in
> particular, whether it supersedes the done/planned work), one question:
>
>
> Why do you refer to skip-deltas?
I haven't thought too deeply about the skip-delta issue
but that is my reasoning: For all questions concerning
the "change flow", the noderev level seems to be the
wrong place to go as I tried to explain.

If you are trying to answer repo-internal questions like
"can I remove this chunk of data?", the dependencies
between the individual representations (skip delta,
shared nodereps) may be more interesting.
>
>
> Consider the following transformation to a repository:
>
> for noderev in FS:
> noderev.props['svn:contents'] = noderev.cat()
> noderev.contents = ""
>
> This transformation does not change the successor relations, but it does
> change the skip-deltas mapping. Would your algorithms for
> successors/copyto work correctly on a repository that has undergone the
> above transformation?
My algorithms don't care about noderevs or any layer
below that (including representations). The skip-delta
argument was only to illustrate that the noderevs cache
might be even less helpful.

-- Stefan^2.
>
> Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
>> On 29.08.2011 18:35, Stefan Sperling wrote:
>>
>> Sorry for the late response. I have been knocked
>> out for a while.
>>> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>>>> for what this is all about.
>>>> But the assumptions in that file are actually not valid.
>>> Which ones are invalid? Can you explain in detail?
>> It makes at least 5 implicit assumptions:
>>
>> * Noderevs are the relevant level on which user-level
>> questions can be answered.
>> -> That might not be the case. Noderevs are a purely
>> internal construct. In particular, a noderev might get
>> copied but the actual delta is empty. Thinking about
>> svn obliterate& friends that might actually happen
>> as a result of some node removal / replacement.
>>
>> * User-level queries (where got node copied to?) and
>> internal queries (what are the copy dependencies?)
>> are looking for the same information.
>> -> This is linked to the previous. Noderev copy-to info
>> can be useful to determine the dependency graph.
>> The point here is that we are mixing queries on two
>> different layers here. If we need both, they should
>> both exist as separate APIs.
>>
>> * Noderevs changes are relevant for all svn obliterate.
>> -> That may not always be true. If you only want to
>> remove "unused" history, the references in the skip-
>> delta tree express the relevant dependencies.
>>
>> * Noderev changes are the sufficient to answer the
>> "where has this been copied to?" question.
>> -> They are quite unhelpful here. Noderevs might
>> help find you changes on the node itself (see first
>> assumption). But you need to evaluate the user-
>> level copy operations of all parent folders for all
>> currently known copies. There are far less of these
>> copies (e.g. branch creation) than noderev changes
>> (typ. multiple per revision).
>>
>> * A noderev-based cache alone would reduce the
>> amount of data to inspect / process for these queries.
>> -> The noderev cache is *larger* than e.g. a log cache
>> due to each change also modifying all parent nodes.
>> To be efficient, you need a node-level copy-to cache
>> as well (see previous item).
>>>> * "Where did path P at rev N to M directly or indirectly
>>>> copied to, e.g. which releases contain a certain faulty
>>>> code segment; optionally list changes to targets?"
>>>> -> needs to scan parent paths for copies, too
>>>> (reversed "log", revision graph)
>>> Yes, the successor-id cache only gives us operation roots.
>>> Information for child nodes needs to be derived -- it is not
>>> within the scope of the cache itself.
>> That is unnecessarily complicated / expensive.
>> Using the proposed log cache, the relevant
>> information can be found in a *single*, highly
>> efficient scan.
>>>> It turns out that we can produce even humongous
>>>> reverse logs (50+ k nodes, thousands of branches
>>>> and tags) within a second by simply performing
>>>> a full history scan.
>>>>
>>>> A example of how the whole process can be
>>>> implemented efficiently, can be found here:
>>>>
>>>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
>>> I'll take a look at that, thanks!
>> The gist of it is that you keep all currently known
>> copy targets (e.g. branches, tags) in a tree. For
>> each revision, you walk that tree to find the affected
>> sub-tree. The LRU path gets optimized to speed up
>> the next look-up.
>>
>> A usually fixed repo layout plus typical change
>> patterns result in a quasi-constant processing
>> time for each revision - independent from the
>> number of copies already found.
>>
>> Things become much simpler and faster if you
>> eliminate string operations and use path IDs
>> instead. TSVN does an "svn log" at over 10M
>> revs per second.
>>>>> Storage of successor IDs is essentially a cache of the result of the
>>>>> inversion of the predecessor relation between all node-revisions.
>>>>> This inversion is a *very* expensive operation (follow every node
>>>>> that ever existed in the repository from its most recent revision
>>>>> back to the revision it was created in).
>>>> Not true. At least the "*very* expensive" part.
>>>> Log -v takes about 1min for AFS over ra_local.
>>>> Building any of the index data described below
>>>> can be done in< 10s.
>>> Any of it (if so, which parts?), or all of it?
>> All of it, of course. The expensive part is "svn log -v"
>> which may be expensive on slow disks. Everything
>> else is just writing a few MB of data.
>>>> I propose a modified version of TSVN's log cache
>>>> data structure. The adaptations:
>>>>
>>>> * split into multiple files to reduce modification overhead
>>>> * remove rev-prop-dependent information (log comment,
>>>> author, timestamp)
>>>> * add the reverse copy info
>>>> * simplify the data structures
>>> This looks very interesting.
>>>
>>> What about FSFS-specific requirements?
>> See assumptions above, this may require a different
>> data structure. But I think that noderev dependencies
>> will turn out to be redundant if you have a log cache
>> and access to the skip-delta forwards dependencies.
>>> It sounds like you avoid those by storing data in semantics of the repos
>>> layer (path_at_revision) instead of the FS layer (node-revision-id)?
>> Yes.
>>> In this case separate implementations for FSFS and BDB aren't needed.
>>> This could be an advantage (e.g. third party FS implementations
>>> wouldn't need to change to support this).
>> It also eliminates on of the performance weaknesses
>> of SVN today: A log on some old / seldom changed
>> path can take a very long time.
>>> I'll think about this some more, thanks.
>>>
>> Welcome ;)
>>
>> -- Stefan^2.
Received on 2011-09-22 20:44:06 CEST

This message: [ Message body ]
Next message: Mattias Engdegård: "Re: Subversion 1.7.0 GO / NO-GO"
Previous message: Daniel Shahaf: "Re: svn commit: r1173935 - /subversion/trunk/subversion/include/private/svn_subr_private.h"
In reply to: Daniel Shahaf: "Re: server-side log cache"
Next in thread: Stefan Sperling: "Re: server-side log cache"
Reply: Stefan Sperling: "Re: server-side log cache"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]