[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: server-side log cache

From: Stefan Fuhrmann <eqfox_at_web.de>
Date: Mon, 19 Sep 2011 10:42:55 +0200

On 29.08.2011 18:35, Stefan Sperling wrote:

Sorry for the late response. I have been knocked
out for a while.
> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>> for what this is all about.
>> But the assumptions in that file are actually not valid.
> Which ones are invalid? Can you explain in detail?
It makes at least 5 implicit assumptions:

* Noderevs are the relevant level on which user-level
   questions can be answered.
-> That might not be the case. Noderevs are a purely
    internal construct. In particular, a noderev might get
    copied but the actual delta is empty. Thinking about
    svn obliterate & friends that might actually happen
    as a result of some node removal / replacement.

* User-level queries (where got node copied to?) and
   internal queries (what are the copy dependencies?)
   are looking for the same information.
-> This is linked to the previous. Noderev copy-to info
    can be useful to determine the dependency graph.
    The point here is that we are mixing queries on two
    different layers here. If we need both, they should
    both exist as separate APIs.

* Noderevs changes are relevant for all svn obliterate.
-> That may not always be true. If you only want to
    remove "unused" history, the references in the skip-
    delta tree express the relevant dependencies.

* Noderev changes are the sufficient to answer the
   "where has this been copied to?" question.
-> They are quite unhelpful here. Noderevs might
    help find you changes on the node itself (see first
    assumption). But you need to evaluate the user-
    level copy operations of all parent folders for all
    currently known copies. There are far less of these
    copies (e.g. branch creation) than noderev changes
    (typ. multiple per revision).

* A noderev-based cache alone would reduce the
   amount of data to inspect / process for these queries.
-> The noderev cache is *larger* than e.g. a log cache
    due to each change also modifying all parent nodes.
    To be efficient, you need a node-level copy-to cache
    as well (see previous item).
>
>> * "Where did path P at rev N to M directly or indirectly
>> copied to, e.g. which releases contain a certain faulty
>> code segment; optionally list changes to targets?"
>> -> needs to scan parent paths for copies, too
>> (reversed "log", revision graph)
> Yes, the successor-id cache only gives us operation roots.
> Information for child nodes needs to be derived -- it is not
> within the scope of the cache itself.
That is unnecessarily complicated / expensive.
Using the proposed log cache, the relevant
information can be found in a *single*, highly
efficient scan.
>
>> It turns out that we can produce even humongous
>> reverse logs (50+ k nodes, thousands of branches
>> and tags) within a second by simply performing
>> a full history scan.
>>
>> A example of how the whole process can be
>> implemented efficiently, can be found here:
>>
>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> I'll take a look at that, thanks!
The gist of it is that you keep all currently known
copy targets (e.g. branches, tags) in a tree. For
each revision, you walk that tree to find the affected
sub-tree. The LRU path gets optimized to speed up
the next look-up.

A usually fixed repo layout plus typical change
patterns result in a quasi-constant processing
time for each revision - independent from the
number of copies already found.

Things become much simpler and faster if you
eliminate string operations and use path IDs
instead. TSVN does an "svn log" at over 10M
revs per second.
>>> Storage of successor IDs is essentially a cache of the result of the
>>> inversion of the predecessor relation between all node-revisions.
>>> This inversion is a *very* expensive operation (follow every node
>>> that ever existed in the repository from its most recent revision
>>> back to the revision it was created in).
>> Not true. At least the "*very* expensive" part.
>> Log -v takes about 1min for AFS over ra_local.
>> Building any of the index data described below
>> can be done in< 10s.
> Any of it (if so, which parts?), or all of it?
All of it, of course. The expensive part is "svn log -v"
which may be expensive on slow disks. Everything
else is just writing a few MB of data.
>> I propose a modified version of TSVN's log cache
>> data structure. The adaptations:
>>
>> * split into multiple files to reduce modification overhead
>> * remove rev-prop-dependent information (log comment,
>> author, timestamp)
>> * add the reverse copy info
>> * simplify the data structures
> This looks very interesting.
>
> What about FSFS-specific requirements?
See assumptions above, this may require a different
data structure. But I think that noderev dependencies
will turn out to be redundant if you have a log cache
and access to the skip-delta forwards dependencies.
> It sounds like you avoid those by storing data in semantics of the repos
> layer (path_at_revision) instead of the FS layer (node-revision-id)?
Yes.
> In this case separate implementations for FSFS and BDB aren't needed.
> This could be an advantage (e.g. third party FS implementations
> wouldn't need to change to support this).
It also eliminates on of the performance weaknesses
of SVN today: A log on some old / seldom changed
path can take a very long time.
>
> I'll think about this some more, thanks.
>
Welcome ;)

-- Stefan^2.
Received on 2011-09-19 10:43:39 CEST

This is an archived mail posted to the Subversion Dev mailing list.