Re: server-side log cache

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Wed, 21 Sep 2011 00:40:50 +0300

Re-reading this to see how it affects the fs-successor-ids work (in
particular, whether it supersedes the done/planned work), one question:

Why do you refer to skip-deltas?

Consider the following transformation to a repository:

  for noderev in FS:
    noderev.props['svn:contents'] = noderev.cat()
    noderev.contents = ""

This transformation does not change the successor relations, but it does
change the skip-deltas mapping. Would your algorithms for
successors/copyto work correctly on a repository that has undergone the
above transformation?

Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
> On 29.08.2011 18:35, Stefan Sperling wrote:
>
> Sorry for the late response. I have been knocked
> out for a while.
> >On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >>>See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >>>for what this is all about.
> >>But the assumptions in that file are actually not valid.
> >Which ones are invalid? Can you explain in detail?
> It makes at least 5 implicit assumptions:
>
> * Noderevs are the relevant level on which user-level
> questions can be answered.
> -> That might not be the case. Noderevs are a purely
> internal construct. In particular, a noderev might get
> copied but the actual delta is empty. Thinking about
> svn obliterate & friends that might actually happen
> as a result of some node removal / replacement.
>
> * User-level queries (where got node copied to?) and
> internal queries (what are the copy dependencies?)
> are looking for the same information.
> -> This is linked to the previous. Noderev copy-to info
> can be useful to determine the dependency graph.
> The point here is that we are mixing queries on two
> different layers here. If we need both, they should
> both exist as separate APIs.
>
> * Noderevs changes are relevant for all svn obliterate.
> -> That may not always be true. If you only want to
> remove "unused" history, the references in the skip-
> delta tree express the relevant dependencies.
>
> * Noderev changes are the sufficient to answer the
> "where has this been copied to?" question.
> -> They are quite unhelpful here. Noderevs might
> help find you changes on the node itself (see first
> assumption). But you need to evaluate the user-
> level copy operations of all parent folders for all
> currently known copies. There are far less of these
> copies (e.g. branch creation) than noderev changes
> (typ. multiple per revision).
>
> * A noderev-based cache alone would reduce the
> amount of data to inspect / process for these queries.
> -> The noderev cache is *larger* than e.g. a log cache
> due to each change also modifying all parent nodes.
> To be efficient, you need a node-level copy-to cache
> as well (see previous item).
> >
> >>* "Where did path P at rev N to M directly or indirectly
> >> copied to, e.g. which releases contain a certain faulty
> >> code segment; optionally list changes to targets?"
> >>-> needs to scan parent paths for copies, too
> >> (reversed "log", revision graph)
> >Yes, the successor-id cache only gives us operation roots.
> >Information for child nodes needs to be derived -- it is not
> >within the scope of the cache itself.
> That is unnecessarily complicated / expensive.
> Using the proposed log cache, the relevant
> information can be found in a *single*, highly
> efficient scan.
> >
> >>It turns out that we can produce even humongous
> >>reverse logs (50+ k nodes, thousands of branches
> >>and tags) within a second by simply performing
> >>a full history scan.
> >>
> >>A example of how the whole process can be
> >>implemented efficiently, can be found here:
> >>
> >>https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> >I'll take a look at that, thanks!
> The gist of it is that you keep all currently known
> copy targets (e.g. branches, tags) in a tree. For
> each revision, you walk that tree to find the affected
> sub-tree. The LRU path gets optimized to speed up
> the next look-up.
>
> A usually fixed repo layout plus typical change
> patterns result in a quasi-constant processing
> time for each revision - independent from the
> number of copies already found.
>
> Things become much simpler and faster if you
> eliminate string operations and use path IDs
> instead. TSVN does an "svn log" at over 10M
> revs per second.
> >>>Storage of successor IDs is essentially a cache of the result of the
> >>>inversion of the predecessor relation between all node-revisions.
> >>>This inversion is a *very* expensive operation (follow every node
> >>>that ever existed in the repository from its most recent revision
> >>>back to the revision it was created in).
> >>Not true. At least the "*very* expensive" part.
> >>Log -v takes about 1min for AFS over ra_local.
> >>Building any of the index data described below
> >>can be done in< 10s.
> >Any of it (if so, which parts?), or all of it?
> All of it, of course. The expensive part is "svn log -v"
> which may be expensive on slow disks. Everything
> else is just writing a few MB of data.
> >>I propose a modified version of TSVN's log cache
> >>data structure. The adaptations:
> >>
> >>* split into multiple files to reduce modification overhead
> >>* remove rev-prop-dependent information (log comment,
> >> author, timestamp)
> >>* add the reverse copy info
> >>* simplify the data structures
> >This looks very interesting.
> >
> >What about FSFS-specific requirements?
> See assumptions above, this may require a different
> data structure. But I think that noderev dependencies
> will turn out to be redundant if you have a log cache
> and access to the skip-delta forwards dependencies.
> >It sounds like you avoid those by storing data in semantics of the repos
> >layer (path_at_revision) instead of the FS layer (node-revision-id)?
> Yes.
> >In this case separate implementations for FSFS and BDB aren't needed.
> >This could be an advantage (e.g. third party FS implementations
> >wouldn't need to change to support this).
> It also eliminates on of the performance weaknesses
> of SVN today: A log on some old / seldom changed
> path can take a very long time.
> >
> >I'll think about this some more, thanks.
> >
> Welcome ;)
>
> -- Stefan^2.
Received on 2011-09-20 23:41:39 CEST

This message: [ Message body ]
Next message: Neels J Hofmeyr: "Re: [patch] upgrade externals, URLs in EXTERNALS rows"
Previous message: Daniel Shahaf: "Re: implementing 'upgrade' for fs-successor-ids"
In reply to: Stefan Fuhrmann: "Re: server-side log cache"
Next in thread: Stefan Fuhrmann: "Re: server-side log cache"
Reply: Stefan Fuhrmann: "Re: server-side log cache"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]