On 29.08.2011 18:35, Stefan Sperling wrote:
Sorry for the late response. I have been knocked
out for a while.
> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>> for what this is all about.
>> But the assumptions in that file are actually not valid.
> Which ones are invalid? Can you explain in detail?
It makes at least 5 implicit assumptions:
* Noderevs are the relevant level on which user-level
questions can be answered.
-> That might not be the case. Noderevs are a purely
internal construct. In particular, a noderev might get
copied but the actual delta is empty. Thinking about
svn obliterate & friends that might actually happen
as a result of some node removal / replacement.
* User-level queries (where got node copied to?) and
internal queries (what are the copy dependencies?)
are looking for the same information.
-> This is linked to the previous. Noderev copy-to info
can be useful to determine the dependency graph.
The point here is that we are mixing queries on two
different layers here. If we need both, they should
both exist as separate APIs.
* Noderevs changes are relevant for all svn obliterate.
-> That may not always be true. If you only want to
remove "unused" history, the references in the skip-
delta tree express the relevant dependencies.
* Noderev changes are the sufficient to answer the
"where has this been copied to?" question.
-> They are quite unhelpful here. Noderevs might
help find you changes on the node itself (see first
assumption). But you need to evaluate the user-
level copy operations of all parent folders for all
currently known copies. There are far less of these
copies (e.g. branch creation) than noderev changes
(typ. multiple per revision).
* A noderev-based cache alone would reduce the
amount of data to inspect / process for these queries.
-> The noderev cache is *larger* than e.g. a log cache
due to each change also modifying all parent nodes.
To be efficient, you need a node-level copy-to cache
as well (see previous item).
>
>> * "Where did path P at rev N to M directly or indirectly
>> copied to, e.g. which releases contain a certain faulty
>> code segment; optionally list changes to targets?"
>> -> needs to scan parent paths for copies, too
>> (reversed "log", revision graph)
> Yes, the successor-id cache only gives us operation roots.
> Information for child nodes needs to be derived -- it is not
> within the scope of the cache itself.
That is unnecessarily complicated / expensive.
Using the proposed log cache, the relevant
information can be found in a *single*, highly
efficient scan.
>
>> It turns out that we can produce even humongous
>> reverse logs (50+ k nodes, thousands of branches
>> and tags) within a second by simply performing
>> a full history scan.
>>
>> A example of how the whole process can be
>> implemented efficiently, can be found here:
>>
>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> I'll take a look at that, thanks!
The gist of it is that you keep all currently known
copy targets (e.g. branches, tags) in a tree. For
each revision, you walk that tree to find the affected
sub-tree. The LRU path gets optimized to speed up
the next look-up.
A usually fixed repo layout plus typical change
patterns result in a quasi-constant processing
time for each revision - independent from the
number of copies already found.
Things become much simpler and faster if you
eliminate string operations and use path IDs
instead. TSVN does an "svn log" at over 10M
revs per second.
>>> Storage of successor IDs is essentially a cache of the result of the
>>> inversion of the predecessor relation between all node-revisions.
>>> This inversion is a *very* expensive operation (follow every node
>>> that ever existed in the repository from its most recent revision
>>> back to the revision it was created in).
>> Not true. At least the "*very* expensive" part.
>> Log -v takes about 1min for AFS over ra_local.
>> Building any of the index data described below
>> can be done in< 10s.
> Any of it (if so, which parts?), or all of it?
All of it, of course. The expensive part is "svn log -v"
which may be expensive on slow disks. Everything
else is just writing a few MB of data.
>> I propose a modified version of TSVN's log cache
>> data structure. The adaptations:
>>
>> * split into multiple files to reduce modification overhead
>> * remove rev-prop-dependent information (log comment,
>> author, timestamp)
>> * add the reverse copy info
>> * simplify the data structures
> This looks very interesting.
>
> What about FSFS-specific requirements?
See assumptions above, this may require a different
data structure. But I think that noderev dependencies
will turn out to be redundant if you have a log cache
and access to the skip-delta forwards dependencies.
> It sounds like you avoid those by storing data in semantics of the repos
> layer (path_at_revision) instead of the FS layer (node-revision-id)?
Yes.
> In this case separate implementations for FSFS and BDB aren't needed.
> This could be an advantage (e.g. third party FS implementations
> wouldn't need to change to support this).
It also eliminates on of the performance weaknesses
of SVN today: A log on some old / seldom changed
path can take a very long time.
>
> I'll think about this some more, thanks.
>
Welcome ;)
-- Stefan^2.
Received on 2011-09-19 10:43:39 CEST