[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [Subclipse-dev] Revision graph and cache implementation

From: <Stefan.Fuhrmann_at_etas.com>
Date: Wed, 23 Jul 2008 03:37:42 +0200

Hi Alberto,

please find below the result of my logorrhea attack ;)

General remarks

(1) Everything below this point comes with no warranty whatsoever ;)
    So, judge for yourself what statements actually apply to your
    situation and use-case.

(2) Try your implementation against real-world repositories and
    collect performance data. Maybe, post the results on the list.
    For many users, you code may be "good enough" for now.

(3) There is a batch file that creates a small but very nasty
    repository with all kinds of unpleasant copy / delete / replace
    relationships. If you master that, you are better than TSVN 1.5.0
    (there is a bug around rev. 19) ;)


    (user "guest", pw "")

    Before executing it, be sure to read the first lines (it deletes
    some paths on your disk!) and adjust them, if necessary.

(4) I probably said this before but I just want to adjust expectations.
    Reaching TSVN-like performance is *very* hard, especially if
    you are not a guru-level JAVA expert. Not only that it requires
    sophisticated data structures (thousands LoC), optimizes according
    to not-so-obvious heuristics (lots of profiling, analysis, tweaking),
    it also requires a high degree of control over memory layout (even
    more code) to maximize cache locality. Over 400h hours total ...
    In other words, not exactly GSOC stuff.

Alternative Implementation

This is more or less equivalent to what TSVN does: Load the full
log, find the origin of the path to analyze, forward-scan the log
and follow all copies.

I'm pretty sure that this is impossible to implement efficiently
on a RDBMS. This is linked to (4) above. The problem is the following.

There may be many copies (tags and branches) to follow. Mainly
depending on what is being tagged (e.g. every nightly build),
you should scale well up to 1000 copies.

Since every change to a path is also a change to its parent (this
is how "svn log" works as well), every entry in the "change paths"
table may affect *any* of those many copies. Even more so as most
changes will actually affect at least one of those paths if you
show the graph for /trunk.

The solution is as follows: organize the paths to follow as a tree
themselves, store the common root of all changes of a given revision
and use that to select the affected sub-set of "paths-to-follow".
Analyze the changes for those few (usually 1) paths. Performance-
critical operation is "IsPathSameOfParentOf".

To analyze a KDE.org-sized repository in 2 seconds, TSVN spends
less than 1.000 CPU ticks to analyze a single change. Even with 0.1
msecs per query and only 3 queries / change you would still be
1000 times slower.

The only way to speed that up would be the changes "flowing in"
through some kind of callback and keeping everything else, including
the added lines in RAM.

Bottom line is: this is way out of scope for GSOC'08.

Improvements to the current cache implementation

Fortunately, I can see some ways to improve your current design.

(1) Add a "derived" flag to the "change paths" table. It it shall
    be set for all rows that you added, i.e. were not part of the
    svn log output. This allows the implementation of an off-line
    log based on the same cache data.

(2) Your table is incomplete. Revision 2, for instance, is also
    a modification to /trunk. In general, all entries cause M
    entries to all their parent paths.

(3) Introduce a "paths" table that simply associates an ID to the
    actual path string. Use that ID instead of the paths itself
    in all other tables. You will save 50+% data.

Your design basically matches my mental image I had from your earlier
descriptions. The only surprise was the "exploded" "change paths"
but this seems very reasonable given the simplicity of the final
graph query.

It's easy to see that this table will be the one with the most rows,
thus determining the performance of your code. Maybe, there is a way
to reduce it but I won't discuss that in this post.

So, with 10k files and a branch / tag every 100 revisions, there
will be about 100 rows per revision. A row has 5 key / number columns
plus 2 flag columns. Including indices this gives about 32 bytes/row
or ~4k per revision. Hence, your DB is at most a few GB and while
building the cache @ ~50 revs/sec (typical rate for svn log on a fast
repo connection) you add 200k per second. That sounds manageable.

One critical operation is finding the sub-tree of paths that are
affected by a given added or deleted path. The other is to find
all parents to a given path. For those, I propose the following
refinement of improvement (3) which is pretty much the same as
the TSVN-internal structure:

(3a) Add table "path elements": (ID, String), primary index on ID
     It acts as a dictionary for path and file names, i.e. those
     strings between "/".

(3b) Add table "paths": (ID, parentID, elementID), primary index
     on ID, secondary index on parentID, secondary index on
     elementID. The secondary indexes are critical to the query
     performance and should be defined carefully (depending on the
     options provided by the RDBMS).
     It stores every path as a reference to its parent plus the
     tailing path element. The root path has an empty parentID.

     Note that a sub-path's ID is always > than the ID of its parent.

Finding all parents of a given path is simple string processing
and does not require DB support.

Adding paths without history requires only two look-ups and one
or two INSERT lines plus INSERTs to the "change paths" table.
Since all relevant data is very cache-friendly, the look-ups should
be fast. The INSERTs should be postponed until the first look-up fails.
Adds account for about 20% of all changes, i.e. about 2 per rev.

Adding a path with history require a temporary tables: start with
the path and find all pathIDs that name it as parentID.
Continue with all of them -> one query per folder tree level,
i.e. < 10 rounds. Concatenate the results and create INSERTs.
Note that the "path elements" may receive at most one INSERT
(for the branch / tag name itself). Adds with history are rare:
1 branches / tags every 100 revs, 1 renames every 10 revs.

Deletions perform the same sub-tree query as the add with history.
It should be optimized the following way: first, check whether
there are any sub-nodes at all.
For reasons of symmetry, there should be about 20% deletions.

Ordinary changes use query the paths table to translate the change
paths into IDs. This is the case for the remaining 60% of the changes.

Given ~10 changes / rev and ~50 revs / sec these are ~1000 individual
DB operations / sec. Maybe possible, maybe not. My guess is that
a cache filling rate of 10 revs / sec should be feasible.

Alternative implementation

Have a simple (ID, string) paths table with a secondary index on the
string field. Use a single "LIKE" statement to query a sub-tree.

While this reduces the number of queries, the size of the table grows
significantly: an average path is about 100 chars. In that case, the
paths table would be at least twice the size of the "change paths"
table (4x the row size but typically referenced from only 2 rows).

In-memory implementation

Having the whole paths and path elements table content represented
in memory is feasible w.r.t. to memory size (up to 1GB, usually much
less). But loading it is hard.


The current implementation has room for improvement by simply reducing
the data size. Query performance should be o.k. even for large caches
that require many disk accesses (100+ nodes / sec). Fill rate, however,
remains limited by "exploding" the changes. Large histories will take
days or even weeks to get cached. Disk space consumption should be
acceptable for desktops and most notebooks.

The proposed alternative implementation (cache2.html) is likely to
deliver a worse "perceived" performance unless it is paired with a
significant amount of in-memory processing.

There may be an approach along the sketch in my previous post that
circumvents the explosion issue. But I have to think about it.

3:30 AM. Time to go to bed now.

-- Stefan^2.

"Alberto Gimeno" <gimenete_at_gmail.com> wrote on 07/22/2008 09:50:26 PM:

> Hello Stefan.
> I have read your mail several times. I think that I haven't explained
> right how my implementation works. I've written a document explaining
> step by step how is implemented the cache mechanism at this moment.
> And I've also written another document explaining an alternative
> algorithm. I would like you to read them.
> Both documents are in HTML format. I'm waiting for your responses.
> Thanks.
> On Wed, Jul 9, 2008 at 4:16 PM, <Stefan.Fuhrmann_at_etas.com> wrote:
> > On Tue, 8 Jul 2008 19:28:23 +0200, Alberto Gimeno
> > wrote:
> >
> >> I also would like to know your opinions about the "possible
> >> alternatives" I talked about in this post:
> >> http://gimenetegsoc.wordpress.com/2008/07/01/the-cache-design/
> >>
> >> In that post I also talk about the cache general design.
> >
> > Hi Alberto,
> >
> > I think you are now ready for some constructive criticism ;)
> >
> > You just learned that efficient algorithms on large, attributed
> > graphs are everything but simple. It's an active field of
> > scientific research for a reason. Not that they are rocket
> > science but those algorithms require careful planning and
> > a deep understanding of the nature of the data they describe.
> >
> > Furthermore, using a RDMS to _process_ hierarchical data is
> > a poor choice. But it is not a root of the problem. However,
> > you still need some "data dump". And for that, a RDMS is as
> > good a choice as any.
> >
> > That's for the criticism, now to the constructive part ;)
> >
> >
> > Analysis:
> > ---------
> >
> > As I understand it, you are storing for all files and folders
> > the list of branches (plus revision ranges) that they are
> > part of. Effectively, you pre-build most of the revision graph
> > for every single node in the repository file system (minus
> > those that are just copies of other nodes).
> >
> > Let N be the number of files in your project / component
> > and R the number of revisions. Then there will be about R/B
> > branches and tags with B being a project specific branching
> > factor, e.g. "1 branch or tag every 100 revisions". Average
> > folder depth is D is proportional to log N. Finally, C is
> > the average number of changes per revision.
> >
> > For a medium-sized project this is N=1000, R=10000, B=100,
> > C=10 and D=5. The problem is that a new branch (svn cp /trunk
> > /branches/B) creates N new entries in your files table; total
> > N*R/B = 100k entries. As R and N are roughly proportional,
> > this results in quadratic dependency on the project "size".
> > Ultimately, your cache update code will go berserk for larger
> > projects.
> >
> > However, adding 1000 new entries every 100 revisions should
> > take less than 1 second. *Finding* all copy sources can be
> > an expensive operation if there is no index for the path
> > string. E.g. you are looking for "/trunk" <= x < "/trunk\0x1".
> >
> > Another weak spot is the construction of the graph itself.
> > To find the changes for a given file ID, you must read all
> > its entries from the files table (approx. R/B entries).
> > For every such entry, the list of *relevant* revisions has
> > to fetched from the revisions and / or change_paths table.
> > One possible solution is a (file_entry_id -> change_id)
> > mapping table.
> >
> > Building this table is expensive: there is at least one
> > entry per file_table_id (it's "ADD" revision) plus there
> > is one entry per change of *any* sub-path. That's R*C*D
> > (500k) entries or about 600k total. What makes things worse
> > is that a naive implementation will walk up the path for
> > every single change and tries to find the corresponding
> > file_entry_id. That's C*D queries per revision!
> >
> > BTW, how do you handle copy-from info?
> >
> >
> > Solution(s):
> > ------------
> >
> > What you need to optimize for is to find all relevant data
> > to draw a *single* revision graph. That is, fetch the data
> > from the repository and transform it into some easy-to-access
> > representation but do *not* add significant redundancy.
> >
> > Your idea of having file IDs is in line with SVN issue 1525.
> > Keep it. If merge-tacking info gets added sometimes later,
> > this abstraction will be very helpful.
> >
> > Modify the files table: add a column that links to the
> > respective parent folder's entry. It is easy to see that this
> > is possible and that the revision ranges do fully overlap.
> > Add new entries only for those paths that are actually
> > mentioned in the log changes. Hence, every copy adds only
> > entry to the files table.
> >
> > To construct the list of all branches, aggregate the lists
> > along the parent hierarchy. All parent-only entries in that
> > list create a "branch" node and possibly a "deleted" node
> > in the graph. All required information in already in the
> > file list and no further look-up in the changes list is necessary.
> >
> > If that is not what you are already doing: Replace the paths
> > in the changes table by references to the respective files
> > table entry. For file nodes, this makes the retrieval trivial.
> >
> > Folder nodes are almost as simple to handle: Introduce a new
> > table that represents the transitive inversion of the file
> > path -> parent path column added to the files table. It also
> > contains an entry for every file table row that points to
> > the same row (i.e. is reflexive). The latter will also
> > allow to run the same query for files and folders alike.
> >
> > So, it contains about D times more entries than the files table.
> > For every path, all direct and indirect sub-paths that actually
> > *do* have changes can be found efficiently. The changes table
> > can then be JOINed directly.
> >
> > Based on the numbers collected by the TSVN log cache, you
> > can expect the number entries in the files table to be 1/5th
> > of the number of entries in the changes table. The new mapping
> > table would be less than twice the number of entries than the
> > changes table.
> >
> >
> > While the result may still have its limitations, it should
> > provide decent performance over a wide range of real-world
> > projects.
> >
> > -- Stefan^2.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
> > For additional commands, e-mail: dev-help_at_subclipse.tigris.org
> >
> >
> --
> Alberto Gimeno Brieba
> Presidente y fundador de
> Ribe Software S.L.
> http://www.ribesoftware.com
> ribe_at_ribesoftware.com
> Contacto personal
> eMail: gimenete_at_gmail.com
> GTalk: gimenete_at_gmail.com
> msn: gimenete_at_hotmail.com
> página web: http://gimenete.net
> teléfono móvil: +34 625 24 64 81
> +----------------------------------------------------------------------+
> | ETAS Mail Security - http://intranet.etasgroup.com/encryption |
> +----------------------------------------------------------------------+
> | - The message was not encrypted and not digitally signed |
> +----------------------------------------------------------------------+
> [Anhang "cache.html" gelöscht von Stefan Fuhrmann/DE/ETAS] [Anhang
"cache2.html" gelöscht von
> Stefan Fuhrmann/DE/ETAS]
> To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
> For additional commands, e-mail: dev-help_at_subclipse.tigris.org

To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
For additional commands, e-mail: dev-help_at_subclipse.tigris.org
Received on 2008-07-23 03:38:00 CEST

This is an archived mail posted to the Subclipse Dev mailing list.