Thank you very much for your comments and suggestions. I've read your
mail, and I'll keep re-reading it as I improve my implementation.
I have just had a new idea. I need to think more about it, but the idea is:
* Store the log messages without processing them.
* Analyze the log messages only for the selected file (find the origin
and then follow the copies). Up to this point it is the same as the
'alternative implementation' (cache2.html). But instead of 'exploding'
the 'change paths' table, do the processing in memory. The result of
that processing is a single graph. That graph can be serialized to a
file, so when the user wants to see the graph again it is already
calculated (most of it, except updates).
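A minimal sketch of that serialization step in Java (the RevisionGraph
class and its contents are hypothetical, just to illustrate the idea):

import java.io.*;

// Hypothetical container for the computed graph of one file.
class RevisionGraph implements Serializable {
    private static final long serialVersionUID = 1L;
    // nodes and edges of the graph would live here
}

class GraphStore {
    // Write the computed graph to disk so the next request can skip
    // the analysis (except for revisions newer than the saved graph).
    static void save(RevisionGraph graph, File file) throws IOException {
        ObjectOutputStream out =
            new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(graph);
        } finally {
            out.close();
        }
    }

    // Reload a previously computed graph.
    static RevisionGraph load(File file)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in =
            new ObjectInputStream(new FileInputStream(file));
        try {
            return (RevisionGraph) in.readObject();
        } finally {
            in.close();
        }
    }
}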
In your conclusions you say that the current implementation is not
scalable ("Large histories will take days or even weeks to get
cached"). I agree, and I think this is because of the 'exploding'
issue. About the second approach you say that it "is likely to
deliver a worse 'perceived' performance". I think you say that
because this approach needs to do more work at the moment the user
asks for a graph. With the current implementation all graphs are
calculated at the same time; with the 'alternative' approach the
user has to wait longer for each individual graph.
However, which implementation do you think is best? I think the
second one, because it is scalable.
On Wed, Jul 23, 2008 at 3:37 AM, <Stefan.Fuhrmann_at_etas.com> wrote:
> Hi Alberto,
> please find below the result of my logorrhea attack ;)
> General remarks
> (1) Everything below this point comes with no warranty whatsoever ;)
> So, judge for yourself what statements actually apply to your
> situation and use-case.
> (2) Try your implementation against real-world repositories and
> collect performance data. Maybe, post the results on the list.
> For many users, your code may be "good enough" for now.
> (3) There is a batch file that creates a small but very nasty
> repository with all kinds of unpleasant copy / delete / replace
> relationships. If you master that, you are better than TSVN 1.5.0
> (there is a bug around rev. 19) ;)
> (user "guest", pw "")
> Before executing it, be sure to read the first lines (it deletes
> some paths on your disk!) and adjust them, if necessary.
> (4) I probably said this before but I just want to adjust expectations.
> Reaching TSVN-like performance is *very* hard, especially if
> you are not a guru-level JAVA expert. Not only does it require
> sophisticated data structures (thousands of LoC) and optimization
> according to not-so-obvious heuristics (lots of profiling, analysis,
> tweaking), it also requires a high degree of control over memory
> layout (even more code) to maximize cache locality. Over 400 hours
> total ... In other words, not exactly GSOC stuff.
> Alternative Implementation
> This is more or less equivalent to what TSVN does: Load the full
> log, find the origin of the path to analyze, forward-scan the log
> and follow all copies.
> I'm pretty sure that this is impossible to implement efficiently
> on a RDBMS. This is linked to (4) above. The problem is the following.
> There may be many copies (tags and branches) to follow. Mainly
> depending on what is being tagged (e.g. every nightly build),
> you should scale well up to 1000 copies.
> Since every change to a path is also a change to its parent (this
> is how "svn log" works as well), every entry in the "change paths"
> table may affect *any* of those many copies. Even more so as most
> changes will actually affect at least one of those paths if you
> show the graph for /trunk.
> The solution is as follows: organize the paths to follow as a tree
> themselves, store the common root of all changes of a given revision
> and use that to select the affected sub-set of "paths-to-follow".
> Analyze the changes for those few (usually 1) paths. The performance-
> critical operation is "IsPathSameOrParentOf".
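> A minimal sketch of such a check on normalized '/'-separated paths
> (an illustration only, not the actual TSVN code):
>
> class PathUtil {
>     // True if 'path' equals 'candidate' or is a parent folder of it:
>     // "/trunk" matches "/trunk" and "/trunk/foo" but not "/trunks".
>     // (The root path "/" would need a special case.)
>     static boolean isPathSameOrParentOf(String path, String candidate) {
>         if (!candidate.startsWith(path)) {
>             return false;
>         }
>         return candidate.length() == path.length()
>             || candidate.charAt(path.length()) == '/';
>     }
> }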
> To analyze a KDE.org-sized repository in 2 seconds, TSVN spends
> less than 1,000 CPU ticks to analyze a single change. Even with 0.1
> msecs per query and only 3 queries / change you would still be
> 1000 times slower.
> The only way to speed that up would be to have the changes "flow in"
> through some kind of callback and to keep everything else, including
> the added lines, in RAM.
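> As an illustration, such a callback could look like this in Java (the
> interface and its names are made up for this sketch):
>
> // Hypothetical callback: the log retrieval pushes every change into
> // the analyzer, which keeps all follow-up state in RAM instead of
> // issuing per-change DB queries.
> interface ChangeConsumer {
>     void onChange(long revision, char action,
>                   String path, String copyFromPath);
> }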
> Bottom line is: this is way out of scope for GSOC'08.
> Improvements to the current cache implementation
> Fortunately, I can see some ways to improve your current design.
> (1) Add a "derived" flag to the "change paths" table. It shall
> be set for all rows that you added, i.e. that were not part of the
> svn log output. This allows the implementation of an off-line
> log based on the same cache data.
> (2) Your table is incomplete. Revision 2, for instance, is also
> a modification to /trunk. In general, every entry implies an M
> entry for each of its parent paths.
> (3) Introduce a "paths" table that simply associates an ID with the
> actual path string. Use that ID instead of the path itself
> in all other tables. You will save 50+% of the data.
> Your design basically matches the mental image I had from your earlier
> descriptions. The only surprise was the "exploded" "change paths"
> but this seems very reasonable given the simplicity of the final
> graph query.
> It's easy to see that this table will be the one with the most rows,
> thus determining the performance of your code. Maybe, there is a way
> to reduce it but I won't discuss that in this post.
> So, with 10k files and a branch / tag every 100 revisions, there
> will be about 100 rows per revision. A row has 5 key / number columns
> plus 2 flag columns. Including indices this gives about 32 bytes/row
> or ~4k per revision. Hence, your DB is at most a few GB and while
> building the cache @ ~50 revs/sec (typical rate for svn log on a fast
> repo connection) you add 200k per second. That sounds manageable.
> One critical operation is finding the sub-tree of paths that are
> affected by a given added or deleted path. The other is to find
> all parents to a given path. For those, I propose the following
> refinement of improvement (3) which is pretty much the same as
> the TSVN-internal structure:
> (3a) Add table "path elements": (ID, String), primary index on ID
> It acts as a dictionary for path and file names, i.e. those
> strings between "/".
> (3b) Add table "paths": (ID, parentID, elementID), primary index
> on ID, secondary index on parentID, secondary index on
> elementID. The secondary indexes are critical to the query
> performance and should be defined carefully (depending on the
> options provided by the RDBMS).
> It stores every path as a reference to its parent plus the
> trailing path element. The root path has an empty parentID.
> Note that a sub-path's ID is always greater than the ID of its parent.
> Finding all parents of a given path is simple string processing
> and does not require DB support.
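> A sketch of (3a) and (3b) as SQL DDL issued over JDBC (the exact index
> syntax depends on the RDBMS), plus the string-only parent decomposition:
>
> import java.sql.*;
> import java.util.*;
>
> class PathSchema {
>     static void create(Connection db) throws SQLException {
>         Statement st = db.createStatement();
>         st.execute("CREATE TABLE path_elements ("
>                  + " id INTEGER PRIMARY KEY,"
>                  + " name VARCHAR(255))");
>         st.execute("CREATE TABLE paths ("
>                  + " id INTEGER PRIMARY KEY,"
>                  + " parent_id INTEGER,"      // NULL for the root path
>                  + " element_id INTEGER)");
>         st.execute("CREATE INDEX paths_parent ON paths (parent_id)");
>         st.execute("CREATE INDEX paths_element ON paths (element_id)");
>         st.close();
>     }
>
>     // All parents of a path by pure string processing, no DB access:
>     // "/trunk/src/foo.c" -> ["/trunk", "/trunk/src"]
>     static List<String> parentsOf(String path) {
>         List<String> parents = new ArrayList<String>();
>         for (int i = path.indexOf('/', 1); i > 0;
>                  i = path.indexOf('/', i + 1)) {
>             parents.add(path.substring(0, i));
>         }
>         return parents;
>     }
> }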
> Adding paths without history requires only two look-ups and one
> or two INSERTs, plus INSERTs into the "change paths" table.
> Since all relevant data is very cache-friendly, the look-ups should
> be fast. The INSERTs should be postponed until the first look-up fails.
> Adds account for about 20% of all changes, i.e. about 2 per rev.
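> A sketch of that look-up-first pattern for a single path element (the
> ID allocation shown here is naive; real code would batch and cache):
>
> import java.sql.*;
>
> class ElementLookup {
>     // Find the ID of a path element; INSERT only if the look-up fails.
>     static long elementId(Connection db, String name) throws SQLException {
>         PreparedStatement find = db.prepareStatement(
>             "SELECT id FROM path_elements WHERE name = ?");
>         find.setString(1, name);
>         ResultSet rs = find.executeQuery();
>         if (rs.next()) {
>             return rs.getLong(1);    // common case: element already known
>         }
>         long id = nextId(db);
>         PreparedStatement ins = db.prepareStatement(
>             "INSERT INTO path_elements (id, name) VALUES (?, ?)");
>         ins.setLong(1, id);
>         ins.setString(2, name);
>         ins.executeUpdate();
>         return id;
>     }
>
>     // Naive allocator for the sketch; a real cache would keep a counter.
>     private static long nextId(Connection db) throws SQLException {
>         ResultSet rs = db.createStatement().executeQuery(
>             "SELECT COALESCE(MAX(id), 0) + 1 FROM path_elements");
>         rs.next();
>         return rs.getLong(1);
>     }
> }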
> Adding a path with history requires a temporary table: start with
> the path and find all pathIDs that name it as parentID.
> Continue with all of them -> one query per folder tree level,
> i.e. < 10 rounds. Concatenate the results and create INSERTs.
> Note that the "path elements" may receive at most one INSERT
> (for the branch / tag name itself). Adds with history are rare:
> 1 branch / tag every 100 revs, 1 rename every 10 revs.
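> A sketch of that level-by-level walk over the "paths" table from (3b)
> (a real version would use the temporary table or IN (...) batches
> instead of one statement per node):
>
> import java.sql.*;
> import java.util.*;
>
> class SubTreeQuery {
>     // Collect the IDs of all paths below 'rootId', one tree level at
>     // a time, i.e. typically fewer than 10 rounds.
>     static List<Long> subTreeIds(Connection db, long rootId)
>             throws SQLException {
>         List<Long> result = new ArrayList<Long>();
>         List<Long> level = Collections.singletonList(rootId);
>         PreparedStatement st = db.prepareStatement(
>             "SELECT id FROM paths WHERE parent_id = ?");
>         while (!level.isEmpty()) {
>             List<Long> next = new ArrayList<Long>();
>             for (long id : level) {
>                 st.setLong(1, id);
>                 ResultSet rs = st.executeQuery();
>                 while (rs.next()) {
>                     next.add(rs.getLong(1));
>                 }
>                 rs.close();
>             }
>             result.addAll(next);
>             level = next;
>         }
>         return result;
>     }
> }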
> Deletions perform the same sub-tree query as adds with history.
> It should be optimized the following way: first, check whether
> there are any sub-nodes at all.
> For reasons of symmetry, there should be about 20% deletions.
> Ordinary changes just query the paths table to translate the changed
> paths into IDs. This is the case for the remaining 60% of the changes.
> Given ~10 changes / rev and ~50 revs / sec these are ~1000 individual
> DB operations / sec. Maybe possible, maybe not. My guess is that
> a cache filling rate of 10 revs / sec should be feasible.
> Alternative implementation
> Have a simple (ID, string) paths table with a secondary index on the
> string field. Use a single "LIKE" statement to query a sub-tree.
> While this reduces the number of queries, the size of the table grows
> significantly: an average path is about 100 chars. In that case, the
> paths table would be at least twice the size of the "change paths"
> table (4x the row size but typically referenced from only 2 rows).
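> For completeness, a sketch of that single-statement variant (the '/'
> appended to the pattern keeps "/trunks" from matching "/trunk"):
>
> import java.sql.*;
>
> class LikeSubTree {
>     // One LIKE scan instead of per-level queries; relies on the
>     // secondary index on the full path string.
>     static ResultSet subTree(Connection db, String root)
>             throws SQLException {
>         PreparedStatement st = db.prepareStatement(
>             "SELECT id, path FROM paths WHERE path = ? OR path LIKE ?");
>         st.setString(1, root);
>         st.setString(2, root + "/%");
>         return st.executeQuery();
>     }
> }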
> In-memory implementation
> Having the whole paths and path elements table content represented
> in memory is feasible w.r.t. memory size (up to 1GB, usually much
> less). But loading it is hard.
> The current implementation has room for improvement by simply reducing
> the data size. Query performance should be o.k. even for large caches
> that require many disk accesses (100+ nodes / sec). Fill rate, however,
> remains limited by "exploding" the changes. Large histories will take
> days or even weeks to get cached. Disk space consumption should be
> acceptable for desktops and most notebooks.
> The proposed alternative implementation (cache2.html) is likely to
> deliver a worse "perceived" performance unless it is paired with a
> significant amount of in-memory processing.
> There may be an approach along the sketch in my previous post that
> circumvents the explosion issue. But I have to think about it.
> 3:30 AM. Time to go to bed now.
> -- Stefan^2.
> "Alberto Gimeno" <gimenete_at_gmail.com> wrote on 07/22/2008 09:50:26 PM:
>> Hello Stefan.
>> I have read your mail several times. I think I haven't explained
>> properly how my implementation works. I've written a document explaining
>> step by step how the cache mechanism is implemented at the moment.
>> And I've also written another document explaining an alternative
>> algorithm. I would like you to read them.
>> Both documents are in HTML format. I look forward to your responses.
>> On Wed, Jul 9, 2008 at 4:16 PM, <Stefan.Fuhrmann_at_etas.com> wrote:
>> > On Tue, 8 Jul 2008 19:28:23 +0200, Alberto Gimeno
>> > wrote:
>> >> I also would like to know your opinions about the "possible
>> >> alternatives" I talked about in this post:
>> >> http://gimenetegsoc.wordpress.com/2008/07/01/the-cache-design/
>> >> In that post I also talk about the cache general design.
>> > Hi Alberto,
>> > I think you are now ready for some constructive criticism ;)
>> > You just learned that efficient algorithms on large, attributed
>> > graphs are everything but simple. It's an active field of
>> > scientific research for a reason. Not that they are rocket
>> > science but those algorithms require careful planning and
>> > a deep understanding of the nature of the data they describe.
>> > Furthermore, using a RDBMS to _process_ hierarchical data is
>> > a poor choice. But it is not the root of the problem. However,
>> > you still need some "data dump". And for that, a RDBMS is as
>> > good a choice as any.
>> > So much for the criticism, now to the constructive part ;)
>> > Analysis:
>> > ---------
>> > As I understand it, you are storing for all files and folders
>> > the list of branches (plus revision ranges) that they are
>> > part of. Effectively, you pre-build most of the revision graph
>> > for every single node in the repository file system (minus
>> > those that are just copies of other nodes).
>> > Let N be the number of files in your project / component
>> > and R the number of revisions. Then there will be about R/B
>> > branches and tags with B being a project specific branching
>> > factor, e.g. "1 branch or tag every 100 revisions". Average
>> > folder depth D is proportional to log N. Finally, C is
>> > the average number of changes per revision.
>> > For a medium-sized project this is N=1000, R=10000, B=100,
>> > C=10 and D=5. The problem is that a new branch (svn cp /trunk
>> > /branches/B) creates N new entries in your files table; total
>> > N*R/B = 100k entries. As R and N are roughly proportional,
>> > this results in quadratic dependency on the project "size".
>> > Ultimately, your cache update code will go berserk for larger
>> > projects.
>> > However, adding 1000 new entries every 100 revisions should
>> > take less than 1 second. *Finding* all copy sources can be
>> > an expensive operation if there is no index for the path
>> > string. E.g. you are looking for "/trunk" <= x < "/trunk\x01".
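>> > With such an index in place, a sketch of the range scan (table and
>> > column names are illustrative; pass "/trunk/" to restrict the scan
>> > to the sub-tree only):
>> >
>> > import java.sql.*;
>> >
>> > class CopySourceScan {
>> >     // Range scan for all paths starting with 'prefix': the upper
>> >     // bound is the prefix with its last character incremented,
>> >     // e.g. "/trunk" <= x < "/trunl".
>> >     static PreparedStatement query(Connection db, String prefix)
>> >             throws SQLException {
>> >         int last = prefix.length() - 1;
>> >         String upper = prefix.substring(0, last)
>> >                      + (char) (prefix.charAt(last) + 1);
>> >         PreparedStatement st = db.prepareStatement(
>> >             "SELECT * FROM files WHERE path >= ? AND path < ?");
>> >         st.setString(1, prefix);
>> >         st.setString(2, upper);
>> >         return st;
>> >     }
>> > }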
>> > Another weak spot is the construction of the graph itself.
>> > To find the changes for a given file ID, you must read all
>> > its entries from the files table (approx. R/B entries).
>> > For every such entry, the list of *relevant* revisions has
>> > to be fetched from the revisions and / or change_paths table.
>> > One possible solution is a (file_entry_id -> change_id)
>> > mapping table.
>> > Building this table is expensive: there is at least one
>> > entry per file_table_id (its "ADD" revision) plus there
>> > is one entry per change of *any* sub-path. That's R*C*D
>> > (500k) entries or about 600k total. What makes things worse
>> > is that a naive implementation will walk up the path for
>> > every single change and try to find the corresponding
>> > file_entry_id. That's C*D queries per revision!
>> > BTW, how do you handle copy-from info?
>> > Solution(s):
>> > ------------
>> > What you need to optimize for is to find all relevant data
>> > to draw a *single* revision graph. That is, fetch the data
>> > from the repository and transform it into some easy-to-access
>> > representation but do *not* add significant redundancy.
>> > Your idea of having file IDs is in line with SVN issue 1525.
>> > Keep it. If merge-tracking info gets added sometime later,
>> > this abstraction will be very helpful.
>> > Modify the files table: add a column that links to the
>> > respective parent folder's entry. It is easy to see that this
>> > is possible and that the revision ranges do fully overlap.
>> > Add new entries only for those paths that are actually
>> > mentioned in the log changes. Hence, every copy adds only one
>> > entry to the files table.
>> > To construct the list of all branches, aggregate the lists
>> > along the parent hierarchy. All parent-only entries in that
>> > list create a "branch" node and possibly a "deleted" node
>> > in the graph. All required information is already in the
>> > file list and no further look-up in the changes list is necessary.
>> > If that is not what you are already doing: Replace the paths
>> > in the changes table by references to the respective files
>> > table entry. For file nodes, this makes the retrieval trivial.
>> > Folder nodes are almost as simple to handle: Introduce a new
>> > table that represents the transitive inversion of the file
>> > path -> parent path column added to the files table. It also
>> > contains an entry for every file table row that points to
>> > the same row (i.e. is reflexive). The latter will also
>> > allow running the same query for files and folders alike.
>> > So, it contains about D times more entries than the files table.
>> > For every path, all direct and indirect sub-paths that actually
>> > *do* have changes can be found efficiently. The changes table
>> > can then be JOINed directly.
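>> > A sketch of that inversion table and the JOIN it enables (table
>> > and column names are illustrative):
>> >
>> > import java.sql.*;
>> >
>> > class SubPathIndex {
>> >     // One row per (ancestor, descendant) pair of files-table
>> >     // entries, plus one reflexive row (ancestor_id = file_id) per
>> >     // entry, so the same query works for files and folders alike.
>> >     static void create(Connection db) throws SQLException {
>> >         Statement st = db.createStatement();
>> >         st.execute("CREATE TABLE sub_paths ("
>> >                  + " ancestor_id INTEGER,"
>> >                  + " file_id INTEGER)");
>> >         st.execute("CREATE INDEX sub_paths_anc"
>> >                  + " ON sub_paths (ancestor_id)");
>> >         st.close();
>> >     }
>> >
>> >     // All changes at or below one path, fetched with a single JOIN.
>> >     static PreparedStatement changesBelow(Connection db, long ancestor)
>> >             throws SQLException {
>> >         PreparedStatement st = db.prepareStatement(
>> >             "SELECT c.* FROM sub_paths s"
>> >           + " JOIN changes c ON c.file_id = s.file_id"
>> >           + " WHERE s.ancestor_id = ?");
>> >         st.setLong(1, ancestor);
>> >         return st;
>> >     }
>> > }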
>> > Based on the numbers collected by the TSVN log cache, you
>> > can expect the number of entries in the files table to be 1/5th
>> > of the number of entries in the changes table. The new mapping
>> > table would have fewer than twice as many entries as the
>> > changes table.
>> > While the result may still have its limitations, it should
>> > provide decent performance over a wide range of real-world
>> > projects.
>> > -- Stefan^2.
Alberto Gimeno Brieba
President and founder of
Ribe Software S.L.
Website: http://gimenete.net
Mobile phone: +34 625 24 64 81