Re: [Subclipse-dev] Revision graph and cache implementation

From: Alberto Gimeno <gimenete_at_gmail.com>
Date: Mon, 28 Jul 2008 16:10:59 +0200

Hi all.

I just want to say that I am at Valencia (Spain), at the Campus Party
2008. Google invited me to this event. Here I'll talk about my
experience at the Google Summer of Code. More info:
http://gimenetegsoc.wordpress.com/2008/07/28/hello-from-the-campus-party-2008-valencia-spain/

I'm going to spend this week working in the revision graph. I'm going
to make some experiments with some of the ideas suggested by Stefan. I
hope to have a better cache implementation at the end of this week.

On Wed, Jul 23, 2008 at 11:46 PM, Alberto Gimeno <gimenete_at_gmail.com> wrote:
> Hi Stefan.
>
> Thank you very much for your comments and suggestions. I've read your
> mail but I'll read it again and again to improve my implementation.
>
> I have just had a new idea... I need to think more about it but the idea is:
> * store the log messages without processing them.
> * analyze the log messages just for the selected file (find the origin
> and then follow the copies). Until here is the 'alternative
> implementation' (cache2.html). But instead of 'exploding' the 'change
> paths' table, do the process in memory. The result of that process is
> a single graph. That graph can be serializated to a file, so when the
> user wants to see again the graph it is already calculated (most of
> it, except updates).
>
> In your conclusions you say that the current implementation is not
> scalable ("Large histories will take days or even weeks to get
> cached"). I agree that this is because of the 'exploding' issue. About
> the second approach you say that "is likely to deliver a worse
> "perceived" performance". I think you say that because this approach
> needs to do more things when the user wants to see a graph. With the
> current implementation all graphs are calculated at the same time.
> With the 'alternative' approach this is different: the user needs to
> wait more for each single graph.
>
> However, what implementation do you think is the best? I think the
> second one. Because it is scalable.
>
>
>
> On Wed, Jul 23, 2008 at 3:37 AM, <Stefan.Fuhrmann_at_etas.com> wrote:
>> Hi Alberto,
>>
>> please find below the result of my logorrhea attack ;)
>>
>>
>> General remarks
>> ===============
>>
>> (1) Everything below this point comes with no warranty whatsoever ;)
>> So, judge for yourself what statements actually apply to your
>> situation and use-case.
>>
>> (2) Try your implementation against real-world repositories and
>> collect performance data. Maybe, post the results on the list.
>> For many users, you code may be "good enough" for now.
>>
>> (3) There is a batch file that creates a small but very nasty
>> repository with all kinds of unpleasant copy / delete / replace
>> relationships. If you master that, you are better than TSVN 1.5.0
>> (there is a bug around rev. 19) ;)
>>
>>
>> http://tortoisesvn.tigris.org/svn/tortoisesvn/branches/RefactorRevGraph/test/BuildRevisionGraphTestRepo.bat
>>
>> (user "guest", pw "")
>>
>> Before executing it, be sure to read the first lines (it deletes
>> some paths on your disk!) and adjust them, if necessary.
>>
>> (4) I probably said this before but I just want to adjust expectations.
>> Reaching TSVN-like performance is *very* hard, especially if
>> you are not a guru-level JAVA expert. Not only that it requires
>> sophisticated data structures (thousands LoC), optimizes according
>> to not-so-obvious heuristics (lots of profiling, analysis, tweaking),
>> it also requires a high degree of control over memory layout (even
>> more code) to maximize cache locality. Over 400h hours total ...
>> In other words, not exactly GSOC stuff.
>>
>>
>>
>> Alternative Implementation
>> ==========================
>>
>> This is more or less equivalent to what TSVN does: Load the full
>> log, find the origin of the path to analyze, forward-scan the log
>> and follow all copies.
>>
>> I'm pretty sure that this is impossible to implement efficiently
>> on a RDBMS. This is linked to (4) above. The problem is the following.
>>
>> There may be many copies (tags and branches) to follow. Mainly
>> depending on what is being tagged (e.g. every nightly build),
>> you should scale well up to 1000 copies.
>>
>> Since every change to a path is also a change to its parent (this
>> is how "svn log" works as well), every entry in the "change paths"
>> table may affect *any* of those many copies. Even more so as most
>> changes will actually affect at least one of those paths if you
>> show the graph for /trunk.
>>
>> The solution is as follows: organize the paths to follow as a tree
>> themselves, store the common root of all changes of a given revision
>> and use that to select the affected sub-set of "paths-to-follow".
>> Analyze the changes for those few (usually 1) paths. Performance-
>> critical operation is "IsPathSameOfParentOf".
>>
>> To analyze a KDE.org-sized repository in 2 seconds, TSVN spends
>> less than 1.000 CPU ticks to analyze a single change. Even with 0.1
>> msecs per query and only 3 queries / change you would still be
>> 1000 times slower.
>>
>> The only way to speed that up would be the changes "flowing in"
>> through some kind of callback and keeping everything else, including
>> the added lines in RAM.
>>
>> Bottom line is: this is way out of scope for GSOC'08.
>>
>>
>>
>> Improvements to the current cache implementation
>> ================================================
>>
>> Fortunately, I can see some ways to improve your current design.
>>
>> (1) Add a "derived" flag to the "change paths" table. It it shall
>> be set for all rows that you added, i.e. were not part of the
>> svn log output. This allows the implementation of an off-line
>> log based on the same cache data.
>>
>> (2) Your table is incomplete. Revision 2, for instance, is also
>> a modification to /trunk. In general, all entries cause M
>> entries to all their parent paths.
>>
>> (3) Introduce a "paths" table that simply associates an ID to the
>> actual path string. Use that ID instead of the paths itself
>> in all other tables. You will save 50+% data.
>>
>> Your design basically matches my mental image I had from your earlier
>> descriptions. The only surprise was the "exploded" "change paths"
>> but this seems very reasonable given the simplicity of the final
>> graph query.
>>
>> It's easy to see that this table will be the one with the most rows,
>> thus determining the performance of your code. Maybe, there is a way
>> to reduce it but I won't discuss that in this post.
>>
>> So, with 10k files and a branch / tag every 100 revisions, there
>> will be about 100 rows per revision. A row has 5 key / number columns
>> plus 2 flag columns. Including indices this gives about 32 bytes/row
>> or ~4k per revision. Hence, your DB is at most a few GB and while
>> building the cache @ ~50 revs/sec (typical rate for svn log on a fast
>> repo connection) you add 200k per second. That sounds manageable.
>>
>> One critical operation is finding the sub-tree of paths that are
>> affected by a given added or deleted path. The other is to find
>> all parents to a given path. For those, I propose the following
>> refinement of improvement (3) which is pretty much the same as
>> the TSVN-internal structure:
>>
>> (3a) Add table "path elements": (ID, String), primary index on ID
>> It acts as a dictionary for path and file names, i.e. those
>> strings between "/".
>>
>> (3b) Add table "paths": (ID, parentID, elementID), primary index
>> on ID, secondary index on parentID, secondary index on
>> elementID. The secondary indexes are critical to the query
>> performance and should be defined carefully (depending on the
>> options provided by the RDBMS).
>> It stores every path as a reference to its parent plus the
>> tailing path element. The root path has an empty parentID.
>>
>> Note that a sub-path's ID is always > than the ID of its parent.
>>
>> Finding all parents of a given path is simple string processing
>> and does not require DB support.
>>
>> Adding paths without history requires only two look-ups and one
>> or two INSERT lines plus INSERTs to the "change paths" table.
>> Since all relevant data is very cache-friendly, the look-ups should
>> be fast. The INSERTs should be postponed until the first look-up fails.
>> Adds account for about 20% of all changes, i.e. about 2 per rev.
>>
>> Adding a path with history require a temporary tables: start with
>> the path and find all pathIDs that name it as parentID.
>> Continue with all of them -> one query per folder tree level,
>> i.e. < 10 rounds. Concatenate the results and create INSERTs.
>> Note that the "path elements" may receive at most one INSERT
>> (for the branch / tag name itself). Adds with history are rare:
>> 1 branches / tags every 100 revs, 1 renames every 10 revs.
>>
>> Deletions perform the same sub-tree query as the add with history.
>> It should be optimized the following way: first, check whether
>> there are any sub-nodes at all.
>> For reasons of symmetry, there should be about 20% deletions.
>>
>> Ordinary changes use query the paths table to translate the change
>> paths into IDs. This is the case for the remaining 60% of the changes.
>>
>> Given ~10 changes / rev and ~50 revs / sec these are ~1000 individual
>> DB operations / sec. Maybe possible, maybe not. My guess is that
>> a cache filling rate of 10 revs / sec should be feasible.
>>
>>
>> Alternative implementation
>> --------------------------
>>
>> Have a simple (ID, string) paths table with a secondary index on the
>> string field. Use a single "LIKE" statement to query a sub-tree.
>>
>> While this reduces the number of queries, the size of the table grows
>> significantly: an average path is about 100 chars. In that case, the
>> paths table would be at least twice the size of the "change paths"
>> table (4x the row size but typically referenced from only 2 rows).
>>
>>
>> In-memory implementation
>> ------------------------
>>
>> Having the whole paths and path elements table content represented
>> in memory is feasible w.r.t. to memory size (up to 1GB, usually much
>> less). But loading it is hard.
>>
>>
>> Conclusion
>> ==========
>>
>> The current implementation has room for improvement by simply reducing
>> the data size. Query performance should be o.k. even for large caches
>> that require many disk accesses (100+ nodes / sec). Fill rate, however,
>> remains limited by "exploding" the changes. Large histories will take
>> days or even weeks to get cached. Disk space consumption should be
>> acceptable for desktops and most notebooks.
>>
>> The proposed alternative implementation (cache2.html) is likely to
>> deliver a worse "perceived" performance unless it is paired with a
>> significant amount of in-memory processing.
>>
>> There may be an approach along the sketch in my previous post that
>> circumvents the explosion issue. But I have to think about it.
>>
>>
>> 3:30 AM. Time to go to bed now.
>>
>> -- Stefan^2.
>>
>>
>>
>> "Alberto Gimeno" <gimenete_at_gmail.com> wrote on 07/22/2008 09:50:26 PM:
>>
>>> Hello Stefan.
>>>
>>> I have read your mail several times. I think that I haven't explained
>>> right how my implementation works. I've written a document explaining
>>> step by step how is implemented the cache mechanism at this moment.
>>> And I've also written another document explaining an alternative
>>> algorithm. I would like you to read them.
>>>
>>> Both documents are in HTML format. I'm waiting for your responses.
>>>
>>> Thanks.
>>>
>>> On Wed, Jul 9, 2008 at 4:16 PM, <Stefan.Fuhrmann_at_etas.com> wrote:
>>> > On Tue, 8 Jul 2008 19:28:23 +0200, Alberto Gimeno
>> <gimenete_at_gmail.com>
>>> > wrote:
>>> >
>>> >> I also would like to know your opinions about the "possible
>>> >> alternatives" I talked about in this post:
>>> >> http://gimenetegsoc.wordpress.com/2008/07/01/the-cache-design/
>>> >>
>>> >> In that post I also talk about the cache general design.
>>> >
>>> > Hi Alberto,
>>> >
>>> > I think you are now ready for some constructive criticism ;)
>>> >
>>> > You just learned that efficient algorithms on large, attributed
>>> > graphs are everything but simple. It's an active field of
>>> > scientific research for a reason. Not that they are rocket
>>> > science but those algorithms require careful planning and
>>> > a deep understanding of the nature of the data they describe.
>>> >
>>> > Furthermore, using a RDMS to _process_ hierarchical data is
>>> > a poor choice. But it is not a root of the problem. However,
>>> > you still need some "data dump". And for that, a RDMS is as
>>> > good a choice as any.
>>> >
>>> > That's for the criticism, now to the constructive part ;)
>>> >
>>> >
>>> > Analysis:
>>> > ---------
>>> >
>>> > As I understand it, you are storing for all files and folders
>>> > the list of branches (plus revision ranges) that they are
>>> > part of. Effectively, you pre-build most of the revision graph
>>> > for every single node in the repository file system (minus
>>> > those that are just copies of other nodes).
>>> >
>>> > Let N be the number of files in your project / component
>>> > and R the number of revisions. Then there will be about R/B
>>> > branches and tags with B being a project specific branching
>>> > factor, e.g. "1 branch or tag every 100 revisions". Average
>>> > folder depth is D is proportional to log N. Finally, C is
>>> > the average number of changes per revision.
>>> >
>>> > For a medium-sized project this is N=1000, R=10000, B=100,
>>> > C=10 and D=5. The problem is that a new branch (svn cp /trunk
>>> > /branches/B) creates N new entries in your files table; total
>>> > N*R/B = 100k entries. As R and N are roughly proportional,
>>> > this results in quadratic dependency on the project "size".
>>> > Ultimately, your cache update code will go berserk for larger
>>> > projects.
>>> >
>>> > However, adding 1000 new entries every 100 revisions should
>>> > take less than 1 second. *Finding* all copy sources can be
>>> > an expensive operation if there is no index for the path
>>> > string. E.g. you are looking for "/trunk" <= x < "/trunk\0x1".
>>> >
>>> > Another weak spot is the construction of the graph itself.
>>> > To find the changes for a given file ID, you must read all
>>> > its entries from the files table (approx. R/B entries).
>>> > For every such entry, the list of *relevant* revisions has
>>> > to fetched from the revisions and / or change_paths table.
>>> > One possible solution is a (file_entry_id -> change_id)
>>> > mapping table.
>>> >
>>> > Building this table is expensive: there is at least one
>>> > entry per file_table_id (it's "ADD" revision) plus there
>>> > is one entry per change of *any* sub-path. That's R*C*D
>>> > (500k) entries or about 600k total. What makes things worse
>>> > is that a naive implementation will walk up the path for
>>> > every single change and tries to find the corresponding
>>> > file_entry_id. That's C*D queries per revision!
>>> >
>>> > BTW, how do you handle copy-from info?
>>> >
>>> >
>>> > Solution(s):
>>> > ------------
>>> >
>>> > What you need to optimize for is to find all relevant data
>>> > to draw a *single* revision graph. That is, fetch the data
>>> > from the repository and transform it into some easy-to-access
>>> > representation but do *not* add significant redundancy.
>>> >
>>> > Your idea of having file IDs is in line with SVN issue 1525.
>>> > Keep it. If merge-tacking info gets added sometimes later,
>>> > this abstraction will be very helpful.
>>> >
>>> > Modify the files table: add a column that links to the
>>> > respective parent folder's entry. It is easy to see that this
>>> > is possible and that the revision ranges do fully overlap.
>>> > Add new entries only for those paths that are actually
>>> > mentioned in the log changes. Hence, every copy adds only
>>> > entry to the files table.
>>> >
>>> > To construct the list of all branches, aggregate the lists
>>> > along the parent hierarchy. All parent-only entries in that
>>> > list create a "branch" node and possibly a "deleted" node
>>> > in the graph. All required information in already in the
>>> > file list and no further look-up in the changes list is necessary.
>>> >
>>> > If that is not what you are already doing: Replace the paths
>>> > in the changes table by references to the respective files
>>> > table entry. For file nodes, this makes the retrieval trivial.
>>> >
>>> > Folder nodes are almost as simple to handle: Introduce a new
>>> > table that represents the transitive inversion of the file
>>> > path -> parent path column added to the files table. It also
>>> > contains an entry for every file table row that points to
>>> > the same row (i.e. is reflexive). The latter will also
>>> > allow to run the same query for files and folders alike.
>>> >
>>> > So, it contains about D times more entries than the files table.
>>> > For every path, all direct and indirect sub-paths that actually
>>> > *do* have changes can be found efficiently. The changes table
>>> > can then be JOINed directly.
>>> >
>>> > Based on the numbers collected by the TSVN log cache, you
>>> > can expect the number entries in the files table to be 1/5th
>>> > of the number of entries in the changes table. The new mapping
>>> > table would be less than twice the number of entries than the
>>> > changes table.
>>> >
>>> >
>>> > While the result may still have its limitations, it should
>>> > provide decent performance over a wide range of real-world
>>> > projects.
>>> >
>>> > -- Stefan^2.
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
>>> > For additional commands, e-mail: dev-help_at_subclipse.tigris.org
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Alberto Gimeno Brieba
>>> Presidente y fundador de
>>> Ribe Software S.L.
>>> http://www.ribesoftware.com
>>> ribe_at_ribesoftware.com
>>>
>>> Contacto personal
>>> eMail: gimenete_at_gmail.com
>>> GTalk: gimenete_at_gmail.com
>>> msn: gimenete_at_hotmail.com
>>> página web: http://gimenete.net
>>> teléfono móvil: +34 625 24 64 81
>>>
>>>
>>> +----------------------------------------------------------------------+
>>> | ETAS Mail Security - http://intranet.etasgroup.com/encryption |
>>> +----------------------------------------------------------------------+
>>> | - The message was not encrypted and not digitally signed |
>>> +----------------------------------------------------------------------+
>>> [Anhang "cache.html" gelöscht von Stefan Fuhrmann/DE/ETAS] [Anhang
>> "cache2.html" gelöscht von
>>> Stefan Fuhrmann/DE/ETAS]
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
>>> For additional commands, e-mail: dev-help_at_subclipse.tigris.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
>> For additional commands, e-mail: dev-help_at_subclipse.tigris.org
>>
>>
>
>
>
> --
> Alberto Gimeno Brieba
> Presidente y fundador de
> Ribe Software S.L.
> http://www.ribesoftware.com
> ribe_at_ribesoftware.com
>
> Contacto personal
> eMail: gimenete_at_gmail.com
> GTalk: gimenete_at_gmail.com
> msn: gimenete_at_hotmail.com
> página web: http://gimenete.net
> teléfono móvil: +34 625 24 64 81
>

-- 
Alberto Gimeno Brieba
Presidente y fundador de
Ribe Software S.L.
http://www.ribesoftware.com
ribe_at_ribesoftware.com
Contacto personal
eMail: gimenete_at_gmail.com
GTalk: gimenete_at_gmail.com
msn: gimenete_at_hotmail.com
página web: http://gimenete.net
teléfono móvil: +34 625 24 64 81
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
For additional commands, e-mail: dev-help_at_subclipse.tigris.org

Received on 2008-07-28 16:11:26 CEST

This message: [ Message body ]
Next message: Mark Phippard: "Re: [Subclipse-dev] Revision graph and cache implementation"
Previous message: Mark Phippard: "[Subclipse-dev] Subclipse 1.4.2 Released"
In reply to: Alberto Gimeno: "Re: [Subclipse-dev] Revision graph and cache implementation"
Next in thread: Mark Phippard: "Re: [Subclipse-dev] Revision graph and cache implementation"
Reply: Mark Phippard: "Re: [Subclipse-dev] Revision graph and cache implementation"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]