I have read your mail several times. I don't think I explained properly
how my implementation works, so I've written a document explaining
step by step how the cache mechanism is currently implemented,
and another document explaining an alternative algorithm.
I would like you to read them.
Both documents are in HTML format. I look forward to your responses.
On Wed, Jul 9, 2008 at 4:16 PM, <Stefan.Fuhrmann_at_etas.com> wrote:
> On Tue, 8 Jul 2008 19:28:23 +0200, Alberto Gimeno <gimenete_at_gmail.com> wrote:
>> I also would like to know your opinions about the "possible
>> alternatives" I talked about in this post:
>> In that post I also talk about the cache general design.
> Hi Alberto,
> I think you are now ready for some constructive criticism ;)
> You just learned that efficient algorithms on large, attributed
> graphs are anything but simple. It's an active field of
> scientific research for a reason. Not that they are rocket
> science, but those algorithms require careful planning and
> a deep understanding of the nature of the data they describe.
> Furthermore, using an RDBMS to _process_ hierarchical data is
> a poor choice. But it is not the root of the problem. However,
> you still need some "data dump", and for that an RDBMS is as
> good a choice as any.
> So much for the criticism; now to the constructive part ;)
> As I understand it, you are storing for all files and folders
> the list of branches (plus revision ranges) that they are
> part of. Effectively, you pre-build most of the revision graph
> for every single node in the repository file system (minus
> those that are just copies of other nodes).
> Let N be the number of files in your project / component
> and R the number of revisions. Then there will be about R/B
> branches and tags, with B being a project-specific branching
> factor, e.g. "1 branch or tag every 100 revisions". The average
> folder depth D is proportional to log N. Finally, C is
> the average number of changes per revision.
> For a medium-sized project this is N=1000, R=10000, B=100,
> C=10 and D=5. The problem is that a new branch (svn cp /trunk
> /branches/B) creates N new entries in your files table; total
> N*R/B = 100k entries. As R and N are roughly proportional,
> this results in quadratic dependency on the project "size".
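To make that arithmetic concrete, here is the cost model as a few lines of Python (the figures are the example numbers from the paragraph above, not measurements):

```python
# Cost model for the pre-built files table, using the example figures
# N=1000 files, R=10000 revisions, B=100 revisions per branch/tag.
N, R, B = 1000, 10000, 100

branches = R // B          # ~100 branches and tags over the project's life
copy_rows = N * branches   # each "svn cp /trunk /branches/X" adds N rows

print(branches, copy_rows)  # 100 100000
```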
> Ultimately, your cache update code will go berserk for larger projects.
> However, adding 1000 new entries every 100 revisions should
> take less than 1 second. *Finding* all copy sources can be
> an expensive operation if there is no index for the path
> string. E.g. you are looking for "/trunk" <= x < "/trunk\x01".
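A concrete sketch of such an indexed range scan, using SQLite and invented table/column names (here the upper bound is prefix + "0", the character right after "/", a slightly tighter variant of the same trick):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT)")
con.execute("CREATE INDEX idx_files_path ON files (path)")  # enables a range scan
con.executemany("INSERT INTO files VALUES (?)",
                [("/trunk",), ("/trunk/src/a.c",), ("/branches/B/a.c",)])

# Everything at or below /trunk: an exact match plus the half-open
# range ["/trunk/", "/trunk0"), since "0" is the character after "/".
prefix = "/trunk"
rows = con.execute(
    "SELECT path FROM files WHERE path = ? OR (path >= ? AND path < ?)",
    (prefix, prefix + "/", prefix + "0")).fetchall()
print(sorted(r[0] for r in rows))  # ['/trunk', '/trunk/src/a.c']
```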
> Another weak spot is the construction of the graph itself.
> To find the changes for a given file ID, you must read all
> its entries from the files table (approx. R/B entries).
> For every such entry, the list of *relevant* revisions has
> to be fetched from the revisions and/or change_paths table.
> One possible solution is a (file_entry_id -> change_id)
> mapping table.
> Building this table is expensive: there is at least one
> entry per file_table_id (its "ADD" revision), plus there
> is one entry per change of *any* sub-path. That's R*C*D
> (500k) entries, or about 600k total. What makes things worse
> is that a naive implementation will walk up the path for
> every single change and try to find the corresponding
> file_entry_id. That's C*D queries per revision!
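The walk-up described here can be sketched as follows (the path handling is generic; the cost figures merely restate the example numbers above):

```python
def ancestors(path):
    """Yield the path itself and every parent folder, root excluded."""
    while path and path != "/":
        yield path
        path = path.rsplit("/", 1)[0] or "/"

# One change touches D entries on its way up the tree:
chain = list(ancestors("/trunk/src/io/file.c"))
print(chain)  # ['/trunk/src/io/file.c', '/trunk/src/io', '/trunk/src', '/trunk']

# With the example figures, the naive scheme issues C*D queries per
# revision and the mapping table grows to roughly R*C*D entries.
R, C, D = 10000, 10, 5
print(C * D, R * C * D)  # 50 500000
```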
> BTW, how do you handle copy-from info?
> What you need to optimize for is to find all relevant data
> to draw a *single* revision graph. That is, fetch the data
> from the repository and transform it into some easy-to-access
> representation but do *not* add significant redundancy.
> Your idea of having file IDs is in line with SVN issue 1525.
> Keep it. If merge-tracking info gets added sometime later,
> this abstraction will be very helpful.
> Modify the files table: add a column that links to the
> respective parent folder's entry. It is easy to see that this
> is possible and that the revision ranges do fully overlap.
> Add new entries only for those paths that are actually
> mentioned in the log changes. Hence, every copy adds only
> one entry to the files table.
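One way to read that suggestion, as a toy SQLite schema (all table and column names here are my own invention, not the actual schema): each files row links to its parent folder's row, and a copy of /trunk inserts a single row instead of one row per file underneath it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    path      TEXT,
    parent_id INTEGER REFERENCES files(id))""")

con.executemany("INSERT INTO files VALUES (?,?,?)", [
    (1, "/trunk",         None),  # root of the trunk
    (2, "/trunk/src",     1),     # links to its parent folder's entry
    (3, "/trunk/src/a.c", 2),
])

# "svn cp /trunk /branches/B" appears in the log as a single changed path,
# so it adds exactly one row -- not one row per copied file.
con.execute("INSERT INTO files VALUES (4, '/branches/B', NULL)")

count = con.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(count)  # 4 rows, not 3 + N
```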
> To construct the list of all branches, aggregate the lists
> along the parent hierarchy. All parent-only entries in that
> list create a "branch" node and possibly a "deleted" node
> in the graph. All required information is already in the
> file list; no further look-up in the changes list is necessary.
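An in-memory sketch of that aggregation (the sample data and names are made up): each entry stores only the branches recorded directly on it, and a node's full branch list is collected by walking its parent chain.

```python
# Branches recorded directly on each entry (made-up sample data).
direct = {
    "/trunk":         ["trunk"],
    "/trunk/src":     ["branches/B"],  # e.g. this folder itself was copied
    "/trunk/src/a.c": [],
}
parent = {"/trunk/src/a.c": "/trunk/src", "/trunk/src": "/trunk"}

def branch_list(path):
    """Aggregate branch lists along the parent hierarchy."""
    acc = []
    while path is not None:
        acc.extend(direct[path])
        path = parent.get(path)
    return acc

print(branch_list("/trunk/src/a.c"))  # ['branches/B', 'trunk']
```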
> If that is not what you are already doing: Replace the paths
> in the changes table by references to the respective files
> table entry. For file nodes, this makes the retrieval trivial.
> Folder nodes are almost as simple to handle: Introduce a new
> table that represents the transitive inversion of the file
> path -> parent path column added to the files table. It also
> contains an entry for every file table row that points to
> the same row (i.e. is reflexive). The latter also makes it
> possible to run the same query for files and folders alike.
> So, it contains about D times more entries than the files table.
> For every path, all direct and indirect sub-paths that actually
> *do* have changes can be found efficiently. The changes table
> can then be JOINed directly.
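A minimal sketch of that closure table and the JOIN, again with invented names: the descendants table holds one row per (ancestor, node) pair, including the reflexive (node, node) rows, so one query covers files and folders alike.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files       (id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE descendants (ancestor_id INTEGER, file_id INTEGER);
CREATE TABLE changes     (file_id INTEGER, revision INTEGER);
""")
con.executemany("INSERT INTO files VALUES (?,?)",
                [(1, "/trunk"), (2, "/trunk/src"), (3, "/trunk/src/a.c")])
# Reflexive + transitive: every node points at itself and every ancestor.
con.executemany("INSERT INTO descendants VALUES (?,?)",
                [(1, 1), (2, 2), (3, 3),   # reflexive rows
                 (1, 2), (1, 3), (2, 3)])  # transitive rows
con.executemany("INSERT INTO changes VALUES (?,?)", [(3, 7), (2, 9)])

# All changes at or below /trunk (id 1), found with a direct JOIN.
revs = [r for (r,) in con.execute("""
    SELECT c.revision FROM descendants d
    JOIN changes c ON c.file_id = d.file_id
    WHERE d.ancestor_id = 1 ORDER BY c.revision""")]
print(revs)  # [7, 9]
```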
> Based on the numbers collected by the TSVN log cache, you
> can expect the number of entries in the files table to be 1/5th
> of the number of entries in the changes table. The new mapping
> table would have less than twice as many entries as the
> changes table.
> While the result may still have its limitations, it should
> provide decent performance over a wide range of real-world
> projects.
> -- Stefan^2.
> To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
> For additional commands, e-mail: dev-help_at_subclipse.tigris.org
Alberto Gimeno Brieba
President and founder of
Ribe Software S.L.
website: http://gimenete.net
mobile phone: +34 625 24 64 81
Received on 2008-07-22 21:50:46 CEST