I have read your mail several times. I don't think I explained properly
how my implementation works, so I've written a document explaining
step by step how the cache mechanism is currently implemented,
and another document explaining an alternative algorithm.
I would like you to read them.
Both documents are in HTML format. I look forward to your responses.
On Wed, Jul 9, 2008 at 4:16 PM, <Stefan.Fuhrmann_at_etas.com> wrote:
> On Tue, 8 Jul 2008 19:28:23 +0200, Alberto Gimeno <gimenete_at_gmail.com> wrote:
>> I also would like to know your opinions about the "possible
>> alternatives" I talked about in this post:
>> In that post I also talk about the cache general design.
> Hi Alberto,
> I think you are now ready for some constructive criticism ;)
> You just learned that efficient algorithms on large, attributed
> graphs are anything but simple. It's an active field of
> scientific research for a reason. Not that they are rocket
> science, but those algorithms require careful planning and
> a deep understanding of the nature of the data they describe.
> Furthermore, using an RDBMS to _process_ hierarchical data is
> a poor choice. But it is not the root of the problem. However,
> you still need some "data dump", and for that an RDBMS is as
> good a choice as any.
> So much for the criticism; now to the constructive part ;)
> As I understand it, you are storing for all files and folders
> the list of branches (plus revision ranges) that they are
> part of. Effectively, you pre-build most of the revision graph
> for every single node in the repository file system (minus
> those that are just copies of other nodes).
> Let N be the number of files in your project / component
> and R the number of revisions. Then there will be about R/B
> branches and tags, with B being a project-specific branching
> factor, e.g. "1 branch or tag every 100 revisions". The average
> folder depth D is proportional to log N. Finally, C is
> the average number of changes per revision.
> For a medium-sized project this is N=1000, R=10000, B=100,
> C=10 and D=5. The problem is that a new branch (svn cp /trunk
> /branches/B) creates N new entries in your files table; total
> N*R/B = 100k entries. As R and N are roughly proportional,
> this results in quadratic dependency on the project "size".
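To make that arithmetic concrete, here is the cost model as a few lines of Python (the figures are the example numbers from the paragraph above, not measurements):

```python
# Cost model for the pre-built files table, using the example figures
# N=1000 files, R=10000 revisions, B=100 revisions per branch/tag.
N, R, B = 1000, 10000, 100

branches = R // B          # ~100 branches and tags over the project's life
copy_rows = N * branches   # each "svn cp /trunk /branches/X" adds N rows

print(branches, copy_rows)  # 100 100000
```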
> Ultimately, your cache update code will go berserk for larger projects.
> However, adding 1000 new entries every 100 revisions should
> take less than 1 second. *Finding* all copy sources can be
> an expensive operation if there is no index for the path
> string. E.g. you are looking for "/trunk" <= x < "/trunk\x01".
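A concrete sketch of such an indexed range scan, using SQLite and invented table/column names (here the upper bound is prefix + "0", the character right after "/", a slightly tighter variant of the same trick):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT)")
con.execute("CREATE INDEX idx_files_path ON files (path)")  # enables a range scan
con.executemany("INSERT INTO files VALUES (?)",
                [("/trunk",), ("/trunk/src/a.c",), ("/branches/B/a.c",)])

# Everything at or below /trunk: an exact match plus the half-open
# range ["/trunk/", "/trunk0"), since "0" is the character after "/".
prefix = "/trunk"
rows = con.execute(
    "SELECT path FROM files WHERE path = ? OR (path >= ? AND path < ?)",
    (prefix, prefix + "/", prefix + "0")).fetchall()
print(sorted(r[0] for r in rows))  # ['/trunk', '/trunk/src/a.c']
```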
> Another weak spot is the construction of the graph itself.
> To find the changes for a given file ID, you must read all
> its entries from the files table (approx. R/B entries).
> For every such entry, the list of *relevant* revisions has
> to be fetched from the revisions and/or change_paths table.
> One possible solution is a (file_entry_id -> change_id)
> mapping table.
> Building this table is expensive: there is at least one
> entry per file_table_id (its "ADD" revision), plus there
> is one entry per change of *any* sub-path. That's R*C*D
> (500k) entries, or about 600k total. What makes things worse
> is that a naive implementation will walk up the path for
> every single change and try to find the corresponding
> file_entry_id. That's C*D queries per revision!
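The walk-up described here can be sketched as follows (the path handling is generic; the cost figures merely restate the example numbers above):

```python
def ancestors(path):
    """Yield the path itself and every parent folder, root excluded."""
    while path and path != "/":
        yield path
        path = path.rsplit("/", 1)[0] or "/"

# One change touches D entries on its way up the tree:
chain = list(ancestors("/trunk/src/io/file.c"))
print(chain)  # ['/trunk/src/io/file.c', '/trunk/src/io', '/trunk/src', '/trunk']

# With the example figures, the naive scheme issues C*D queries per
# revision and the mapping table grows to roughly R*C*D entries.
R, C, D = 10000, 10, 5
print(C * D, R * C * D)  # 50 500000
```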
> BTW, how do you handle copy-from info?
> What you need to optimize for is to find all relevant data
> to draw a *single* revision graph. That is, fetch the data
> from the repository and transform it into some easy-to-access
> representation but do *not* add significant redundancy.
> Your idea of having file IDs is in line with SVN issue 1525.
> Keep it. If merge-tracking info gets added sometime later,
> this abstraction will be very helpful.
> Modify the files table: add a column that links to the
> respective parent folder's entry. It is easy to see that this
> is possible and that the revision ranges do fully overlap.
> Add new entries only for those paths that are actually
> mentioned in the log changes. Hence, every copy adds only
> one entry to the files table.
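One way to read that suggestion, as a toy SQLite schema (all table and column names here are my own invention, not the actual schema): each files row links to its parent folder's row, and a copy of /trunk inserts a single row instead of one row per file underneath it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    path      TEXT,
    parent_id INTEGER REFERENCES files(id))""")

con.executemany("INSERT INTO files VALUES (?,?,?)", [
    (1, "/trunk",         None),  # root of the trunk
    (2, "/trunk/src",     1),     # links to its parent folder's entry
    (3, "/trunk/src/a.c", 2),
])

# "svn cp /trunk /branches/B" appears in the log as a single changed path,
# so it adds exactly one row -- not one row per copied file.
con.execute("INSERT INTO files VALUES (4, '/branches/B', NULL)")

count = con.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(count)  # 4 rows, not 3 + N
```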
> To construct the list of all branches, aggregate the lists
> along the parent hierarchy. All parent-only entries in that
> list create a "branch" node and possibly a "deleted" node
> in the graph. All required information is already in the
> file list; no further look-up in the changes list is necessary.
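An in-memory sketch of that aggregation (the sample data and names are made up): each entry stores only the branches recorded directly on it, and a node's full branch list is collected by walking its parent chain.

```python
# Branches recorded directly on each entry (made-up sample data).
direct = {
    "/trunk":         ["trunk"],
    "/trunk/src":     ["branches/B"],  # e.g. this folder itself was copied
    "/trunk/src/a.c": [],
}
parent = {"/trunk/src/a.c": "/trunk/src", "/trunk/src": "/trunk"}

def branch_list(path):
    """Aggregate branch lists along the parent hierarchy."""
    acc = []
    while path is not None:
        acc.extend(direct[path])
        path = parent.get(path)
    return acc

print(branch_list("/trunk/src/a.c"))  # ['branches/B', 'trunk']
```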
> If that is not what you are already doing: Replace the paths
> in the changes table by references to the respective files
> table entry. For file nodes, this makes the retrieval trivial.
> Folder nodes are almost as simple to handle: Introduce a new
> table that represents the transitive inversion of the file
> path -> parent path column added to the files table. It also
> contains an entry for every file table row that points to
> the same row (i.e. is reflexive). The latter also makes it
> possible to run the same query for files and folders alike.
> So, it contains about D times more entries than the files table.
> For every path, all direct and indirect sub-paths that actually
> *do* have changes can be found efficiently. The changes table
> can then be JOINed directly.
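A minimal sketch of that closure table and the JOIN, again with invented names: the descendants table holds one row per (ancestor, node) pair, including the reflexive (node, node) rows, so one query covers files and folders alike.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files       (id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE descendants (ancestor_id INTEGER, file_id INTEGER);
CREATE TABLE changes     (file_id INTEGER, revision INTEGER);
""")
con.executemany("INSERT INTO files VALUES (?,?)",
                [(1, "/trunk"), (2, "/trunk/src"), (3, "/trunk/src/a.c")])
# Reflexive + transitive: every node points at itself and every ancestor.
con.executemany("INSERT INTO descendants VALUES (?,?)",
                [(1, 1), (2, 2), (3, 3),   # reflexive rows
                 (1, 2), (1, 3), (2, 3)])  # transitive rows
con.executemany("INSERT INTO changes VALUES (?,?)", [(3, 7), (2, 9)])

# All changes at or below /trunk (id 1), found with a direct JOIN.
revs = [r for (r,) in con.execute("""
    SELECT c.revision FROM descendants d
    JOIN changes c ON c.file_id = d.file_id
    WHERE d.ancestor_id = 1 ORDER BY c.revision""")]
print(revs)  # [7, 9]
```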
> Based on the numbers collected by the TSVN log cache, you
> can expect the number of entries in the files table to be 1/5th
> of the number of entries in the changes table. The new mapping
> table would have less than twice as many entries as the
> changes table.
> While the result may still have its limitations, it should
> provide decent performance over a wide range of real-world
> projects.
> -- Stefan^2.
> To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
> For additional commands, e-mail: dev-help_at_subclipse.tigris.org
Alberto Gimeno Brieba
President and founder of
Ribe Software S.L.
website: http://gimenete.net
mobile phone: +34 625 24 64 81
Received on 2008-07-22 21:50:46 CEST