Suran Jayathilaka <suranjay_at_gmail.com> wrote on 25.03.2008 13:04:28:
> Hi all,
>
> I am a student intending to take part in GSoC 2008. I am interested in
> the subclipse project of adding Revision Graph support for subclipse.
Good choice ;) It's an exciting and challenging task.
Exciting because it is very visual, but also demanding
because limiting the computational complexity can become
nontrivial.
Anyway, while you won't be able to apply particular
algorithms from your 'Algorithms on Graphs' lectures,
you will soon develop some grasp of which approaches
are efficient and which ones are not.
> A major issue presenting itself in this regards is the task of
> retrieving the necessary information from the repository for building
> the revision graph. Is there any documentation available for the
> algorithm used to achieve this? Any feedback would be greatly
> appreciated.
Stefan (Küng) already provided you with a basic sketch
of what has to be done.
Since I'm the person who implemented the current
revision graph, I would still like to share my experiences
with you so that you won't run into the same pitfalls. As
Stefan already wrote, building a revision graph like in TSVN
requires the following two steps:
(1) starting from a given pair of revision and path / URL,
you follow its history in an 'svn log -v'-like manner
until you discover the earliest ancestor (see the sketch below).
(2) from there, build the graph by following the history
forward. Every copy forks a new path in the graph at
the copy source node and every deletion ends a path in
the graph.
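To make step (1) a little more concrete, here is a rough Java
sketch. LogSource and Ancestor are made-up placeholders for
whatever log access the implementation ends up with, not an
existing Subclipse or SVNKit API:

    // Hypothetical log access used only for this sketch.
    interface LogSource {
        /** Revision <= rev in which 'path' was last added or
            copied, or -1 if that information is not available. */
        long findAddOrCopy(String path, long rev);
        /** Copy source of that change, or null for a plain add. */
        Ancestor copySource(String path, long rev);
    }

    class Ancestor {
        final String path;
        final long revision;
        Ancestor(String path, long revision) {
            this.path = path;
            this.revision = revision;
        }
    }

    class AncestorFinder {
        /** Follows 'copy-from' links backwards until no earlier
            ancestor exists. */
        static Ancestor findEarliestAncestor(LogSource log,
                                             String path, long rev) {
            while (true) {
                long created = log.findAddOrCopy(path, rev);
                Ancestor source = (created < 0)
                                ? null : log.copySource(path, created);
                if (source == null)
                    return new Ancestor(path, created < 0 ? rev : created);
                path = source.path;     // jump to the copy source ...
                rev = source.revision;  // ... and keep following it
            }
        }
    }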
The main problem is that there is no SVN-API for step (2).
So what you have to do is fetch the *whole* history
from HEAD down to the revision of the earliest ancestor
for that part of the repository that contains all direct
and indirect copies of this ancestor.
For TSVN, this means getting the log for the repository
root, i.e. the whole repository. For subclipse that might
be limited to a particular module. Go to the subclipse
list and ask whether all branches, renamed copies (refactoring!)
and tags are 'confined' within some package path.
Depending on protocol and server load, 'svn log -v' will
give you 10 to 1000 revisions per second. A mid-sized project
like TSVN has about 10000 revisions. You may use the log
caching statistics in TSVN 1.5 to get some numbers, e.g.
the total number of paths, average changes per revision, etc.
You will also need a collection of data structures that
represent the log content. A hierarchy similar to the
output of 'svn log -v --xml' is a good starting point.
Don't go for off-line storage, performance or size optimization
at this point. Those can be added later. The TSVN history
may be misleading in that respect.
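As a starting point, the in-memory model can be as simple as
this Java sketch; the class and field names just mirror the
'svn log -v --xml' hierarchy and are not an existing API:

    import java.util.Date;
    import java.util.List;

    // One object per revision, as reported by 'svn log -v'.
    class LogEntry {
        long revision;
        String author;
        Date date;
        String message;
        List<ChangedPath> changedPaths;  // the '-v' part
    }

    // One object per changed path within a revision.
    class ChangedPath {
        String path;            // e.g. "/tags/1.0/src/Foo.java"
        char action;            // 'A', 'M', 'D' or 'R'
        String copyFromPath;    // non-null only if this is a copy
        long copyFromRevision;  // valid only if copyFromPath != null
    }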
A minor detail: since the log contains only 'copy-from'
info, you have to derive the 'copy-to' relation in
a second pass. You will need that information when building
the graph 'against the history stream'.
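Using the LogEntry / ChangedPath sketch from above, that
second pass could look roughly like this (again only an
illustration, not existing Subclipse code):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Inverts the 'copy-from' info of the whole history into a
    // 'copy-to' lookup table: source path -> copies made from it.
    class CopyToIndex {
        static class CopyTarget {
            final String toPath;      // where the copy ended up
            final long fromRevision;  // source revision that was copied
            final long inRevision;    // revision that made the copy
            CopyTarget(String toPath, long fromRevision, long inRevision) {
                this.toPath = toPath;
                this.fromRevision = fromRevision;
                this.inRevision = inRevision;
            }
        }

        static Map<String, List<CopyTarget>> build(List<LogEntry> history) {
            Map<String, List<CopyTarget>> copyTo = new HashMap<>();
            for (LogEntry entry : history)
                for (ChangedPath change : entry.changedPaths)
                    if (change.copyFromPath != null)
                        copyTo.computeIfAbsent(change.copyFromPath,
                                               p -> new ArrayList<>())
                              .add(new CopyTarget(change.path,
                                                  change.copyFromRevision,
                                                  entry.revision));
            return copyTo;
        }
    }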
Building the revision graph itself is not too complex:
Just scan the history upwards for relevant entries and
add the respective nodes to the graph.
The vicious detail here is the word 'relevant'. For every
copy, there is a path in the graph and a given revision may be
relevant to it. The most common type of copy is a tag.
Depending on the development process, there may be one tag
per day or every other week. As a result, you may end up
with just 10k revisions to be matched against 1k paths.
This observation is crucial as it indicates quadratic growth
on already quite large numbers. To mitigate that, TSVN
organizes the search paths in a tree and matches the
revision changes against its nodes. For instance, the /tags
node containing the most paths can be skipped entirely
for most revisions. Now you are down to O(N log N) or better.
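The structure behind that can be a simple trie over path
segments, roughly like the following sketch (names are
illustrative only):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Search paths are stored in a trie keyed by path segments.
    // A changed path from the log only visits the sub-tree it
    // actually descends into, so e.g. the whole /tags sub-tree
    // is skipped for a revision that only touches /trunk.
    class PathTree {
        private final Map<String, PathTree> children = new HashMap<>();
        private final List<String> searchPaths = new ArrayList<>();

        void insert(String searchPath) {
            PathTree node = this;
            for (String segment : searchPath.substring(1).split("/"))
                node = node.children.computeIfAbsent(segment,
                                                     s -> new PathTree());
            node.searchPaths.add(searchPath);
        }

        /** Collects all search paths that are prefixes of 'changedPath'. */
        List<String> match(String changedPath) {
            List<String> hits = new ArrayList<>(searchPaths);
            PathTree node = this;
            for (String segment : changedPath.substring(1).split("/")) {
                node = node.children.get(segment);
                if (node == null)
                    break;          // nothing relevant down this branch
                hits.addAll(node.searchPaths);
            }
            return hits;
        }
    }

Calling match() for every changed path of a revision then
replaces the naive comparison against all known paths.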
Once you have a basic graph builder in place, you will
notice that it has a terrible signal-to-noise ratio and
you will try to remove nonessential information (nodes,
that is) from it.
Go for it; it's actually fun. But implement your reductions
as a second pass over the complete revision graph.
TSVN was / is trying to do this in one go but that
made the code unnecessarily complex. On the other hand,
the filter and layout code turned out to be very fast
so that there is no point in optimizing the graph size
during its construction.
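If it helps, here is a very rough idea of such a second pass.
The Node type and its fields are made up for this sketch, and
what counts as 'essential' is of course up to you:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // A node is dropped when it is a plain modification with
    // exactly one predecessor and one successor on the same path,
    // i.e. it carries no copy, branch, tag or delete information.
    class Node {
        String path;
        long revision;
        Node previous;                              // older node on this path
        final List<Node> next = new ArrayList<>();  // newer nodes, copy targets
        boolean isCopySource, isCopyTarget, isDeletion;

        boolean isEssential() {
            return isCopySource || isCopyTarget || isDeletion
                || previous == null || next.size() != 1;
        }
    }

    class GraphReducer {
        static void reduce(List<Node> nodes) {
            for (Iterator<Node> it = nodes.iterator(); it.hasNext(); ) {
                Node node = it.next();
                if (node.isEssential())
                    continue;
                Node successor = node.next.get(0);
                node.previous.next.remove(node);    // splice the node out
                node.previous.next.add(successor);
                successor.previous = node.previous;
                it.remove();
            }
        }
    }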
Finally, there are many ways to work around caveats and
edge-cases as well as to optimize performance. You should
*not* start by reading the TSVN code. It has been designed
for maximum performance on very large graphs and is currently
being re-factored. Start simple and focus on 'sufficient'
(ask the subclipse people what they expect) performance.
If you stumble across an issue with your implementation,
please feel free to ask (probably me) here on the list.
-- Stefan^2.