Hi Subclipse Devs,
a few days ago, Suran Jayathilaka showed up on the TSVN
dev list and asked for more information on how to implement
a revision graph. I put the following together. Please consider
it a prelude to the technical discussion below. Don't start
with the most demanding design possible ;)
Now, I saw that there has already been some discussion on
this list. While it is generally a good idea to look at existing
solutions to a given problem, it may not be such a good idea
to start by looking at TSVN's 'solution' for this particular problem.
Instead, you need some deeper understanding of the problem's
nature and how it restricts the options you have in implementing
a solution. [ BTW: the LogCache code is not 'without comments'
(25% comment lines) - claiming otherwise I will take as a
personal insult ;) ]
I agree with, and trust in, Mark Phippard's assessment that
you will need a log cache sooner or later. There are a number
of questions to answer until you can start designing it:
(1) Is this a generic log cache or revision graph-only?
-> Generic log cache, if you want to 'fix' delays when
showing the log for a given path
-> Revgraph-only, if you want to finish within 1 GSoC
(2) Must the cache data be persistent?
-> yes, if more than 10k revs have to be fetched
-> if 'no', skip other questions
(3) Are you memory, disk size or disk I/O bound?
-> mem / disk size: see discussion below
-> I/O may not be a problem, if the data remains in
memory for a whole 'eclipse session'.
(4) How fast must the analysis process be?
-> >100k revs / sec in TSVN.
~10k revs/sec may be acceptable for subclipse
(5) Do you need to share the cache between multiple
(concurrent) Eclipse instances?
-> not in TSVN 1.5, probably in TSVN 1.6 for performance.
-> update via persistent file should be sufficient in Eclipse
(6) Do you want to share the data with other tools, e.g. TSVN?
-> Tempting ;) But would require coordination *now* as the
data format will be hard to change later. A tentative 'NO'.
On (1). A revision-graph-only cache must only store revision
numbers, paths, modification type and copy-from info. That
leads to a simple path-as-tree-node representation for paths
plus a list-of-lists (list := unbounded, linear container) for the rest.
Storing all log information adds authors, comments, timestamps,
user revprops and finally merge tracking data. Comments alone
account for ~30% of the cached data in TSVN.
Moreover, comments and authors may be changed. Finally, as
you can't fetch 'all and everything' just to show a simple log of
some path, you must allow for 'missing data' plus the knowledge
that these gaps are irrelevant for certain sub-trees of the repository.
On (2). Assuming that there are larger Eclipse projects with more
than 10.000 revisions in their /trunk-/branches-/tags-tree, you will
need persistency. But you may add it later.
On (3). The TSVN benchmark has always been the KDE repository.
Its the largest one out there (revision-wise) and may contain obscure
artifacts from the CVS->SVN conversion. If that would be handled
suitably, everybody out there should be fine. Maybe, subclipse should
choose a similar benchmark (perhaps a mid-sized repo?) plus
desired performance numbers.
svn log -v --xml contains (almost) all data that can be cached while
redundancy is low. Every naive, non-optimized data structure can be
expected to require about that amount of memory and disk space.
However, that data passed the 500MB mark a while ago.
Therefore, TSVN uses optimized data structures in memory: A very
dense string container (UTF8 data plus 13 byte total overhead) is
the base. All strings in that container are *unique* and represented
by their index within the container. It's apparent to see how authors,
path tree nodes, comments as a sequence of words etc. can be
constructed on top of that structure. So, all this information is normalized
and referred to by a 4-byte value (keeps low footprint in 64-bit as well).
While this already reduces the data by 2/3, this would not be enough
to allow for low disk I/O times: TSVN is a short-lived process and must
fetch the whole data everytime a relevant command is called from the
Explorer context menu. Therefore, it stores the containers (most of them
contain only large lists of those strongly-coupled 4-byte values) using
super-fast differential, RLE and data length coding (see stream classes
in LogCache). Finally, a hand-tuned Huffman-encoder eliminates the
This whole process takes about 2G clock ticks to (de-)serialize the first
500k revs of the KDE repository. The file size is about 1/4 of the size
of the in-memory data structures. That's somewhere between WinZip
BTW, TSVN can export a log cache as a set of CSV files. This feature is
somewhat hidden: Settings->LogCache->Export.
To (4). I already wrote something about that in the post linked above.
Depending on the use-case, 10 seconds may be acceptable for the
initial analysis (building an unfiltered graph) of a mid-sized (10k revs)
project and about a minute for really large repositories. User-interactive
filtering and changing the layout is not time critical because it has to deal
with 10 .. 1000 times less data.
As you potentially need all log information (e.g. when showing the graph
for /trunk), no database index will probably be able to reduce the amount
of data to fetched. Therefore, have all of it in memory or be prepared to
read all of it quickly.
'Is pathA TheSameOrParentOf pathB?' is the most time critical operation
during revision graph construction - unless you come up with a radically
different approach. For large repositories and depending on implementation
details, it may be called million of times.
Making paths very small, unique objects organized in trees keeps all
relevant data in cache. Reducing such objects to half-ordered numbers
gets this operation down to the 100ns range.
To (5). This is only about losing updates when not having physically
shared run-time data structures: Instance 1 shows some graph and needs
to update the cache (fetch latest data). Instance 2 may show a graph for
a path in the same repository. If instance 2 already had read the cache
when instance 1 updated its, instance 2 will fetch the *new* data a second
time. When both write their caches back to disk, the last one wins.
I plan to enhance TSVN in 1.6 by having TSVNCache not only crawling
the wc for status but also to update the log caches. Moreover, it shall
provided them to the TSVN command processes via shared memory.
That would eliminate the (de-)serialization delay (~1 sec) and reduce
To (6). My default would be: let everyone use its own proprietary format.
As you can see from my explanations, the run-time data model as well
as its representation on disk are non-trivial, i.e. expensive to
re-implement. Also, TSVN would always be 'ahead' with its code - at least
until after 1.6. That results in version dependencies etc.
Sigh. That's already yet another lengthy post :/
Please don't let the TSVN implementation scare you away. And, honestly,
I don't know how far you can get with alternative approaches. They may
turn out to work o.k.
From my point of view, the key issue is effective (not just efficient) data
reduction. E.g.: If you go for a 'revision graph only' approach and you don't
want to show simple 'modified' nodes, just remove all 'M' changes and
then empty revisions. That may save you one order of magniture.
To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
For additional commands, e-mail: dev-help_at_subclipse.tigris.org
Received on 2008-03-30 16:05:37 CEST