Re: [Subclipse-dev] Revision graph and cache implementation

From: <Stefan.Fuhrmann_at_etas.com>
Date: Thu, 31 Jul 2008 17:17:42 +0200

"Alberto Gimeno" <gimenete_at_gmail.com> wrote on 07/31/2008 01:50:29 PM:

> Hi.
>
> I have just commited my work of this week. I have rewritten most of
> the revision graph plugin.
>

Hi Alberto!

Sorry that I didn't post earlier - just too busy ..

> The new cache implementation makes the following steps:
> 1.- store all the log messages without processing them. I'm using batch
updates.
> 2.- given a path and revision number it finds the first revision and
> path of that file (i.e. the root branch)
> 3.- from the root branch it iterates over the next revisions finding
branches.

That is what TSVN does right now. Making step 3 fast enough
is the critical part as well as setting a reasonable target.
Personally, I wouldn't use this feature if it took more than,
say, 10 secs for small to mid-sized projects, maybe a minute
for larger ones.

> The first step now is faster compared to the old implementation,
> because it doesn't makes any calculation and uses batch operations.
>
> The second and third steps are very fast because they don't write to
> the database. These steps just take about 8 seconds in a repository
> with 4.000 revisions in my computer.

Good. That is not too far from my personal (!) expectation
stated above.

Two things will have a major impact on performance: Number
of changes and number of branches / tags. For every change,
you must decide which of the branches (plural) are affected.
I wonder how many of them are in that 4.000 revision repository.

Here some numbers to clarify what I mean with "mid-sized"
project etc. Coming from the C world, I don't know whether
JAVA / Eclipse projects tend to be organized in a different
way that would make them considerably smaller.

Small project (~1k revs):
* kdesvn 1.5k revs, 10k changes, 50 branches /tags

Mid-sized project (~10k revs):
* TortoiseSVN 13.5k revs, 50k changes, 100 branches / tags

Larger project (~100k revs):
* (@ my office) 50k revs, 800k changes, 200 branches / tags
  refactoring and merging results in high changes:revs ratio
* KDE.org >800k revs, 9M changes, ~10k branches / tags
  multiple sub-projects up to 10% of those numbers in size
  (i.e. the relevant part is smaller but must scan all changes)

Decent performance for mid-sized projects would serve many
user's needs.

> I have some things I want to think about.
>
> a) The third step builds a graph in memory. There is no problem with
> memory usage because only the required information to show the graph
> is in memory.

Correct. Even a 50k nodes graph should not take more than
a few 10 megs.

> I could serializate the graph once it has been
> calculated, for a future usage. If in a future the graph needs to be
> updated I can deserialize it and add the new nodes.

You need also to serialize some "temporary" information like
the list of all active branches / tags. That may already
be part of your graph info, of course.

If you get the "update graph" feature working, it would
be a great way to solve virtually all performance and
scalability issues.

> b) With the last cache implementation I just make two reads to the
> dabase, and revisions and change_paths are just stored as they came
> from the repository. So, I'm thinking about storing that information
> in files instead in the database. Maybe a big text file.

Since you don't make use of any indexing features and nor
are you restricting reads to small sub-sets of the db,
it's just degraded into a convenient file interface.

Unless you can prove that the data transfer from the DB to
your algorithm is a bottleneck, stick with the DB for now.
Try to optimize the analysis first.

> But I don't
> know how fast would be reading a text file with 100k lines. Would it
> be much slower than reading a table with 100k records?

It would be faster. File I/O is at ~100M / sec - slower
during the first scan, faster once it is in the cache.
With about 100 bytes / change this gives 1M changes / sec.
Critical part here: individual I/O ops can be slow.
JAVA file streams may or may not provide an efficient
solution.

> I'm thinking of
> that because the queries I need are very simple: "give me all
> information from x revision to y revision". The only problem is
> skipping all lines until the 'x' revision.

Add a binary "index" file. It contains 2-byte integers
giving the length of every line in the text file. Load
that file first and reconstruct the line->offset mapping
(just a simple array).

> I'll need a file with the information in ascending order and other in
> reverse order. Having these two files I could throw away the database.

With the index file, you can comfortably walk your changes
file forward and backward. So, there is no need to duplicate
the changes file. Also, prepending to the reversed-order
file would be expensive.

-- Stefan^2.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe_at_subclipse.tigris.org
For additional commands, e-mail: dev-help_at_subclipse.tigris.org
Received on 2008-07-31 17:17:53 CEST

This message: [ Message body ]
Next message: Stefan.Fuhrmann_at_etas.com: "Re: [Subclipse-dev] Revision graph and cache implementation"
Previous message: Mark Phippard: "Re: [Subclipse-dev] Revision graph and cache implementation"
In reply to: Alberto Gimeno: "Re: [Subclipse-dev] Revision graph and cache implementation"
Next in thread: Stefan.Fuhrmann_at_etas.com: "Re: [Subclipse-dev] Revision graph and cache implementation"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]