
Statsdlg - first patch: data gathering upgrade

From: Andreas Nicolai <Andreas.Nicolai_at_gmx.net>
Date: 2007-10-07 17:37:36 CEST

Hi there,

While I'm hacking away on the stats dialog, I created (and attached) the
first patch, which reworks the stats data gathering.

The patch only affects the files StatGraphDlg.h and StatGraphDlg.cpp and
was created against revision 10908.

Here's a brief review of the code changes:

1. week count:
old: The previous implementation took the first date and the last date in
the array as the time span.
new: The new implementation searches for the min and max dates, then
aligns the earliest date with the beginning of the corresponding week and
stores that date in a new member variable m_minDate.
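
To illustrate the alignment step, here is a minimal sketch and not the code
from the patch. It assumes dates are time_t values in seconds, that a week
starts on Monday 00:00 UTC, and that dates are not negative. The helper
name AlignToWeekStart and the signature of GetWeekCount shown here are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <ctime>
#include <vector>

const std::time_t kSecondsPerDay = 24 * 60 * 60;
const std::time_t kSecondsPerWeek = 7 * kSecondsPerDay;

// Round t down to Monday 00:00 UTC of its week.
// Assumption: weeks start on Monday; the patch may use another convention.
std::time_t AlignToWeekStart(std::time_t t)
{
    assert(t >= 0);
    std::time_t days = t / kSecondsPerDay;
    // 1970-01-01 was a Thursday, so with Monday == 0 the weekday is:
    std::time_t weekday = (days + 3) % 7;
    return (days - weekday) * kSecondsPerDay;
}

// Scan for the min and max dates (the input need not be sorted), store the
// aligned earliest date in minDate and return the number of weeks covered.
int GetWeekCount(const std::vector<std::time_t>& dates, std::time_t& minDate)
{
    if (dates.empty())
    {
        minDate = 0;
        return 0;
    }
    std::time_t lo = *std::min_element(dates.begin(), dates.end());
    std::time_t hi = *std::max_element(dates.begin(), dates.end());
    minDate = AlignToWeekStart(lo);
    return static_cast<int>((hi - minDate) / kSecondsPerWeek) + 1;
}
```

With this convention, a range that runs from a Wednesday to the following
Tuesday spans six days but is counted as two weeks.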

2. data gathering:
old: The previous implementation executed a lot of binary searches (using
lower_bound) for _each_ commit. This caused the noticeable delay when
opening the stats dialog for a large number of revisions (e.g. try "Show
all" in the TSVN repository and open the stats dialog). Also, recurring
weeks due to later import of revision histories were treated as new weeks,
which gave incorrect stats.

new: The new implementation loops over all weeks in the intervals
determined in GetWeekCount() and stores for each week/interval the number
of commits and file changes per author. It also keeps track of the total
commit count and total file change count. At the same time the commits for
each author are stored in a mapping. Then a list of author names is
created and sorted by commit count. For that purpose I wrote a binary
predicate class, MoreCommitsThan, that compares authors based on their
commit count. As a result, all the sorting during the data gathering is no
longer necessary and the time-expensive CountCommits() function can be
removed altogether. Finally, the required stats are obtained for the
min/max author (first and last in the sorted list) and the dialog's
statistics can be shown.
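
The gathering step could look roughly like the sketch below. It is an
illustration under assumptions, not the patch itself: the Commit and Stats
structs, the GatherStats function and the tie-break by name are
hypothetical, and the week number is computed directly from each commit
date relative to the aligned minimum date. Only the MoreCommitsThan name
and the map layout come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <ctime>
#include <map>
#include <string>
#include <vector>

// Hypothetical log entry: commit date, author and number of changed files.
struct Commit { std::time_t date; std::string author; long filesChanged; };

typedef std::map<std::string, long> AuthorCounts;
typedef std::map<int, AuthorCounts> WeeklyCounts;

// Binary predicate: true if author a has more commits than author b.
// Ties are broken by name so the resulting order is deterministic.
class MoreCommitsThan
{
public:
    explicit MoreCommitsThan(const AuthorCounts& counts) : m_counts(counts) {}
    bool operator()(const std::string& a, const std::string& b) const
    {
        long ca = Count(a), cb = Count(b);
        return ca != cb ? ca > cb : a < b;
    }
private:
    long Count(const std::string& name) const
    {
        AuthorCounts::const_iterator it = m_counts.find(name);
        return it == m_counts.end() ? 0 : it->second;
    }
    const AuthorCounts& m_counts;
};

struct Stats
{
    WeeklyCounts commitsPerAuthorAndWeek;
    WeeklyCounts filesPerAuthorAndWeek;
    AuthorCounts commitsPerAuthor;
    std::vector<std::string> authors; // sorted, most commits first
    long totalCommits;
    long totalFileChanges;
};

// One pass over the commits, then a single sort of the author list.
Stats GatherStats(const std::vector<Commit>& commits, std::time_t minDate)
{
    const std::time_t secondsPerWeek = 7 * 24 * 60 * 60;
    Stats s;
    s.totalCommits = 0;
    s.totalFileChanges = 0;
    for (std::vector<Commit>::const_iterator c = commits.begin();
         c != commits.end(); ++c)
    {
        assert(c->date >= minDate); // minDate is the aligned earliest date
        int week = static_cast<int>((c->date - minDate) / secondsPerWeek);
        ++s.commitsPerAuthorAndWeek[week][c->author];
        s.filesPerAuthorAndWeek[week][c->author] += c->filesChanged;
        ++s.commitsPerAuthor[c->author];
        ++s.totalCommits;
        s.totalFileChanges += c->filesChanged;
    }
    for (AuthorCounts::const_iterator a = s.commitsPerAuthor.begin();
         a != s.commitsPerAuthor.end(); ++a)
        s.authors.push_back(a->first);
    std::sort(s.authors.begin(), s.authors.end(),
              MoreCommitsThan(s.commitsPerAuthor));
    return s;
}
```

After the call, authors.front() and authors.back() are the authors with the
most and the fewest commits, and no per-commit search is needed.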

I documented the new code in fair detail, so it shouldn't be too hard to
follow (I hope).

Just one thing I noted... Because of the alignment to the beginning/end of
the week, revision intervals that start in the middle of a week and end in
the middle of a week may be reported as one week longer than the time span
actually is. However, if I don't align the interval with the start of the
week, the weekly interval may actually start on a Wednesday and last until
next week's Tuesday. For a different revision range (maybe including the
previous 200 revs) the interval may run from Friday to next week's
Thursday. This, however, results in different min/max commit and file
change counts. So I guess I can't get around the aligning part, and for
the improved data gathering algorithm I need m_minDate anyway.

Design questions:
1. The data structures created in ShowStats() need to be used in the other
statistics functions as well. Re-gathering the data would be a waste of
time, so I would propose making these variables members of the dialog that
get populated when the dialog is first shown. All other statistics views
can then use the information and obtain/calculate other specific data.
Would it make sense to have these mappings and lists as member variables?

2. The maps for the commit and file change data are currently of type
map<int, map<stdstring, LONG> >, so that data can be accessed by:

LONG commits = commitsPerAuthorAndWeek[week_nr][author_name];

However, the memory needed for storing the data could be reduced if the
authors were identified by a number instead of a string, with the
name/number connection made via yet another mapping. The statement above
would then look like:

LONG commits = commitsPerAuthorAndWeek[week_nr][authorNumber[author_name]];

Since the memory footprint of the statistics dialog is rather low compared
to the log dialog, I would probably postpone this upgrade until later.
Also, it would hurt the readability of the code, so I'd prefer the way the
data is stored now. What are your thoughts on this?
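
For reference, the number-based variant could be sketched as follows. The
AuthorIndex class and its methods are hypothetical names used only for
illustration, and plain std::string and long stand in for stdstring and
LONG.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps author names to small integer ids and back.
class AuthorIndex
{
public:
    // Returns the id of name, assigning the next free id on first use.
    int IdOf(const std::string& name)
    {
        std::map<std::string, int>::const_iterator it = m_ids.find(name);
        if (it != m_ids.end())
            return it->second;
        int id = static_cast<int>(m_names.size());
        m_ids[name] = id;
        m_names.push_back(name);
        return id;
    }

    // Returns the name stored for a previously assigned id.
    const std::string& NameOf(int id) const
    {
        assert(id >= 0 && id < static_cast<int>(m_names.size()));
        return m_names[id];
    }

private:
    std::map<std::string, int> m_ids;
    std::vector<std::string> m_names;
};

// Week number -> (author id -> count); each name is stored only once,
// inside the AuthorIndex, instead of once per week.
typedef std::map<int, std::map<int, long> > WeeklyCountsById;
```

Each lookup then goes through the index first, e.g.
commitsPerAuthorAndWeek[week_nr][index.IdOf(author_name)], which is the
extra indirection that costs some readability.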


Andreas Nicolai                         anicolai@syr.edu
PhD Candidate, M.A.M.E                  (315) 443-2641
Syracuse University
151 Link Hall
Syracuse, NY, 13244


Received on Sun Oct 7 17:37:55 2007

This is an archived mail posted to the TortoiseSVN Dev mailing list.
