Howdy folks,
We've been using mailer.py over at Google Code for years now (as if
the emails we generate weren't obvious proof of that!)
Recently, one of our engineers was scratching his head, wondering why
commits that modify thousands of files (e.g. an initial import) were
taking so incredibly long to generate emails for. It turns out that
mailer.py implements a "ChangeCollector" editor, and then asks
svn_repos_replay() to replay a single revision into it. This seems
like a really gigantic amount of work when dealing with large commits:
* brute-force compare revisions N-1 and N
* rediscover and store every changed path in a huge list
* iterate over the list, fetching files and generating diffs
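
In bindings terms, those first two steps come down to something like the
sketch below.  (This is just from my memory of the swig Python bindings --
the exact function names and argument lists may differ a bit from what
mailer.py literally does, so treat it as an illustration, not a quote of
the code.)

    import svn.core, svn.fs, svn.delta, svn.repos

    def changes_via_replay(repos_path, rev):
        # Open the repository and grab a root object for the revision.
        repos_ptr = svn.repos.open(repos_path)
        fs_ptr = svn.repos.fs(repos_ptr)
        root = svn.fs.revision_root(fs_ptr, rev)

        # ChangeCollector is a delta editor that records every changed
        # path as the revision is replayed against its predecessor.
        collector = svn.repos.ChangeCollector(fs_ptr, root)
        e_ptr, e_baton = svn.delta.make_editor(collector)
        svn.repos.replay2(root, '', svn.core.SVN_INVALID_REVNUM, 1,
                          e_ptr, e_baton, None, None)

        # dict: path -> ChangedPath, one entry per changed path
        return collector.get_changes()

For a commit touching thousands of paths, that replay walks the whole
delta between N-1 and N just to rebuild a list the repository already
knows.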
For the first step, why aren't we just calling svn_repos_get_logs() to
instantly fetch the changed-paths list?
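
For comparison, the cheap version would look something like this.  (I'm
using the fs-level paths_changed() call here instead of the get_logs()
wrapper just to keep the sketch short -- both of them simply read back
the changed-paths list that the filesystem already stores with the
revision.  Again, binding names are from memory; this is a sketch, not
tested code.)

    import svn.fs, svn.repos

    def changes_via_stored_list(repos_path, rev):
        repos_ptr = svn.repos.open(repos_path)
        fs_ptr = svn.repos.fs(repos_ptr)
        root = svn.fs.revision_root(fs_ptr, rev)

        # The filesystem already records the changed-paths list for each
        # revision; this just reads it back -- no replay, no brute-force
        # comparison of rev N-1 against rev N.
        return svn.fs.paths_changed(root)   # dict: path -> change record
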
On the one hand, I understand that replay() is "perfectly correct", in
that it avoids some very rare edge-case bugs (IIRC, get_logs() can't
tell the difference between a file that was copied and one that was
copied and edited in the same revision).  Clearly, the 100% accuracy
of replay() is necessary for a tool like svnsync.  But is that
accuracy really necessary for generating commit emails?  Wouldn't the
commit emails generated via get_logs() and replay() look the same,
even in these edge cases?  We're paying a really heavy performance
price for replay() -- particularly for large commits -- and I wonder
whether there's any benefit at all.  My gut says we should make
mailer.py go back to using get_logs() (or at least add an option for it).