Re: philosophical questions about mailer.py

From: C. Michael Pilato <cmpilato_at_collab.net>
Date: Thu, 08 Jan 2009 10:32:58 -0500

Ben Collins-Sussman wrote:
> Howdy folks,
>
> We've been using mailer.py over at Google Code for years now (as if
> the emails we generate weren't obvious proof of that!)
>
> Recently, one of our engineers was scratching his head, wondering why
> commits that modify thousands of files (e.g. an initial import) were
> taking so incredibly long to generate emails for.

I've read this a few times now, and I'm left scratching my head. Here's
what I see:

"We're trying to figure out why an operation that does a ton of
stuff takes a while to describe."

Come again? What part of this was unexpected?

> It turns out that
> mailer.py implements a "ChangeCollector" editor, and then asks
> svn_repos_replay() to replay a single revision into it. This seems
> like a really gigantic amount of work when dealing with large commits:
>
> * brute-force compare revisions N-1 and N
> * rediscover and store every changed path in a huge list
> * iterate over the list, fetching files and generating diffs
>
> For the first step, why aren't we just calling svn_repos_get_logs() to
> instantly fetch the changed-paths list?
>
> On the one hand, I understand that replay() is "perfectly correct", in
> that it avoids some very rare edge-case bugs: (IIRC, get_logs() can't
> tell the difference between a file that was copied, and one which was
> copied and edited in the same revision.) Clearly, the 100% accuracy
> of replay() is necessary for a tool like svnsync. But is this
> accuracy really necessary for generating commit emails? Wouldn't the
> commit emails generated via get_logs() and replay() look the same,
> even in these edge-cases? We're paying a really heavy performance
> price for replay() -- particularly for large commits -- and I wonder
> if there's any benefit at all. My gut is to make mailer.py go back
> to using get_logs() (or at least create an option for it).

Hm. I think you don't understand svn_repos_replay(). It isn't a
brute-force comparison. It fetches the list of changed paths and then,
using path-math, drives an editor to describe the changes. (Read the big
comment atop libsvn_repos/replay.c.)

So, yes, you can make mailer.py just fetch the changed paths. And then you
can make it do all the work of tracking the location changes that occur with
copies and moves (so you can accurately generate diffs). But now you've
practically implemented svn_repos_replay() + the ChangeCollector. :-)
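
To make the "changed paths plus path-math" idea concrete, here is a rough
toy sketch in plain Python. It is not the code in libsvn_repos/replay.c and
it uses no Subversion bindings: CollectingEditor, parent_dirs and
drive_editor are hypothetical names made up for illustration, and copies,
moves and property changes are ignored entirely. It only shows how a flat
changed-paths list can drive an editor without comparing two trees.

```python
# Toy model only: none of these names come from the Subversion bindings.

class CollectingEditor:
    """Hypothetical stand-in for an editor such as the ChangeCollector."""

    def __init__(self):
        self.calls = []    # every editor call, in the order received
        self.changes = {}  # path -> action, as seen by the editor

    def open_directory(self, path):
        self.calls.append(("open_directory", path))

    def close_directory(self, path):
        self.calls.append(("close_directory", path))

    def change_path(self, path, action):
        self.calls.append(("change_path", path, action))
        self.changes[path] = action


def parent_dirs(path):
    """Return the ancestor directories of 'a/b/c' as ['a', 'a/b']."""
    parts = path.split("/")[:-1]
    return ["/".join(parts[:i + 1]) for i in range(len(parts))]


def drive_editor(changed_paths, editor):
    """Drive the editor depth-first from a {path: action} mapping.

    Only the changed paths and their ancestors are visited; no tree
    comparison takes place.
    """
    open_dirs = []  # stack of currently open directories
    for path in sorted(changed_paths, key=lambda p: p.split("/")):
        wanted = parent_dirs(path)
        # Close open directories that are not ancestors of this path.
        while open_dirs and (len(open_dirs) > len(wanted)
                             or open_dirs[-1] != wanted[len(open_dirs) - 1]):
            editor.close_directory(open_dirs.pop())
        # Open whichever ancestors are still missing.
        for directory in wanted[len(open_dirs):]:
            editor.open_directory(directory)
            open_dirs.append(directory)
        editor.change_path(path, changed_paths[path])
    # Close whatever is still open once the list is exhausted.
    while open_dirs:
        editor.close_directory(open_dirs.pop())


if __name__ == "__main__":
    changed = {
        "trunk/src/main.c": "M",
        "trunk/README": "A",
        "branches/old/x.txt": "D",
    }
    editor = CollectingEditor()
    drive_editor(changed, editor)
    for call in editor.calls:
        print(call)
```

The work in this sketch scales with the number of changed paths, not with
the size of the tree. The expensive part of a large commit email is more
likely the per-file fetching and diffing that happens afterwards.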

-- 
C. Michael Pilato <cmpilato_at_collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1011846

Received on 2009-01-08 16:33:20 CET
