On Thu, Jan 8, 2009 at 9:32 AM, C. Michael Pilato <cmpilato_at_collab.net> wrote:
> Ben Collins-Sussman wrote:
>> Howdy folks,
>>
>> We've been using mailer.py over at Google Code for years now (as if
>> the emails we generate weren't obvious proof of that!)
>>
>> Recently, one of our engineers was scratching his head, wondering why
>> commits that modify thousands of files (e.g. an initial import) were
>> taking so incredibly long to generate emails for.
>
> I've read this a few times now, and I'm left scratching my head. Here's
> what I see:
>
> "We're trying to figure out why an operation that does a ton of
> stuff takes a while to describe."
>
> Come again? What part of this was unexpected?
We expected the diff-generation on thousands of files to take a long
time, sure... but we are seeing just *replay* take a ridiculously long
time to run (on the order of tens of minutes), whereas a direct call
to get_logs() only takes maybe 20 seconds. So I assumed that replay
was doing brute-force tree-comparisons!
> Hm. I think you don't understand svn_repos_replay(). It isn't a
> brute-force comparison. It fetches the list of changed paths and then,
> using path-math, drives an editor to describe the changes. (Read the big
> comment atop libsvn_repos/replay.c)
Hm, okay, then we need to investigate why get_logs() is so quick
compared to replay() on big imports -- something is fishy, for sure.
>
> So, yes, you can make mailer.py just fetch the changed paths. And then you
> can make it do all the work of tracking location changes that occur with
> copies and moves (so you can accurately generate diffs). But now you've
> practically implemented svn_repos_replay + the ChangeCollector. :-)
Doh.
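
A toy sketch, in Python, of the idea Mike describes above: start from
only the list of changed paths and use path arithmetic to drive an
editor, so the work scales with the number of changed paths rather than
with the size of the tree. Everything here (RecordingEditor,
drive_editor, the single-letter actions) is hypothetical and invented
for illustration. It is not the code in libsvn_repos/replay.c, and it
ignores copies, moves, and property changes.

```python
import posixpath


class RecordingEditor:
    """Hypothetical stand-in for an editor; it just records each call."""

    def __init__(self):
        self.calls = []

    def open_directory(self, path):
        self.calls.append(("open_dir", path))

    def close_directory(self, path):
        self.calls.append(("close_dir", path))

    def change_file(self, path, action):
        self.calls.append((action, path))


def is_ancestor(ancestor, path):
    """True if 'ancestor' is 'path' itself or a directory above it."""
    return ancestor == "" or path == ancestor or path.startswith(ancestor + "/")


def drive_editor(changed_paths, editor):
    """Drive 'editor' from a {path: action} dict, touching only changed paths.

    Assumes every key is a file path relative to the root ("").
    """
    open_dirs = [""]  # stack of open directories, root at the bottom
    editor.open_directory("")
    # Sorting by path components keeps each subtree contiguous.
    for path in sorted(changed_paths, key=lambda p: p.split("/")):
        parent = posixpath.dirname(path)
        # Close directories that are not ancestors of this path's parent.
        while not is_ancestor(open_dirs[-1], parent):
            editor.close_directory(open_dirs.pop())
        # Open whatever is still missing between the stack top and parent.
        missing = parent[len(open_dirs[-1]):].strip("/")
        for component in filter(None, missing.split("/")):
            open_dirs.append(posixpath.join(open_dirs[-1], component))
            editor.open_directory(open_dirs[-1])
        editor.change_file(path, changed_paths[path])
    # Unwind the stack, closing the root last.
    while open_dirs:
        editor.close_directory(open_dirs.pop())
```

Calling drive_editor({"trunk/a.c": "M", "tags/x": "D"}, editor) visits
only "tags" and "trunk"; unrelated directories are never opened, which
is the sense in which this is not a brute-force tree comparison.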
Received on 2009-01-08 16:39:55 CET