[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Speed-up of libsvn_diff using token counts

From: Stefan Sperling <stsp_at_elego.de>
Date: Tue, 24 May 2011 22:46:05 +0200

On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
> Johan / Stefan - any progress on the reviewing part?

I haven't had time to review this, sorry.
It got a bit lost within the recent flurry of activity during and
after the Berlin hackathon.

I would also need some time to look at this in detail since my
working knowledge of the diff code is... suboptimal.

Note also that your change might be deemed a bit too experimental at
this point as the main focus right now is on stabilisation and bug
fixing for the upcoming 1.7 release.

> Only possible error I'm aware of is that I made the indices/counts
> 32 bit signed/unsigned integers, which means it'll run into trouble
> if there are more than ~2 billion different tokens (lines) or more than
> ~4 billion occurrences of the same token. If the library should be
> able to handle such files, they should be made 64 bit (for the
> indices, it should be safe to keep them 32 bit on 32 bit machines,
> as those could never have the memory to run such diffs anyway).

Extending this for 64bit platforms sounds ok to me.

We have similar blue sky limits elsewhere, e.g. svn patch uses a
64bit line counter for files it is patching (on all platforms -- but
it never reads more than one line into memory).

> I have a couple of other changes ready:
> One is a minor simplification, by getting rid of the idx parameter in
> lcs.c - there is a simpler way to accomplish what it's for.
> The other change is a partial implementation of the other possible
> enhancement I mentioned earlier, which takes advantage of unique
> matches to "restart" the lcs algorithm whenever possible, thus
> reducing the effect of the square computational complexity. The
> "partial" part is that this version is only really effective if the two
> files have roughly the same number of (non-unique) lines; it will not
> reduce the (P * length difference) part of the complexity. For good
> cases, though, it can reduce the time required by the lcs algorithm
> by large factors (roughly the number of well-separated sections of
> difference, if they are all about the same size).
> The question is how I should post these changes. The idx change
> is relatively independent of the others; should I post a patch from
> current HEAD revision for that change alone, should I post it in this
> thread or a new one, and should I specify how that patch interacts
> with the other patches (and if so, where)? The other patch is based
> on my original patch; should I post a patch of the additional
> changes or a cumulative patch from current HEAD revision?

I'd say try to break down your changes into a series of self-contained
patches that each solve a single problem, with one thread of each patch.

It's ok for patches you submit to depend on one another. Just note the
dependency in the submission email.

It's also fine for patches to overlap or even conflict if that helps to
keep them in smaller reviewable chunks. As some of them get committed
you can rebase your remaining work against HEAD and repost.
Received on 2011-05-24 22:46:42 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.