[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] prevent diff from extra/waste processing

From: Peter Lundblad <plundblad_at_google.com>
Date: 2007-03-28 22:49:18 CEST

Charles Acknin writes:
> On 3/28/07, Peter Lundblad <plundblad@google.com> wrote:
> > Charles Acknin writes:
> > > Daniel and I had a little talk on #svn-dev about something that seemed
> > > a bit weird in libsvn_wc/diff.c I noticed as running within gdb while
> > > writing my GSoC proposal: for single-file diff on WC/WC (e.g. svn
> > > diff foo), directory_elements_diff() would keep looping (line 692, rev
> > > 24199) until all files of foo's directory have been processed instead
> > > of breaking when foo is done. Those few lines will ensure we 'break'
> > > when files passed as argument are processed.
> > >
> > If I understand what you're saying correctly, this is not a correctness
> > bug, but just a minor performance glitch?
> >
>
> Well, about the nature of that 'thing', Daniel called it a bug while I
> never mentioned it as such. I might have not been clear : )
>
Hmmm... So, the bug is "we're taking a few more milliseconds than we
need to in some rare cases"... We're still ignoring the
rest of the entries, so it is not a correctness problem.

> > > @@ -697,6 +697,7 @@
> > > const svn_wc_entry_t *entry;
> > > struct dir_baton *subdir_baton;
> > > const char *name, *path;
> > > + svn_boolean_t file_diffed = FALSE;
> > >
> > > svn_pool_clear(subpool);
> > >
> > > @@ -726,6 +727,7 @@
> > > {
> > > case svn_node_file:
> > > SVN_ERR(file_diff(dir_baton, path, entry, subpool));
> > > + file_diffed = TRUE;
> > > break;
> > >
> >
> > It looks like this *always* breaks out of the loop after the first file.
> > Is this correct when the WC have more than one changed file/directory?
> >
>
> Yes, actually directory_elements_diff() gets called for each file
> that's passed on the CLI. The loop it gets called from is in
> subversion/svn/diff-cmd.c:245
> for (i = 0; i < targets->nelts; ++i)
>
> I mean directory_elements_diff() seems like a single-target-function.
>
Sorry, but you're mistaken. Look at the docstring for the
function and you'll discover that this function is used for a whole
subtree. Try doing "svn diff" in a WC where you have more than one
modified file, or files modified in a subdirectory.
(I'm pretty sure you'll see the problem if you run your modified
client in your working copy containing this patch...)

The libsvn_wc?diff.c file handles the case where you either diff WC -> WC
(comparing the BASE revision and WORKING for ecample) or where you compare repos->WC.
In the latter case, the repository will send diffs for parts of
the diff target that were actually changed between the diffed revisions
(one of which corresponds to the BASE revision(s) in the WC). The rest
is handled by this function.

Then, a layer higher up (no, actually two: libsvn_client and the cmdline
code), this whole diff thing is called multiple
times if there are multiple targets,
but that's a different story.

> > > case svn_node_dir:
> > > @@ -756,6 +758,9 @@
> > > default:
> > > break;
> > > }
> > > +
> > > + if (file_diffed)
> > > + break;
> > > }
> > >
> > > svn_pool_destroy(subpool);
> > >
> >
> > I guess the useless work you're after happens because the anchor and target
> > are different and we're still looping over all entries in the anchor
> > directory.
>
> The useless work happens because we keep going as the file
> directory_elements_diff() was called for is processed, when returning
> from file_diff(). If the above assumption I raised is right, why not
> break when the -- only -- target is file_diff()'ed?
>
That's not what your patch is doing. It breaks when
it has diffed the first file, even if it is not the target.
But as I said, there's nothing wrong with avoiding looping
over eerything if it doesn't complicate the code.
However, we will always have read the whole
entries file anyway, so it is really no big deal.

> > If you think that's a problem, maybe we should just get the entry
> > we want in this case. If you can find a way to do that without extra code
> > duplication added, then it might be worth it, but I doubt you can measure
> > any performance noticable difference in a reasonably sized directory.
>
> Yes it would be right to go straight with the target. And again, if
> this function is single-target, I can't see any reason to loop over
> all the 'entries' until it matches dir_baton->edit_baton->target. Or
> have I missed something?
>
> > Hint: Running "make check" before submitting a patch helps make sure it
> > doesn't break anything;)
>
> I did run the first 20 tests before sending as I saw #3
> 'diff-diff3-test' was called and returned on success but I didn't wait

That is a unittest for the internal diff library.

> until #30 that I'm now just realising it failed. Are we supposed to
> run them all? It takes a bit long... : /
>

Yes, we are, because we want to avoid regressions. You don't
have to sit and wait while the tests are running, though
I agree that they take some time:-(
So, please always make sure to run the complete
test suite locally before submitting a patch.
The default is to run using file:// and using fsfs. If you do work
in bdb, or in one of the other RA layers, make sure
to run the test for those as well.
This doesn't avoid regressions completely because of
platform/dependency differences etc., but helps a lot.

> I'm going to start investigating from the 9 places it failed as in the
> log. Or is it not worth a look at such a small performance issue?
>

In my experience, just looking at the code and trying to guess performance
problem and then starting to "fix" them without actually analyzing
the performance costs (profiling) is nearly always
a waste of time. In this case, we're looping over an in-memory
cache of the entries in one directory, and the wasteful work
is only done if we have a target that's different
from the anchor and only in the root of the operation.
I haven't measured, but I'm quite sure you can have
several thousands of entries before you'll see even a small
increase in time usage. We have *much* bigger performance problems, both in
the WC library and in other places to worry about. If you're interested in
such work, you're more than welcome and I'll try to give you feedback
if you want.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Wed Mar 28 22:49:45 2007

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.