[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

From: Paul Burba <ptburba_at_gmail.com>
Date: Tue, 20 Apr 2010 18:11:47 -0400

Hi Mike and Julian,

Sorry for the delayed response; been working on/thinking about other
mergey stuff lately...

A few inline comments and then a proposed course of action at the end.

On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cmpilato_at_collab.net> wrote:
>
> The question then becomes, "How do users deal with legitimate partial dumps
> that will be loaded atop something other than loads from previous
> incremental windows?" I think they do this the same way they handle the
> copy case: either hand-hacking the dumpstream or using 'svndumpfilter'. So
> maybe 'svndumpfilter' grows the logic and options required to pare away
> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
> stream range")?

I vote for svndumpfilter growing the logic to filter ranges predating
the start of the dump stream in the same way it strips mergeinfo with
merge sources filtered from the dump stream. Basically this means
moving the r927243 functionality to svndumpfilter. No really sure we
need a new option, seems we could overload
--skip-missing-merge-sources to include this behavior. Thoughts?

> Julian Foad wrote:
>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>
>>> 1. 'svnadmin dump' warns when it is in incremental mode and must generate
>>> mergeinfo from a merge source that predates the beginning of the dump
>>> window, but it's only a warning and the dump continues.

Why only --incremental mode? We should warn for both --incremental
dumps *and* non-incremental dumps because in the latter case we might
be specifying a dump range -rX:Y where rX>0* and there might be
mergeinfo that refers to revs < X. And while we might want to be
clever about skipping the warning when no range is specified (i.e.
defaulting to -r0:HEAD), even then we should warn because issue #3020
might already have resulted in a r1 with mergeinfo in the starting
repository.

* Admittedly we could probably choose some other minimum rev, maybe r3
(which is the minimum numbers of revisions I can think to create
mergeinfo; r1 - populate the repos, r2 - create a branch and make some
changes on the branch parent, r3 - merge r2 to the branch.

>> +1 on that, both for consistency with copy-source info and because it's
>> a good policy anyway.
>>
>> A detail: I haven't tried it, but would want it to print this warning
>> (and the copy-source one) at the end of the dump, and not just in the
>> middle of the long list of "Dumping rX" messages.
>
> Currently the warnings are all "inline". But I agree that it would be
> beneficial to also just track boolean flags (had_suspect_copies,
> had_suspect_mergeinfo) throughout the dump and repeat warnings at the end:
>
> if had_suspect_copies:
> print "WARNING: The range of revisions dumped contained references "
> "to copy sources outside that range."
> if had_suspect_mergeinfo:
> print "WARNING: The range of revisions dumped contained references "
> "to mergeinfo outside that range."
>
>>> 2. 'svnadmin load' does nothing smart, trusting that the dump it's being
>>> fed is a sensible one.
>>
>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>> warning if it refers to a non-existent source. The admin then gets to
>> figure out whether it was bad in the first place or because of his/her
>> partial-dump/partial-load scenario.
>>
>> However, it depends how efficient the checking is. If that would make
>> the 'load' really slow, I can see that not being wanted.

I created a test repository with 1000 text changes, then 1000 merges
of each change, one at a time, to a branch. Then I dumped
(non-incrementally) the first 1000 revs to a dump file and the second
1000 revs to a second dump file. I loaded the first dump file into
two empty repositories. Then I used trunk build of svnadmin to load
the second dump file into the first repos. This took 00:03:28.890.
Then I used svnadmin with the attached patch to do the same with the
second repository. The patch uses svn_fs_check_path() to check if
each merge source-revision pair in the load stream which predates the
first rev of the load stream, already exists in the target repository
(which in this example they all do, there will be no filtering). No
unsurprisingly, this took significantly (62%) longer, 00:05:39.609.

This patch is fairly inefficient since it checks the same path-rev
pairs multiple times. E.g.: We load a dump file with orginignal
revisions start at r50. At r100 we encounter our first mergeinfo,
'/trunk:r50' on some path. We check that trunk_at_50 exists, it does,
and we allow the mergeinfo. Then we load r101, the mergeinfo on the
same path is now '/trunk:50-51'. We check that trunk_at_50 and trunk_at_51
both exist...and so on. We need to leverage that first check of
trunk_at_50 and not make it again (and again and again). Keep a
mergeinfo catalog of checked paths? That would work, but memory use
might be a problem with big repositories...

...So it can be made faster, but before I go further with it, does
svn_fs_check_path() seem the right way to go? I'm no FS guru, so
maybe there is a better way that is not obvious to me.

> Agreed.
>
> --
> C. Michael Pilato <cmpilato_at_collab.net>
> CollabNet <> www.collab.net <> Distributed Development On Demand

Taking what you've had to say into consideration, I propose the following:

1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
This stops svnadmin load from automatically filtering mergeinfo
referring to revisions outside of the load stream.

2) Reapply or reimplement the part of r327243 that accounts for the
case where the first loaded revision in a load stream has mergeinfo,
see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
item #1.

3) Keep the inline copy-source warnings as they exist today in
svnadmin dump, but add a new end-of-process summary warning e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Referencing data in revision 250, which is older than the oldest
  WARNING: dumped revision (504). Loading this dump into an empty repository
  WARNING: will fail.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to copy
sources outside that range.

4) Implement inline and summary functionality analogous to 3) but for
mergeinfo ranges outside of the dump stream, e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Mergeinfo referencing revision(s) prior to revision 250,
which is older
  WARNING: than the oldest dumped revision (504). Loading this dump may result
  WARNING: in invalid mergeinfo.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to
mergeinfo outside that range.

Also, as mentioned earlier, I think this warning should kick in
regardless of whether --incremental is used.

5) Overload svndumpfilter's --skip-missing-merge-sources option to
also remove mergeinfo that predates the starting revision of the dump
stream.

6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
in the load steam or the target repos:

   A) It doesn't.

   B) Only checks that the revs are valid (fast, but
      can allow mergeinfo that describes invalid
      mergesource:rev pairs)

   C) Find an efficient way to test for valid
      mergesource:rev pairs

Paul

Received on 2010-04-21 00:12:17 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.