Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

From: Paul Burba <ptburba_at_gmail.com>
Date: Tue, 11 May 2010 16:00:56 -0400

On Tue, Apr 20, 2010 at 11:08 PM, C. Michael Pilato <cmpilato_at_collab.net> wrote:
> Paul Burba wrote:
>> Hi Mike and Julian,
>>
>> Sorry for the delayed response; been working on/thinking about other
>> mergey stuff lately...
>>
>> A few inline comments and then a proposed course of action at the end.
>>
>> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cmpilato_at_collab.net> wrote:
>>> The question then becomes, "How do users deal with legitimate partial dumps
>>> that will be loaded atop something other than loads from previous
>>> incremental windows?" I think they do this the same way they handle the
>>> copy case: either hand-hacking the dumpstream or using 'svndumpfilter'. So
>>> maybe 'svndumpfilter' grows the logic and options required to pare away
>>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>>> stream range")?
>>
>> I vote for svndumpfilter growing the logic to filter ranges predating
>> the start of the dump stream in the same way it strips mergeinfo with
>> merge sources filtered from the dump stream. Basically this means
>> moving the r927243 functionality to svndumpfilter. No really sure we
>> need a new option, seems we could overload
>> --skip-missing-merge-sources to include this behavior. Thoughts?
>
> +1
>
>>> Julian Foad wrote:
>>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>>
>>>>> 1. 'svnadmin dump' warns when it is in incremental mode and must generate
>>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>>> window, but it's only a warning and the dump continues.
>>
>> Why only --incremental mode? We should warn for both --incremental
>> dumps *and* non-incremental dumps because in the latter case we might
>> be specifying a dump range -rX:Y where rX>0* and there might be
>> mergeinfo that refers to revs < X. And while we might want to be
>> clever about skipping the warning when no range is specified (i.e.
>> defaulting to -r0:HEAD), even then we should warn because issue #3020
>> might already have resulted in a r1 with mergeinfo in the starting
>> repository.
>
> You're right, of course. I was thinking "partial dump" and saying
> "incremental". My bad.
>
>>>>> 2. 'svnadmin load' does nothing smart, trusting that the dump it's being
>>>>> fed is a sensible one.
>>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>>> warning if it refers to a non-existent source. The admin then gets to
>>>> figure out whether it was bad in the first place or because of his/her
>>>> partial-dump/partial-load scenario.
>>>>
>>>> However, it depends how efficient the checking is. If that would make
>>>> the 'load' really slow, I can see that not being wanted.
>
> [...]
>
>> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.
>
> (My vote was for 'svnadmin load' does nothing smart, and I stand by it.)
>
>> Taking what you've had to say into consideration, I propose the following:
>>
>> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
>> This stops svnadmin load from automatically filtering mergeinfo
>> referring to revisions outside of the load stream.
>>
>> 2) Reapply or reimplement the part of r327243 that accounts for the
>> case where the first loaded revision in a load stream has mergeinfo,
>> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
>> item #1.
>
> (r927243, you mean.)
>
>> 3) Keep the inline copy-source warnings as they exist today in
>> svnadmin dump, but add a new end-of-process summary warning e.g.:
>>
>> * Dumped revision 504.
>> * Dumped revision 505.
>> * Dumped revision 506.
>> * Dumped revision 507.
>> WARNING: Referencing data in revision 250, which is older than the oldest
>> WARNING: dumped revision (504). Loading this dump into an empty repository
>> WARNING: will fail.
>> * Dumped revision 58.
>> * Dumped revision 59.
>> * Dumped revision 510.
>> * Dumped revision 511.
>> .
>> .
>> .
>> WARNING: The range of revisions dumped contained references to copy
>> sources outside that range.
>>
>> 4) Implement inline and summary functionality analogous to 3) but for
>> mergeinfo ranges outside of the dump stream, e.g.:
>>
>> * Dumped revision 504.
>> * Dumped revision 505.
>> * Dumped revision 506.
>> * Dumped revision 507.
>> WARNING: Mergeinfo referencing revision(s) prior to revision 250,
>> which is older
>> WARNING: than the oldest dumped revision (504). Loading this dump may result
>> WARNING: in invalid mergeinfo.
>> * Dumped revision 58.
>> * Dumped revision 59.
>> * Dumped revision 510.
>> * Dumped revision 511.
>> .
>> .
>> .
>> WARNING: The range of revisions dumped contained references to
>> mergeinfo outside that range.
>>
>> Also, as mentioned earlier, I think this warning should kick in
>> regardless of whether --incremental is used.
>>
>> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
>> also remove mergeinfo that predates the starting revision of the dump
>> stream.
>
> +1 all over this stuff.
>
>> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
>> in the load steam or the target repos:
>>
>> A) It doesn't.
>>
>> B) Only checks that the revs are valid (fast, but
>> can allow mergeinfo that describes invalid
>> mergesource:rev pairs)
>>
>> C) Find an efficient way to test for valid
>> mergesource:rev pairs
>
> For now, A (or maybe B). Then we have our Paul back. :-)
>

Been down the rabbit hole on-and-off the last couple weeks with this.
Did everything we discussed so far in this thread, plus a slew of new
bugs, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc15
and onward.

At any rate, after my recent work we are left with one common(?) use
case still broken:

Loading a sequence of incremental dumps when the fist load in the
sequence is into a *non-empty* repostitory.

This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
from incremental dump', specifically see '# PART 4: Load a a series of
incremental dumps to an non-empty repository."

Basically the problem is this:

1) Given repository ReposA with X revisions in it.

   Note: Previously this use case was also broken if ReposB
   was empty to start, but this now works (assuming none of
   the incremental dumps in step 3 were svndumpfiltered in
   such a way as to remove/renumber revisions).

2) Given repository ReposB with Y revisions in it.

3) Incrementally (but fully) dump ReposA

   svnadmin dump ReposA -r0:100 > A.0.100.dump
   svnadmin dump ReposA -r101:200 --incremental > A.101.200.dump
   svnadmin dump ReposA -r201:300 --incremental > A.201.300.dump
   .
   .
   .
   svnadmin dump ReposA -rW:X --incremental > A.W.X.dump

4) Load the ReposA incremental dumps to ReposB

   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
   .
   .
   .
   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump

The problem arises when one of the incremental loads contains
references to mergeinfo source revisions that predate the start of the
dump. For example, say A.201.300.dump contains mergeinfo referencing
r82. When this is loaded into ReposB the reference should be changed
to r(82+Y) to reflect the fact that ReposB has Y revisions in it
before we ever started loading parts of ReposA. Currently no change
is made and the source rev stays at r82, which is obviously incorrect.

The attached patch fixes this problem. It takes all mergeinfo which
predates the first revision in the dump stream and adjusts it by the
difference between the head rev in ReposB and the current dump stream
revision. Yes, this is a fairly fragile solution (as is our copyfrom
sources mapping, which this is based on). If unrelated commits are
made to ReposB between loads of the incremental dumps, then the
mergeinfo source revs will be wrong; but it is *always* wrong right
now, at least this would fix this use case. Opinions appreciated.

And yes, there are still other potential problems: Think incremental
dumps that are dumpfiltered such that some revs are dropped or
renumbered. But I'm not sure we really want to solve *every* problem
in this space, I'm not even sure we can (without bringing the
performance of svnadmin load to its knees). I think this patch brings
us to "good enough" with the caveat that ill-advised use of svnadmin
dump/load and/or svndumpfilter can result in incorrect mergeinfo.

Paul

text/plain attachment: FinalIssue3020.log.txt

text/plain attachment: FinalIssue3020.fix.diff.txt

Received on 2010-05-11 22:01:27 CEST

This message: [ Message body ]
Next message: Daniel Shahaf: "Re: svn commit: r943312 - /subversion/site/publish/mail"
Previous message: Greg Stein: "Re: svn commit: r943219 - /subversion/trunk/subversion/libsvn_wc/update_editor.c"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]