Hi Julian,
Thanks for responding.
On Thu, Jan 15, 2015 at 12:55 AM, Julian Foad
<julianfoad_at_btopenworld.com> wrote:
> Hi Patrik.
>
> Thanks for the report.
>
> I agree there's something wrong with the range calculations you point out.
>
>
> In order to help me follow along and examine the source code, please could you say exactly which programs and versions you are using for each step? Are you using 'svndumpfilter' for the filtering?
I'm using 1.7.19. The server that generated the initial dumpfile is
still running 1.6, I believe, if that matters.
I tried using svndumpfilter, but there has been enough copying back
and forth during the history of the repo that recursively including
all those paths became untenable. Instead, I used
https://github.com/jasperlee108/svndumpfilterIN, which uses svnlook to
copy the content of an out-of-path copyfrom when it finds one.
However, it doesn't do any renumbering of the revs, so if
svndumpfilter does that, that's probably where my problem (1) comes
from.
>
> 'svndumpfilter' and 'svnadmin load' are both supposed to adjust revision number references in both copyfrom and mergeinfo. However, I have looked at the code and it doesn't look robust.
I can definitely say that 'svnadmin load' choked when it got to the
first copyfrom with a nonexistent rev. I haven't look that carefully
at the code, but it seemed to me that it was only doing renumbering
due to differing start revs in the dump and the repo, not fixing
"holes" in the numbering.
Regards,
/Patrik
>
> See below for more notes.
>
>
> Patrik Jonsson wrote:
>
>> I'm trying to dump a very large (300GB) svn repo so we can split it up
>> into manageable sizes. However, I'm running into problems with
>> revision references. I don't know if these are bugs in svn or if our
>> repo somehow is corrupted, but here's an outline of the problems. Any
>> help would be appreciated. Explaining the issues in full will take
>> some space, so bear with me.
>>
>> 1. After filtering, there are Node-copyfrom-revs that refer to
>> filtered out revisions.
>>
>> In the dumpfile, rev X of path A has Node-copyfrom-path B and
>> Node-copyfrom-rev Y. However, rev Y didn't touch path B, so it's
>> filtered out, and svnadmin load dies with "svnadmin: E160006: No such
>> revision Z". (Note that it reports the nonexisting revision as Z,
>> which is in fact *not* Y. It's the nearest existing rev < Y.)
>>
>> This is not a problem with the filter; the original dump of the repo
>> contains the same references. Of course, there rev Y exists so there's
>> no problem loading it. If this is a bug, it presumably originated when
>> rev X was committed.
>
> You say it is not a problem with the filter, but I would like to know if you are using svndumpfilter, because that is supposed to fix up copyfrom revs, as I understand it, so that might be interesting to look at as well.
>
>> This appears to be the same problem reported previously in
>> http://svn.haxx.se/users/archive-2013-08/0560.shtml, but I saw no
>> resolution to that post.
>>
>> I have attempted to bypass this problem by editing the dump stream to
>> change Node-copyfrom-rev entries to nonexistent versions to instead
>> point to the nearest, earlier, existing version. This seems to fix the
>> problem. But then I run into the next:
>>
>>
>> 2. svnadmin load crashes with failed IS_VALID_FORWARD_RANGE assertion
>> in mergeinfo.c
>>
>> The next problem occurs when svnadmin load encounters a mergeinfo
>> property. (The revisions in the mergeinfo properties also point to
>> nonexistent revisions, just like Node-copyfrom-path above, so they
>> have also been remapped to the nearest existing revs.)
>>
>> The mergeinfo encountered is "A:X-Y". When this is parsed in
>> renumber_mergeinfo_revs, it turns into (X-1)-Y, because
>> parse_rangelist substracts one from firstrev. I don't know what this
>> -1 comes from,
>
> The 'svn_merge_range_t' struct always stores a range as: range->start = (X - 1), range->end = Y, for the range that is represented as "X-Y" in mergeinfo. That is the reason when the parser subtracts one.
>
>> but when renumber_mergeinfo_revs calls
>> get_revision_mapping on X-1 (which does not exist in the filtered
>> stream) it gets a bogus rev. (It maps to the rev immediately preceding
>> X in the dump stream.)
>
> Yes, that's a bug, I agree.
>
>> Y, however, maps correctly to the number in the
>> new loaded repo. This causes X>Y and the invalid range assertion
>> fires. If I manually call get_revision_mapping(X), I get the correct
>> rev for X in the loaded repo, so this is caused by the -1 in
>> parse_rangelist.
>
> Or rather this is caused by renumber_mergeinfo_revs() failing to add one before calling get_revision_mapping().
>
>> I see that if it gets an invalid rev from the mapping (which,
>> according to the comment, happens for the revision immediately
>> preceding the oldest revision in the stream) it uses the oldest rev
>> from the load stream, minus 1.
>>
>> If this is because the start rev in the merge_range is actually one
>> less than the start rev, it seems that rather than calling
>>
>> rev_from_map = get_revision_mapping(pb->rev_map, range->start);
>>
>> it needs to do
>>
>> rev_from_map = get_revision_mapping(pb->rev_map, range->start+1) -1;
>>
>> because while that may not matter if the revs are consecutive, if
>> there are filtered out revs in the stream, they are not the same. If I
>> make this change, the load appears to proceed OK.
>
> Yes, I agree. However, there is much more code present in renumber_mergeinfo_revs() than just that simple look-up. It is obfuscated code. I find it very difficult to understand what it is supposed to do and what it actually does. I would like us to rewrite it, and make it very much simpler.
>
>> Please CC me when replying, as I'm not subscribed to this list.
>
> - Julian
Received on 2015-01-15 19:22:58 CET