Re: [RFC] Altering copyfrom information in repository

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Sat, 10 Dec 2011 22:34:04 +0100

On Wed, Dec 7, 2011 at 1:40 PM, Julian Foad <julianfoad_at_btopenworld.com> wrote:
> Hi Johan. See below...

Hi Julian. Thanks for your insights. You've obviously thought a lot
more about this stuff than I have :-). Some more below ...

> On 28 November 2011, Johan Corveleyn wrote:
>> On Mon, Nov 28, 2011 at 7:32 AM, Daniel Shahaf wrote:
>>> On Sunday, November 27, 2011 11:16 PM, "Johan Corveleyn" wrote:
>>>> <wild idea>
>>>> What if we could 'svnadmin (re)load' a single revision $REV in a
>>>> repository, which would then automatically fix up everything coming
>>>> after $REV:
>>>>
>>>>    0. Take backup
>>>>    1. Dump $REV
>>>>    2. Fix $REV.dumpfile with some dumptool
>>>>    3. Take repo offline
>>>>    4. Reload $REV (fixes up everything after $REV)
>>>>    5. Bring repo back online
>>>>
>>>> For the part of "... automatically fix up everything coming after
>>>> $REV":
>>>>
>>>>    - naive approach: simply dump+load internally in the repository
>>>> ("reload") everything from $REV+1 until HEAD.
>>>>
>>>>    - better approaches may be possible, depending on the change that
>>>> was done in $REV, and depending on the type of backend.
>>>>
>>>> Of course this reloading step will be more costly if $REV is far
>>>> before HEAD, but that's normal I guess. If you are able to fix
>>>> problems not too late after they happened, the reloading cost will be
>>>> reasonable.
>>>> </wild idea>
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> Johan
>>>>
>>> You're asking how to implement a generic rewrite of a historical
>>> revision, but aren't addressing the question of what to do with
>>> younger-than-the-rename revisions that do not apply (in the libsvn_delta,
>>> libsvn_diff, or tree-delta sense) to the modified history.
>>
>> I'm not sure I understand. If all those younger-than-the-rename
>> revisions are "reloaded", there wouldn't be a problem, right? Ok,
>> maybe some of them don't need to be touched in any way, because they
>> do not apply to the modified history, but that can be seen as an
>> optimization, right?
>>
>> It's actually a bit similar to your suggestion of 'svnsync
>> --up-to-revision', which you made elsethread. But with dump/load, and
>> wrapped into a convenient tool for an svn administrator.
>>
>>> If you're serious about solving this problem I strongly suggest that you
>>> talk to Julian. I think he went up and down this path so much that he
>>> can tell the squirrels' furs' colors from hearing.
>>
>> Right. Julian, what do you think about all this?
>>
>> Is "making it easier to dump+load a single revision" an option to make
>> it possible to "fix history" (of a single revision)? Or is it a dead
>> end?
>
>
> That could certainly be helpful in implementing one part of any such history-editing feature. I see two difficult areas. Let's say you change rX.
>
> From a high-level point of view, what result do you want when a subsequent revision rY (where Y > X) touches a file or directory that would have existed in rX but no longer exists in rX because of the change made to rX? It's not difficult to specify some reasonable options here (things like: adjust rY to leave the final state of rY just as it was, which may involve recreating any nodes that were obliterated from rX; or delete the node; or bail out), it's just a matter of choosing, so in a sense this isn't a difficulty just a design choice.

Ok, that seems to be a question somewhat specific to obliteration (or
other forms of "fundamental" revision manipulation). But it is indeed an
important design choice to settle in that case.

> From an implementation POV, as soon as you replace rX with a new rX, the subsequent revisions in the repository become invalid unless the change you made to rX was very simple. Any deltas based on rX, any copy-from pointers, node Ids, and so on, may become invalid. So you can't in general replace rX inside the repository. If you did so, then r(X+1) up to HEAD would immediately become more or less unreadable, broken. One solution is to copy the whole repo up to r(X-1) and then load the new revisions into that copy of the repository. But if you really want to do this inside the repository, which is what I was trying to do, then in order to fix up all the revisions rX+1:HEAD you need to do something like either keep track in memory of what you are updating and rewriting, which gets quite complex; or fork the history inside the repository (leave the old rX in place, write a new chain of revisions rX' rX+1' rX+2', while reading from the original chain rX rX+1 ...), and then make the new (rX' ...) chain active and delete the old chain.

Ah yes. I hadn't considered that. As soon as rX is changed, it's no
longer certain that you can "dump" rX+1 or any subsequent revision.
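
To make that concrete, here is a toy model in Python. It is only a
sketch and not Subversion code: the op tuples, rebuild() and
BrokenHistory are all made up for illustration. Each revision is a list
of changes, and a copy refers back to a path in an older revision, much
like copyfrom. Editing r1 in place leaves r2 pointing at a source that
is no longer there.

```python
class BrokenHistory(Exception):
    """Raised when a revision refers to something that no longer exists."""


def rebuild(revs):
    """Replay all revisions; return a list of tree snapshots, one per revision.

    revs[0] holds the changes of r1, revs[1] those of r2, and so on.
    snapshots[n] is the tree ({path: content}) as of revision n.
    """
    snapshots = [{}]  # r0 is the empty tree
    for number, changes in enumerate(revs, start=1):
        tree = dict(snapshots[-1])
        for op in changes:
            if op[0] == "add":
                _, path, content = op
                tree[path] = content
            elif op[0] == "copy":
                _, dst, src, src_rev = op
                # The copy source is looked up in an older snapshot.
                if src not in snapshots[src_rev]:
                    raise BrokenHistory(
                        f"r{number}: copyfrom {src}@{src_rev} does not exist")
                tree[dst] = snapshots[src_rev][src]
            elif op[0] == "delete":
                _, path = op
                if path not in tree:
                    raise BrokenHistory(f"r{number}: cannot delete {path}")
                del tree[path]
        snapshots.append(tree)
    return snapshots


# Original history: r1 adds a file, r2 copies it to a branch.
history = [
    [("add", "trunk/a.txt", "hello")],
    [("copy", "branches/b/a.txt", "trunk/a.txt", 1)],
]

# "Fixed" history: r1 now adds the file under another name, r2 untouched.
edited = [
    [("add", "trunk/renamed.txt", "hello")],
    history[1],
]
```

Replaying `history` works, while replaying `edited` raises BrokenHistory
at r2. In the real filesystem the damage would be at a lower level
(deltas, node ids), but the dependency runs the same way.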

> The benefit of 'forking' the chain of revisions is that the repository filesystem code can read the old revisions on request, and so you could for example convert them into dump file format. Conversely, to keep track in memory of what you are updating and rewriting, and traverse rX:HEAD fixing up as we go, that necessarily must be done at a very low level because those revisions are already 'broken' by the time we come to fix them up, and so they cannot be read by the normal APIs.
>
> That's the stuff I tried to get my head around before.
>
> If we choose to only support some very limited transformations within rX, then the 'traverse rX+1:HEAD, fixing them up as we go' approach could perhaps be simple enough to be feasible. But it's still low-level code and thus specific to each FS back-end, with the problem that FSFS is more in demand but BDB is much easier to do this sort of thing.
>
> Now I'm thinking the 'fork history inside the repo' or 'clone the repo' approaches are better, even though they require more disk space and/or more time, because being higher level gives several advantages. If we adapt your idea of making it easier to 'replace' a revision, and instead make it easier to import and export a revision, then that would certainly be a useful part of such a solution.

Yes, forking definitely seems superior. It seems much more robust (if
anything goes wrong during construction of the new fork, no problem),
and can even allow for doing things like this on a live repository.
That seems like the way to go :-).

Now, another shot at a "cheap solution" in the meantime: how about
documenting and/or scripting a standard way to do a dump+load of the
tail of the repository (since dumping and loading a big repository in
full can be so time-consuming). Or is that a known procedure already?

Say I want to change rX:

0) Make sure I have a backup.
1) Dump rX. Manipulate it.
2) Take repository offline.
3) Dump rX+1:HEAD.
4) Make the repository forget rX:HEAD.
5) Load rX:HEAD from the dumpfiles.
6) Bring repository back online.

How to do step 4? For FSFS and for BDB? Is that already documented?

How about an 'svnadmin truncate' command to support this (making sure
people don't shoot themselves in the foot by following some
error-prone manual procedure)? Or is this simply too dangerous a tool?
It seems no more dangerous than being able to do 'rm -rf db/revs' if
you have those permissions. And we have the opportunity to give some
"are you sure?" warnings.
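
For comparison, the "copy the whole repo up to r(X-1)" variant that
Julian mentions avoids step 4 altogether, at the cost of redumping the
head of the repository as well. Below is a minimal Python sketch that
only prints the command sequence; it does not run anything. The function
name, the paths and the dumpfile name are hypothetical. It assumes the
edited rX dumpfile was produced with 'svnadmin dump -r X --incremental'
and that the old repository stays readable throughout.

```python
import shlex


def clone_and_reload_plan(old_repo, new_repo, rev, head, fixed_dump):
    """Return the shell commands that rebuild new_repo with a replaced rev.

    old_repo   -- path of the existing repository (only read from)
    new_repo   -- path of the repository to create (hypothetical name)
    rev        -- number of the revision being replaced (rX)
    head       -- youngest revision of old_repo
    fixed_dump -- dumpfile holding the manually edited rX
    """
    if not 1 <= rev <= head:
        raise ValueError("rev must be between 1 and head")
    old, new, dump = (shlex.quote(p) for p in (old_repo, new_repo, fixed_dump))

    plan = [f"svnadmin create {new}"]
    if rev > 1:
        # Everything before rX is copied unchanged.
        plan.append(f"svnadmin dump {old} -r 0:{rev - 1} | svnadmin load {new}")
    # The edited rX takes the place of the original one.
    plan.append(f"svnadmin load {new} < {dump}")
    if rev < head:
        # The tail is dumped from the old, still intact repository.
        plan.append(
            f"svnadmin dump {old} -r {rev + 1}:{head} --incremental"
            f" | svnadmin load {new}")
    return plan


if __name__ == "__main__":
    for command in clone_and_reload_plan("old", "new", 5, 10, "r5.fixed.dump"):
        print(command)
```

Whether the tail still loads cleanly on top of the edited rX depends on
what was changed, which is exactly the design question above. Hooks and
configuration are not part of a dump, so they would have to be copied
over separately.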

-- 
Johan

Received on 2011-12-10 22:34:58 CET
