Re: svn obliterate - more feasible these days?

From: Stefan Fuhrmann <stefan2_at_apache.org>
Date: Tue, 9 Apr 2019 02:32:20 +0200

On 01.04.19 15:13, lists_at_m8y.org wrote:
> On Fri, 29 Mar 2019, Johan Corveleyn wrote:
>
>> On Fri, Mar 29, 2019 at 7:25 PM <lists_at_m8y.org> wrote:
>> ...
>>> Now, when I run into an hg feature that I particularly find useful,
>>> I ask #svn if there's an equivalent for use at work.
>>> In this case it was hg censor which allows easy removal of a change
>>> from the history.
>>> Intended for fixing issues with licensing or accidentally uploaded
>>> keys or passwords.
>>>
>>> I asked what svn had along those lines and following conversation
>>> ensued:
>>> https://m8y.org/chats/svn_obliterate.xhtml
>>> ^^^ Link to IRC chatlog of mostly danielsh explaining complexities
>>> with amending history in svn ^^^
>>>
>>> danielsh suggested I move this to the list. So I did.
>>> Is he right in that it might be easier to implement with this new
>>> filesystem? Are there still gotchas due to svn's design?
>>
>> Hi Nemo, thanks for bringing this to the list :-).
>>
>> I can't comment on the feasibility of implementing this in the
>> filesystem, and whether FSFS f7 (with logical addressing enabled -- it
>> is the default for f7, but is optional) makes it easier than f6.
>> Perhaps Stefan Fuhrman, who wrote most of the FSFS7 code, can share
>> some insight ...
>>
>> However, at the Aachen 2017 hackathon we ended up discussing
>> obliterate a bit [1] ("What hackathon is complete without a discussion
>> of obliterate?"). We focused on another hairy part of the problem:
>> what (should) happen(s) with existing working copies? How should
>> clients handle the rewritten history?
>>
>> Some options:
>> (-1) Client doesn't notice the history change. Existing working
>> copies may or may not break at any time, in unpredictable ways.
>>
>> (0) Client detects the changed history, and errors out with: "your
>> working copy has become unusable, check out a new one". This is
>> already possible today, by changing the UUID of the repository.
>>
>> (1) Only working copies which are affected by the history change get
>> invalidated.
>>
>> (2) Working copies which are affected automatically adjust / rebase
>> / remove the obliterated content.
>>
>> See also this thread from last year [2] where some ideas were bounced
>> around (including a bit about "what should clients do with existing
>> working copies?" [3])
>
TL;DR: Obliterate has become quite feasible with fsfs7 but may not be
the best solution for your use-case.
> Eh, I don't feel it's a hijack, I'm curious if it's technically
> feasible, but it's good to know people are actually thinking about
> implementation issues.
> FWIW, I tried mercurial DVCS' censor and it worked pretty much as I
> expected.
> That is, there's no attempt to alter the history of remote clones
> (good IMO).
>
> So, if you cloned prior to the censor, you get the unmodified copy.
> Further updates do not change this.
> If you clone after the censor you get the modified copy.
The equivalent problem in SVN manifests in any working copy containing
the obliterated data. In fact, if it did at any point in the past, the
pristine store may still contain it. There is no enforced "update + cleanup".
>
> I don't know how well this maps to SVN's centralised approach, but
> treating the working copy similarly makes sense to me...
Yes.

As Johan already said, there should be a way to validate a
given working copy against the repository. Without that, you may see
things like "checksum mismatch" errors during commit etc. We don't
have that feature right now, though.
>
> Possibly related, what happens to working copies now, if I use
> svndumpfilter or authz to hide/remove a file from the repository?
Semantically, they break as they refer to a repository that no longer
exists. In practice, it depends: If your revision(s) of your checked out
branch did not change, you should be fine. It is just that there is no
way for the user to tell whether that is the case.

Back to the actual obliterate. Subversion has been designed to never
modify history. That makes it foolproof, as you can always revert to
any previous revision (which may also be a legal requirement in some
cases). So, if you don't want certain files in your project, just delete
them and commit; you will never lose any data.

That already covers a use-case that is annoying to handle in VCS
which replicate the whole repo / history on the client. Furthermore,
if the data is legal to be in the repository (e.g. nothing that your
company has no right to), you might as well keep it. Set up authz
on it to hide it from non-authorized users, if you need to be sure
that sensitive data will be protected anywhere in the history. Authz
should be fast enough these days.

Beyond that point, obliterate becomes an option. Here is how you
might implement it on FSFSv7:

(1) Identify the node-revision(s) that contain data to be removed.
     Note: removing revisions themselves is a lot more work with
     little added benefit.
(2) Those noderevs point to the actual representations to obliterate.
(3) Scan the repository for all noderevs and representations (delta!)
     that point to representations to obliterate. Thanks to the index
     data in FSFSv7, this is basically a linear read at full disk
     throughput.
(4) Update dependent data.
(5) Remove obliterated representations.
(6) Bump instance ID and tell users to checkout anew
     (maybe, change the repo URL?).
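
Steps (1) to (3) can be illustrated with a toy in-memory model. This is
only a sketch, not FSFS code: the Rep class, the find_affected() function
and the "path@rev" noderev labels are made up for illustration, and real
noderevs and representations carry far more information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rep:
    rep_id: int
    base: Optional[int]  # rep this one is a delta against; None = stored in full


def find_affected(reps, noderevs, targets):
    """Toy version of steps (1)-(3).

    reps:     iterable of Rep (hypothetical stand-in for on-disk data)
    noderevs: dict mapping a noderev label to the rep_id it points to
    targets:  set of noderev labels whose content must go
    """
    # (1)+(2): the targeted noderevs name the representations to obliterate.
    obliterate = {noderevs[name] for name in targets}

    # (3): one linear pass over all representations, looking for deltas
    # whose base is about to disappear.
    delta_dependents = [rep.rep_id for rep in reps
                        if rep.rep_id not in obliterate
                        and rep.base in obliterate]

    # (3): other noderevs that point to the same representations.
    sharing = sorted(name for name, rep_id in noderevs.items()
                     if rep_id in obliterate and name not in targets)
    return obliterate, delta_dependents, sharing


reps = [Rep(1, None), Rep(2, 1), Rep(3, 2), Rep(4, None)]
noderevs = {
    "/trunk/conf@10": 1,
    "/branches/b/conf@11": 1,   # shares representation 1
    "/trunk/conf@12": 2,
    "/trunk/conf@13": 3,
    "/trunk/README@10": 4,
}
obliterate, dependents, sharing = find_affected(
    reps, noderevs, {"/trunk/conf@10"})
```

Here representation 2 is the only direct delta dependent, and the branch
noderev shows up because it points to the same representation as the
target. Both sets feed into step (4).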

What you do in (4) depends on your use-case. Representations (file,
directory and property content) are usually stored as deltas against
some previous representation. One option is to replace all nodes along
the delta tree with empty representations. Another option is to store
their contents in full instead of as a delta against obliterated data.
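
The two options can be compared on a toy delta model. This is a sketch
under loud assumptions: a "delta" here just means "keep the first N bytes
of the base, then append some data", which is much simpler than the real
delta format, and expand(), store_in_full() and empty_along_tree() are
hypothetical names.

```python
# Each representation is (base_id, keep, data):
#   base_id None -> data is the full content
#   otherwise    -> content = expand(base)[:keep] + data
def expand(reps, rep_id):
    """Reconstruct full content by following the delta chain."""
    base, keep, data = reps[rep_id]
    if base is None:
        return data
    return expand(reps, base)[:keep] + data


def store_in_full(reps, obliterate):
    """Direct dependents get their full content; obliterated reps go empty."""
    out = dict(reps)
    for rep_id, (base, _keep, _data) in reps.items():
        if rep_id not in obliterate and base in obliterate:
            out[rep_id] = (None, 0, expand(reps, rep_id))
    for rep_id in obliterate:
        out[rep_id] = (None, 0, b"")
    return out


def in_delta_tree(reps, rep_id, obliterate):
    """True if rep_id is obliterated or (transitively) deltas against it."""
    while rep_id is not None:
        if rep_id in obliterate:
            return True
        rep_id = reps[rep_id][0]
    return False


def empty_along_tree(reps, obliterate):
    """Everything along the delta tree becomes an empty representation."""
    return {rep_id: (None, 0, b"") if in_delta_tree(reps, rep_id, obliterate)
            else rep
            for rep_id, rep in reps.items()}


reps = {
    1: (None, 0, b"name=app\nkey=SECRET\n"),  # to be obliterated
    2: (1, 9, b"key=from-env\n"),             # keeps "name=app\n" from rep 1
    3: (2, 22, b"port=80\n"),                 # keeps all of rep 2
    4: (None, 0, b"unrelated\n"),
}
full = store_in_full(reps, {1})
emptied = empty_along_tree(reps, {1})
```

With store_in_full() only representation 2 is rewritten; representation 3
still expands to the same content because its base yields the same bytes
as before. With empty_along_tree() representations 1 to 3 all lose their
content and only 4 survives.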

FSFSv7 makes it easy to change the size of a piece of data without
breaking any pointers: The index contains the mapping.
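
Why an index helps can be shown with a minimal sketch. It assumes a
made-up layout in which items are concatenated into one blob and an index
maps item numbers to (offset, size); the real FSFSv7 index format is
different, and build(), read() and replace_item() are hypothetical
helpers.

```python
def build(items):
    """Pack items (dict: item number -> bytes) into a blob plus an index."""
    blob = bytearray()
    index = {}
    for item_no in sorted(items):
        index[item_no] = (len(blob), len(items[item_no]))
        blob += items[item_no]
    return bytes(blob), index


def read(blob, index, item_no):
    """Resolve an item number through the index, then read its bytes."""
    offset, size = index[item_no]
    return blob[offset:offset + size]


def replace_item(blob, index, item_no, new_data):
    """Rewrite one item with data of a different size and rebuild the index."""
    items = {n: read(blob, index, n) for n in index}
    items[item_no] = new_data
    return build(items)


blob, index = build({1: b"SECRET", 2: b"hello"})
new_blob, new_index = replace_item(blob, index, 1, b"")
```

Anything that refers to "item 2" keeps working after the rewrite, even
though its physical offset moved from 6 to 0. With pointers that store
physical offsets directly, every such pointer would have to be found and
patched.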

As for removing representations (5), I would suggest replacing
them with empty ones. If you need to remove any trace of them,
then you must also scan for directories in (3) and must update
their representations, too, if they reference obliterated data. That
would be somewhat slower but can still be done with a single scan.
You need to track more info on the fly, though.

I hope that this short sketch gives a good idea of what needs to
be done on the server side. A simple version of this (replace with
empty, rewrite dependents with full content) should be doable with
a couple of hundred LOCs. Right now, nobody is actually working
on this, though. If you have the spare cycles, take a look at the
fsfs-stats code for how to scan v7 repos and give it a go.

-- Stefan^2.
Received on 2019-04-09 02:32:21 CEST
