Re: svn commit: r965892 - in /subversion/trunk: notes/dump-load-format.txt subversion/include/svn_repos.h subversion/libsvn_repos/dump.c

From: Stefan Sperling <stsp_at_elego.de>
Date: Thu, 22 Jul 2010 16:11:07 +0200

On Thu, Jul 22, 2010 at 09:00:25AM -0400, C. Michael Pilato wrote:
> On 07/22/2010 06:15 AM, Stefan Sperling wrote:
> > On Thu, Jul 22, 2010 at 09:53:35AM +0100, Philip Martin wrote:
> >> Perhaps we could just have one checksum that includes all the property
> >> names and values?
> >
> > I'd like loaders to be able to tell users which properties are
> > corrupted.
>
> This change feels obnoxious to me, adding still more cost to the dump
> process for little theoretical gain and zero practical gain (since our own
> loader does nothing with the information).

My plan was to make the loader do something with the information next.

> When the dump stream format and code was first designed, if the FS backend
> hadn't calculated and stored a checksum for the file contents, no checksum
> was placed into the dumpstream. Further, when in --deltas mode, the
> checksum you (might) get in the dumpstream is, again, the checksum stored in
> the FS backend for the fulltext of the file contents -- it doesn't represent
> the checksum of the svndiff-compressed delta in the stream. This was the
> overarching sentiment of the dump format: to carry information stored in
> the FS backend, as much as can be useful to the folks on the receiving end,
> but certainly without unnecessary cost to the dumpfile producer.

That reasoning makes sense. But there's no reason the FS backend
could not provide more information in the future.

> Back when Hyrum did his checksum API overhaul (in r872722, to be precise),
> he made a subtle change here that deviated from this sentiment. He told the
> dumpstream generator code to forcibly generate checksums for placement in
> the dump stream (he passed 'TRUE' for the 'force' parameter of the new
> svn_fs_file_checksum() function everywhere). The code today is left reading
> kinda weird as a result:
>
> checksum = svn_fs_file_checksum(FORCE)
> hex_digest = svn_checksum_to_string(checksum)
> if (hex_digest)
> ... #### Wait -- "*if*" hexdigest? You forced its creation above!
>
> Given the vast amount of Subversion data in existence today that probably
> doesn't have an FS-stored SHA1 checksum (since we introduced that only in
> the past couple of years), the immediate affect of these changes was that
> the dumper code was now running over every fulltext of every file an
> additional time just to generate SHA1's (since older repositories didn't
> store SHA1's). And then again for copyfrom sources. And then again for
> text bases (when in --deltas mode).

So we should not have added the additional checksums in the first place?
You might as well point fingers at the old backends for not storing checksums,
rather than at Hyrum trying to do his best with what he had.

I don't see a big problem. Just that it would be much nicer to have the
checksums in all repositories already.

> (Some of those may reduce away if the
> backend is able to store the SHA1's forcibly generated from previous requests.)

Is this already implemented? So if people upgrade the repository format,
over time the dump performance will get better again?

> Hyrum also changed the loader code so that it forcibly acquires a checksum
> for copyfrom sources for comparison against the copyfrom-checksum values
> from the stream.
>
> As if dump/load needed to run any slower.

Property values are usually not large, so the overhead introduced
by my change should be much lower than that of Hyrum's change.

> You've extended this behavioral change now. Tell me -- if the FS backend
> and the rest of Subversion don't care enough about the possibility of
> property corruption to store and transmit checksums for those things,
> where's the real value in teaching the dump format (alone) to do so?

Of course, there isn't much value in that alone.
But it's not as if the rest of Subversion could not get any more
paranoid in the future.

Let me explain why I did this:

The idea for this change came to me in a conversation with Michael Diers.
He said he saw no way of verifying that svnsync created proper duplicates
of a repository. He had seen svnsync breaking repositories at a customer
site like this:

Transmitting file data .svnsync: Base checksum mismatch on 'foobar':
expected: a9c9eb4994b50f9667e0c1509782df26
actual: ae3957e324c0ee2228d76732c9a89747

To fix replication they had to rollback the mirror to an earlier revision.
Now that replication was fixed, the customer asked what could be done to
make sure no such corruption had occurred earlier, but went unnoticed.
So you can dump the master and dump the clone, and then compare the
dump files. Then another question was raised: If dump files can easily
be verified with e.g. sha1sum, and incremental dumps exist, why use svnsync
in the first place if svnadmin dump/load, sha1sum and rsync can effectively
provide the same functionality, but with better verification?

create dump on master
note sha1 checksum
copy dump to clone
load dump on clone
dump revision just loaded
note sha1 checksum
compare checksums

Obviously svnsync did verification *after* something got broken.
But why didn't svnsync detect the checksum mismatch before committing
a broken transaction? I don't know. They could not reproduce the problem
either. But they've seen it and they are afraid of it happening again.
They trust Subversion with their data, and such problems don't exactly
encourage them to keep trusting it. So I think we shouldn't dismiss the
value of internal integrity checks entirely just because most parts of
Subversion don't care [yet]. Of course such checks need to be placed
in the right spots.

Though I agree that relying on content checksums to make sure data copied
from one computer to another has been copied correctly is not that
important anymore these days. We're not copying data across noisy serial
links anymore, and have checksumming on many different levels (e.g.
ethernet CRC and disk controller CRC).
But what additional verification shields us from is subtle bugs in our
own code that cause data corruption. These cannot be caught by lower
layers and evidently seem to exist (or maybe the customer had a hardware
problem, but I doubt that).

So what's nice is that the content checksums allow verification of data
read back from a freshly loaded repository, or even from a transaction
before commit. I.e. the loader can make sure that the data that it wrote
into the repository matches what the dump file expects. This can catch
bugs in the loader code. My plan was to make svnsync do similar checks later.

As for integrity of dump files themselves, we should tell paranoid
people to sign dump files after they've been created, and verify the
signatures independently of Subversion.

So there is some value in having additional verification steps built-in.
We can apply this idea to other subsystems. We now have sha1 checksums of
pristines in the working copy, for instance. This means that it's easy to
detect a corrupt pristine -- open it, read content and verify checksum.
I'm not saying we should do this all the time by default, but like
'svnadmin verify', would could have a command that verifies pristines
(possibly while removing ones that are no longer referenced).
We'll just have to figure out where such checks will be beneficial.

But I don't mind reverting this change if it really bothers you.
It's just a small step on a longer road anyway. I think we should go
down that road and try to catch corrupted data before committing to it,
but dump file property checksums aren't the most important part of that.
Assuming that we may some day get a new FS backend that stores checksums
of property content, could we re-add the feature?

> Now for some nit-picking. This format:
>
> Header-Name: (PROPNAME) CHECKSUM
>
> is unnecessarily complex and, I bet, will catch at least person offguard
> when their dumpstream contains a property name with a ')' character in it.
> Checksum strings have a constant length and a fixed character set, so you
> can simply do:
>
> Header-Name: CHECKSUM PROPNAME
>
> (which also has the nice property of looking similar to the output of
> 'md5sum' and 'sha1sum').

Well, those tools are linux-specific, and other unix systems use
similar output to what I used:

$ echo foo > /tmp/foo
$ sha1 /tmp/foo
SHA1 (/tmp/foo) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15

Though I like your suggestion better, because it will also align
property names nicely.

Stefan
Received on 2010-07-22 16:12:15 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]