On 07/22/2010 10:11 AM, Stefan Sperling wrote:
> So we should not have added the additional checksums in the first place?
> You might as well point fingers at the old backends for not storing checksums,
> rather than at Hyrum trying to do his best with what he had.
You have to define "his best" here. Hyrum could have done his best to avoid
adding to the cost of dumpfile creation by passing FALSE for the 'force'
parameter to svn_fs_file_checksum(). Hyrum could have done his best to
carry self-validating information for all file contents by passing TRUE,
instead, and then changing the surrounding code to expect a valid
checksum from those function calls. Given the state of the code today,
Hyrum didn't do his best at either of those approaches. :-)
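
For illustration, here's roughly what those two call sites look like. (This
is a sketch, not the actual dump.c code; 'root', 'path', and 'pool' stand in
for whatever the real caller has in hand.)

   svn_checksum_t *checksum;

   /* Cheap: hand back a stored SHA-1 if the backend has one; the result
      may be NULL when it doesn't. */
   SVN_ERR(svn_fs_file_checksum(&checksum, svn_checksum_sha1,
                                root, path, FALSE /* force */, pool));

   /* Always valid, but potentially expensive: compute the SHA-1 from the
      file contents when the backend doesn't already store one. */
   SVN_ERR(svn_fs_file_checksum(&checksum, svn_checksum_sha1,
                                root, path, TRUE /* force */, pool));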
>> (Some of those may reduce away if the
>> backend is able to store the SHA1's forcibly generated from previous requests.)
>
> Is this already implemented? So if people upgrade the repository format,
> over time the dump performance will get better again?
It probably is for BDB repositories.
It probably is not for FSFS repositories.
I probably won't use this opportunity to gripe about FSFS. Again. Or maybe
I will.
> Let me explain why I did this:
>
> The idea for this change came to me in a conversation with Michael Diers.
> He said he saw no way of verifying that svnsync created proper duplicates
> of a repository. He had seen svnsync breaking repositories at a customer
> site like this:
>
> Transmitting file data .svnsync: Base checksum mismatch on 'foobar':
> expected: a9c9eb4994b50f9667e0c1509782df26
> actual: ae3957e324c0ee2228d76732c9a89747
[...]
> Obviously svnsync did verification *after* something got broken.
> But why didn't svnsync detect the checksum mismatch before committing
> a broken transaction? I don't know. They could not reproduce the problem
> either. But they've seen it and they are afraid of it happening again.
svnsync doesn't do any checksum verification. That work is done by deeper
levels. But you provide no support for the implication that Subversion
didn't attempt to verify checksums at all the reasonable times.
The error you show above happens deep inside the FS implementations' DAG
layer, as a sanity check on a text-base before applying a delta to it. It's
probably comparing the checksum stored in the FS for a given text-base
against the checksum coming across the wire as the delta-base checksum. The
FS layer doesn't accept untrusted checksums. That is, you can't tell the FS
what the checksum for a file's contents is -- you can only tell it the
file's contents, and you can tell it what you expect the checksum to be
after you store new file contents. I suspect the customer had actual
repository corruption that was only detectable by svnsync after the fact
because svnsync had nothing to do (directly, at least) with that corruption
in the first place.
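
To make that concrete, this is roughly the shape of the interface that
dump/load and svnsync drive (a sketch only; 'txn_root' and the hex digest
strings are hypothetical placeholders):

   svn_txdelta_window_handler_t handler;
   void *handler_baton;

   /* The FS never accepts a checksum to store on faith.  It accepts the
      contents (as delta windows pushed through 'handler'), plus optional
      expectations it verifies itself:
      - base_md5_hex: what the caller believes the existing text-base
        hashes to; a disagreement with the FS's own stored checksum is
        exactly the "Base checksum mismatch" error quoted above.
      - result_md5_hex: what the caller expects the contents to hash to
        once the delta has been applied. */
   SVN_ERR(svn_fs_apply_textdelta(&handler, &handler_baton,
                                  txn_root, "foobar",
                                  base_md5_hex, result_md5_hex, pool));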
> They trust Subversion with their data, and such problems don't exactly
> encourage them to keep trusting it. So I think we shouldn't dismiss the
> value of internal integrity checks entirely just because most parts of
> Subversion don't care [yet]. Of course such checks need to be placed
> in the right spots.
That's fair. I certainly don't want to play down the importance of
Subversion's ability to reliably version folks' data! I see the error above
as proof that Subversion is doing exactly that, not that it has failed to do so.
> Though I agree that relying on content checksums to make sure data copied
> from one computer to another has been copied correctly is not that
> important anymore these days. We're not copying data across noisy serial
> links anymore, and have checksumming on many different levels (e.g.
> ethernet CRC and disk controller CRC).
> But what additional verification shields us from is subtle bugs in our
> own code that cause data corruption. These cannot be caught by lower
> layers and evidently seem to exist (or maybe the customer had a hardware
> problem, but I doubt that).
>
> So what's nice is that the content checksums allow verification of data
> read back from a freshly loaded repository, or even from a transaction
> before commit. I.e. the loader can make sure that the data that it wrote
> into the repository matches what the dump file expects. This can catch
> bugs in the loader code. My plan was to make svnsync do similar checks later.
>
> As for integrity of dump files themselves, we should tell paranoid
> people to sign dump files after they've been created, and verify the
> signatures independently of Subversion.
Okay. It seems (if I'm reading you correctly) that we agree we shouldn't be
using checksums to verify the integrity of the dump stream. That the value
to be had here is in verifying that the target repository, once loaded, has the
right data.
So may I suggest that, when dealing with property lists, it's just a lot
simpler (especially with our APIs) to compare the before-and-after property
lists than to trade in the currency of checksums? I mean, for file contents
it's arguably cheaper to do a checksum compare than a byte-for-byte
comparison (which isn't even possible in --deltas mode), and that mostly
because the checksum is pre-computed. But for properties? We don't have
the checksums handy (so there's extra cost to compute them), and we hold
those things in toto in memory all the time, and have handy routines for
finding diffs. I dunno. The property checksums just seem unnecessary in
the whole scheme of things to me.
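
Something like this is all I have in mind -- a minimal sketch, assuming we
already hold both proplists as apr_hash_t's ('expected_props' from the dump
stream, 'actual_props' read back from the repository; the names and the
error handling are made up):

   apr_array_header_t *propdiffs;

   /* svn_prop_diffs() gives back the svn_prop_t changes needed to turn
      one proplist into the other; an empty array means they match. */
   SVN_ERR(svn_prop_diffs(&propdiffs, actual_props, expected_props, pool));
   if (propdiffs->nelts > 0)
     return svn_error_createf(SVN_ERR_FS_GENERAL, NULL,
                              "Property lists differ on '%s'", path);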
(And I will further suggest that we fix the dump.c code to pass FALSE for
the 'force' parameter of svn_fs_file_checksum() calls, thus restoring the
pre-Hyrum's-change behavior.)
--
C. Michael Pilato <cmpilato_at_collab.net>
CollabNet <> www.collab.net <> Distributed Development On Demand