Re: svn commit: r965892 - in /subversion/trunk: notes/dump-load-format.txt subversion/include/svn_repos.h subversion/libsvn_repos/dump.c

From: C. Michael Pilato <cmpilato_at_collab.net>
Date: Thu, 22 Jul 2010 09:00:25 -0400

On 07/22/2010 06:15 AM, Stefan Sperling wrote:
> On Thu, Jul 22, 2010 at 09:53:35AM +0100, Philip Martin wrote:
>> Perhaps we could just have one checksum that includes all the property
>> names and values?
>
> I'd like loaders to be able to tell users which properties are
> corrupted.

This change feels obnoxious to me, adding still more cost to the dump
process for little theoretical gain and zero practical gain (since our own
loader does nothing with the information).

When the dump stream format and code was first designed, if the FS backend
hadn't calculated and stored a checksum for the file contents, no checksum
was placed into the dumpstream. Further, when in --deltas mode, the
checksum you (might) get in the dumpstream is, again, the checksum stored in
the FS backend for the fulltext of the file contents -- it doesn't represent
the checksum of the svndiff-compressed delta in the stream. This was the
overarching sentiment of the dump format: to carry information stored in
the FS backend, as much as can be useful to the folks on the receiving end,
but certainly without unnecessary cost to the dumpfile producer.

Back when Hyrum did his checksum API overhaul (in r872722, to be precise),
he made a subtle change here that deviated from this sentiment. He told the
dumpstream generator code to forcibly generate checksums for placement in
the dump stream (he passed 'TRUE' for the 'force' parameter of the new
svn_fs_file_checksum() function everywhere). The code today is left reading
kinda weird as a result:

   checksum = svn_fs_file_checksum(FORCE)
   hex_digest = svn_checksum_to_string(checksum)
   if (hex_digest)
      ... #### Wait -- "*if*" hexdigest? You forced its creation above!

Given the vast amount of Subversion data in existence today that probably
doesn't have an FS-stored SHA1 checksum (since we introduced that only in
the past couple of years), the immediate affect of these changes was that
the dumper code was now running over every fulltext of every file an
additional time just to generate SHA1's (since older repositories didn't
store SHA1's). And then again for copyfrom sources. And then again for
text bases (when in --deltas mode). (Some of those may reduce away if the
backend is able to store the SHA1's forcibly generated from previous requests.)

Hyrum also changed the loader code so that it forcibly acquires a checksum
for copyfrom sources for comparison against the copyfrom-checksum values
from the stream.

As if dump/load needed to run any slower.

You've extended this behavioral change now. Tell me -- if the FS backend
and the rest of Subversion don't care enough about the possibility of
property corruption to store and transmit checksums for those things,
where's the real value in teaching the dump format (alone) to do so?

Now for some nit-picking. This format:

Header-Name: (PROPNAME) CHECKSUM

is unnecessarily complex and, I bet, will catch at least person offguard
when their dumpstream contains a property name with a ')' character in it.
Checksum strings have a constant length and a fixed character set, so you
can simply do:

Header-Name: CHECKSUM PROPNAME

(which also has the nice property of looking similar to the output of
'md5sum' and 'sha1sum').

-- 
C. Michael Pilato <cmpilato_at_collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

application/pgp-signature attachment: OpenPGP digital signature

Received on 2010-07-22 15:01:28 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]