Re: Incomplete SVN dump files

From: Andreas Mohr <andi_at_lisas.de>
Date: Wed, 16 Sep 2015 07:47:56 +0200

Hi,

On Tue, Sep 15, 2015 at 05:26:38PM -0700, Eric Johnson wrote:
> I just checked, and there aren't any open bugs about this.
> Interrupting svnrdump can result in a dump file with not all the files of
> the last commit in the dump record. Accidentally use that dump file to
> load into a new repository, and the resulting repository will not be a
> copy of the original.
> My particular use case, I was trying to suck down a large repository.
> Connection interrupted part way through. I resumed from part way through
> (using the --incremental option) into an additional dump file. Then did a
> load of those two dump files. Did not yield a copy of the original
> repository, though.
> This seems like a critical issue for possible data loss when copying
> repositories from machine to machine using svnrdump.

AFAICS (not an svnrdump expert here), very well described and to the point.
You just managed to pinpoint a rather important serialization format
that seemingly isn't fully transaction-safe (i.e. atomic per commit)...
(good catch!)

> I suspect the right solution to this is to put an "end of file" marker at
> the end of a dump stream. If it isn't there, then svnadmin load will see
> its absence, and must discard the last commit.

However, a "file"-related "end of payload" marker does not necessarily cut it,
since the "file" is merely a (rather unrelated) outer transport container
for (a flexible number of) inner sub-elements of data.

Or, IOW, the payload of each and every meaningful sub-element
within the complete payload to be transmitted
ought to (or rather: "MUST"?) be fully verifiable in itself.

To make this more evident:
inferring "discard this broken commit"
  from a completely unrelated/foreign event ("missing transmission end marker")
is a lot more indirect (completely unrelated mechanisms/reasons) than
inferring "discard this broken commit"
  from the commit's own (outer) payload unit
  failing a cryptographic/checksum/length check *of this unit proper*.
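
As a minimal sketch of what "verifiable in itself" could look like (Python, purely illustrative: the function name, the choice of SHA-256 and the sample payload are assumptions, not anything the dump format or svnrdump is known to define), a unit is accepted only if it matches the length and digest recorded for that very unit:

```python
import hashlib


def unit_is_intact(declared_length: int, declared_sha256: str, payload: bytes) -> bool:
    """Return True only if the payload matches the length and digest recorded for it."""
    if len(payload) != declared_length:
        return False  # truncated (or over-long) unit
    return hashlib.sha256(payload).hexdigest() == declared_sha256


# Hypothetical commit payload; the writer records its length and digest up front.
commit = b"rev 42: add foo.c, add bar.c\n"
length = len(commit)
digest = hashlib.sha256(commit).hexdigest()

complete_ok = unit_is_intact(length, digest, commit)
truncated_ok = unit_is_intact(length, digest, commit[:-5])  # interrupted transfer
```

Here the decision to discard comes from the unit's own metadata, not from whether something else happens to follow it in the file.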

(Oh, and what about not just discarding the last commit,
but also detecting/discarding other commits within the stream
which happen to contain breakage?
Talk about fully provided transaction safety...)

And then there is also the question of
whether it's the serialization format itself
which should explicitly add markers
for what constitutes a "complete" sub-unit,
or whether it's the "higher layer"
which should "inherently/implicitly realize"
whether the chunks of data it got
do constitute a "complete" sub-unit
(think layering, e.g. ISO etc.).

OTOH, since the serialization (format)
*is* generated by just *that* higher-level layer
on the producing side, i.e. "the other side" from the parser
(probably also svnrdump, right?),
*that* layer fully defines/controls
the entire serialization format
and thus probably should insert
payload sub-unit boundary/validity markers
(perhaps via a chunked file format or some such).
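
To illustrate the "chunked format" idea, here is a rough sketch (Python; the framing layout, `write_unit` and `read_units` are invented for this example and do not describe the actual dump format): the producing side prefixes each unit with its length and SHA-256 digest, and the reading side reports which units verify, whether the damage is at the end or in the middle of the stream.

```python
import hashlib
import io
import struct

# Hypothetical frame header: 4-byte big-endian length + 32-byte SHA-256 digest.
HEADER = struct.Struct(">I32s")


def write_unit(stream, payload: bytes) -> None:
    """Producer side: emit one self-describing unit."""
    stream.write(HEADER.pack(len(payload), hashlib.sha256(payload).digest()))
    stream.write(payload)


def read_units(stream):
    """Parser side: yield (index, payload), with payload None for a broken unit."""
    index = 0
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # stream ended on a unit boundary
        if len(header) < HEADER.size:
            yield (index, None)  # cut off inside a header
            return
        length, digest = HEADER.unpack(header)
        payload = stream.read(length)
        intact = len(payload) == length and hashlib.sha256(payload).digest() == digest
        yield (index, payload if intact else None)
        if len(payload) < length:
            return  # cut off inside a payload, nothing more to read
        index += 1


buf = io.BytesIO()
for rev in (b"r1", b"r2", b"r3"):
    write_unit(buf, rev)

data = bytearray(buf.getvalue())
data[2 * HEADER.size + len(b"r1")] ^= 0xFF  # flip a byte inside the payload of unit 1
damaged = bytes(data[:-1])                  # and cut off the last byte of unit 2

results = list(read_units(io.BytesIO(damaged)))
```

With this input, unit 0 verifies while units 1 and 2 are flagged as broken. Note that a stream cut off exactly on a unit boundary still looks clean to this reader, so per-unit checks complement an end marker (or a unit count) rather than replace it; and a corrupted length field would lose the framing for everything after it.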

But these thoughts of mine about this topic
could possibly be relegated to the "ramblings" area,
since after all it's a simple(?) matter
of thoroughly researching current "Best Practice"
for implementing transaction-safe serialization formats
and then simply achieving just such a correct implementation... ;)

Andreas Mohr
Received on 2015-09-16 07:48:09 CEST

This is an archived mail posted to the Subversion Users mailing list.