Symmetry between dump and load

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Fri, 19 Dec 2014 12:23:11 +0000

I believe the following symmetries should hold and should be testable, and we should test them.

For any valid repository:

* we can dump it
* we can load the dump file into a new repository
* the new repo is equivalent to the old repo

For any valid dump file:

* we can load it into a new repository
* we can dump that repository
* the new dump file is equivalent to the old dump file
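
As a minimal sketch, the two symmetries above can be written as property checks that take the dump, load and equivalence functions as parameters. Everything named here (toy_dump, toy_load, lossy_load and the JSON "repository") is a hypothetical stand-in for illustration, not Subversion code; a real harness would drive the actual dump and load implementations instead.

```python
import json

def repo_roundtrips(repo, dump, load, equivalent):
    """Symmetry 1: dump a repository, load the dump, compare repositories."""
    return equivalent(repo, load(dump(repo)))

def dump_roundtrips(dump_data, dump, load, equivalent):
    """Symmetry 2: load a dump file, dump the result, compare dump files."""
    return equivalent(dump_data, dump(load(dump_data)))

# Toy stand-in (assumption): a "repository" is a list of revisions,
# each mapping a path to a dict of properties.
def toy_dump(repo):
    return json.dumps(repo, sort_keys=True)

def toy_load(dump_data):
    return json.loads(dump_data)

def lossy_load(dump_data):
    """A deliberately buggy loader that drops every svn:mergeinfo property."""
    repo = json.loads(dump_data)
    for revision in repo:
        for props in revision.values():
            props.pop("svn:mergeinfo", None)
    return repo

def same(a, b):
    return a == b

repo = [{"/trunk": {}}, {"/branch": {"svn:mergeinfo": "/trunk:1"}}]
print(repo_roundtrips(repo, toy_dump, toy_load, same))
print(repo_roundtrips(repo, toy_dump, lossy_load, same))
```

The second call reports a failure because the buggy loader silently discards data, which is the kind of asymmetry the issues listed below describe.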

WHY?

This thought was triggered after noticing that we keep finding more and more asymmetries (that is, bugs) in dump and load. Most of the ones I have paid attention to are related to mergeinfo. Examples:

#3912 svnadmin load does fail to process dumps with non UTF-8 path names
#4414 dump/load with invalid mergeinfo
#4476 Mergeinfo containing r0 makes svnsync and dump and load fail
#4492 svnrdump load assertion failure if Node-path starts with a slash
#4538 'load' strips r1 references in mergeinfo
#4539 Need a way to 'load' a dump without munging mergeinfo
#4573 mergeinfo parsing inconsistency: empty path

Why does this matter? Users care about stability. Waiting for a bug to show up, fixing it, and adding a regression test for that particular case gets us only so far. We could be pro-active, and go looking for these sorts of bugs much more aggressively. I think we should.

Why should we declare that these symmetries hold? Because we defined dump and load to be the canonical (or "lowest common denominator") back-up mechanism: its whole purpose is to represent the content of a repository unambiguously and completely and transfer that content to a different repository. (Oops, it fails in the "completely" department: it doesn't represent locks, for one thing.) And because we rely on these symmetries in our understanding and maintenance of the software.

Why should these symmetries be so tight that they can be mechanically tested, without an unmanageable number of intentional differences? Because we can't produce solid software if we can't test it!

HOW?

The meanings of "valid" and "equivalent" will need to be defined carefully. Here are some starting points for definitions.

"valid repository":
The result of any combination of:

* calling any libsvn_repos or higher level APIs, even with bad parameters and including calls that fail;
* calling APIs below libsvn_repos, in appropriate ways, with appropriate parameters and taking appropriate action if calls fail;
* starting with a "valid repository" produced by an older released version of Subversion, even if we consider that version to be buggy.

"valid dump file":
Any file that can be loaded without the loader throwing an error.

"equivalent repositories"
Repositories that:

* when queried through libsvn_repos or higher level APIs, yield identical results; and
* when dumped, yield identical dump files.

"equivalent dump files"
Dump files that, when loaded, yield equivalent repositories.
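
These starting-point definitions translate fairly directly into predicates. The sketch below is only an illustration under assumptions: the function names, the list of queries and the JSON toy format are all hypothetical, and json.loads stands in for the real loader. It mainly shows that dump-file equivalence is defined through loading, so two dump files can differ byte for byte and still be equivalent.

```python
import json

def is_valid_dump(dump_data, load):
    """A dump file is 'valid' if the loader accepts it without an error."""
    try:
        load(dump_data)
    except Exception:
        return False
    return True

def repos_equivalent(repo_a, repo_b, queries, dump):
    """Equivalent repositories: every query agrees, and so do the dumps."""
    return (all(query(repo_a) == query(repo_b) for query in queries)
            and dump(repo_a) == dump(repo_b))

def dumps_equivalent(dump_a, dump_b, load, same_repo):
    """Equivalent dump files: they load into equivalent repositories."""
    return same_repo(load(dump_a), load(dump_b))

# Toy stand-ins (assumptions): a repository is a dict of path -> content,
# and the only "queries" are its size and its sorted list of paths.
def toy_dump(repo):
    return json.dumps(repo, sort_keys=True)

def toy_same_repo(repo_a, repo_b):
    return repos_equivalent(repo_a, repo_b, queries=[len, sorted], dump=toy_dump)

# Different bytes, same content once loaded.
print(dumps_equivalent('{"a": 1, "b": 2}', '{"b": 2, "a": 1}',
                       json.loads, toy_same_repo))
```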

FUZZING

How can we possibly test all valid repositories and all valid dump files? Not by hand-crafted test cases, that's certain. However, the technique of repeatable, pseudo-random testing, aka "fuzzing", can enable us to approach closer and closer to complete test coverage, the more time we throw at it. Forget the idea that a test case has to have a predetermined coverage and has to run to completion every time we run "the tests". Instead, when run as part of the normal test suite, this "fuzzer" would generate a small number of test cases from pseudo-random inputs, and run them. These would be different each time it runs.

The "repeatable" part is that, whenever a generated test case fails, the parameters would be logged in a way that allows that specific case to be re-generated. Then it can be examined, re-tested against different builds, and, if it detected a real bug, inserted into the test suite as a separate, static regression test to be run every time.

The test code would also have a mode that tells it to keep generating and running pseudo-random test cases for a long or unlimited time.
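
A rough sketch of such a driver follows. It is not an existing Subversion test tool: generate_case, run_fuzz and their parameters are hypothetical names, and the generated "case" is a toy structure. The point it illustrates is that each case is determined entirely by one seed, so logging the seed of a failing case is enough to regenerate it later.

```python
import random
import time

def generate_case(seed):
    """Build one pseudo-random test case; the seed alone determines it."""
    rng = random.Random(seed)
    n_revs = rng.randint(1, 5)
    return [{"/f%d" % rng.randint(0, 9): rng.randint(0, 99)}
            for _ in range(n_revs)]

def run_fuzz(check, count=None, seconds=None, master_seed=None):
    """Run check() on generated cases and return the seeds that failed.

    count   -- number of cases, for a normal test-suite run
    seconds -- time budget, for a long session
    With neither given, the loop runs until interrupted.
    """
    if master_seed is None:
        # Different cases on each run unless a master seed is supplied.
        master_seed = random.SystemRandom().randrange(2**32)
    seeds = random.Random(master_seed)
    deadline = None if seconds is None else time.monotonic() + seconds
    failures = []
    done = 0
    while True:
        if count is not None and done >= count:
            break
        if deadline is not None and time.monotonic() >= deadline:
            break
        seed = seeds.randrange(2**32)
        if not check(generate_case(seed)):
            # Log enough to regenerate exactly this case later.
            print("FAIL: regenerate with generate_case(%d)" % seed)
            failures.append(seed)
        done += 1
    return failures
```

A normal test run might call run_fuzz(check, count=20), while a long session might call run_fuzz(check, seconds=3600). A failing seed can then be turned into a static regression test by saving the output of generate_case(seed).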

OTHER SYMMETRIES

Subversion is quite rich in symmetries, more so than some other software because its job is to preserve data.

* svnrdump dump and load should be symmetrical. They should also be equivalent to svnadmin dump and load respectively, except as modified by RA layer constraints.

* svnsync should directly create an equivalent repository.

* Any query to a write-through proxy should return the same result as querying the master.

* Most of the Subversion library APIs have read and write interfaces which should be (broadly) symmetrical. Major ones include FSFS; FS; repos; delta; diff(+patch); RA; and to some extent WC.

* Many low-level two-way conversions should be symmetrical: reading/writing config files, parsing/unparsing mergeinfo.

* Getting more advanced... Any change or series of changes committed to 'trunk', we should be able to commit instead to a branch and then merge to trunk. If there were no changes (or no conflicting changes) made on trunk in the meantime, the end result should be identical.

* 'svn diff -rX:Y' and 'svn diff -rY:X' should be mirror images.

* and many more!
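
To make one of the low-level cases above concrete, here is a sketch of a parse/unparse round trip for mergeinfo. The parser is deliberately simplified and is not Subversion's own: it assumes one "path:rangelist" entry per line, ranges written as "N" or "N-M", and a trailing "*" marking a non-inheritable range. The function names are hypothetical.

```python
def parse_mergeinfo(text):
    """Parse 'path:ranges' lines into {path: [(start, end, inheritable)]}."""
    result = {}
    for line in text.splitlines():
        # Split on the last colon, since the range list never contains one.
        path, _, rangelist = line.rpartition(":")
        ranges = []
        for item in rangelist.split(","):
            inheritable = not item.endswith("*")
            start, _, end = item.rstrip("*").partition("-")
            ranges.append((int(start), int(end or start), inheritable))
        result[path] = ranges
    return result

def unparse_mergeinfo(mergeinfo):
    """Write the parsed form back out, one path per line, sorted by path."""
    lines = []
    for path in sorted(mergeinfo):
        items = []
        for start, end, inheritable in mergeinfo[path]:
            item = str(start) if start == end else "%d-%d" % (start, end)
            items.append(item if inheritable else item + "*")
        lines.append(path + ":" + ",".join(items))
    return "\n".join(lines)

text = "/branches/b:3-5,8*\n/trunk:1-2"
parsed = parse_mergeinfo(text)
print(unparse_mergeinfo(parsed) == text)
print(parse_mergeinfo(unparse_mergeinfo(parsed)) == parsed)
```

Even in this toy version the symmetry is only exact for canonical input: "/trunk:4-4" comes back as "/trunk:4". A test would either have to define such cases as equivalent or treat them as bugs.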

Thoughts?

- Julian
Received on 2014-12-19 13:25:13 CET
