Re: Handling non-reproducible test failures (was Re: 1.7.20 release candidates up for testing/signing)

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Fri, 20 Mar 2015 20:13:57 +0100

On Fri, Mar 20, 2015 at 10:30 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:

> On Fri, Mar 20, 2015 at 9:42 AM, Johan Corveleyn <jcorvel_at_gmail.com>
> wrote:
> ...
> > Unfortunately, I can't verify the rev
> > file, since I don't have it anymore, it has been overwritten by trying
> > to reproduce it (grrrr, should remember to back up leftover
> > repositories and working copies after a failed test run, before trying
> > to reproduce it). Whatever I try now, I can't reproduce it anymore
> > :-(.
>
> I'm wondering if something can be improved in our test suite to help
> diagnosis of hard-to-reproduce test failures. When this happens, you
> typically wish you could analyse as much data as possible (i.e. the
> potentially corrupt repository, working copy, dump file, ... that was
> used in the test).
>
> Currently, I can think of three causes for losing this information:
>
> 1) You run a series of test runs in sequence from a script
> (ra_local, ra_svn, ra_serf), all using the same target directory for
> running the tests (R:\test in my case, where R: is a ram drive). If
> something fails in ra_svn, but succeeds in ra_serf, your broken test
> data is overwritten.
>
> 2) You don't know in advance that the failure will turn out to be
> non-reproducible. You can't believe your eyes, try to run it again to
> be sure, and lo and behold, the test succeeds (and the broken test
> data is overwritten), and succeeds ever since.
>
> 3) Your test data is on a RAM drive, and you reboot or something. Or
> you copy the data to a fixed disk afterwards, but lose a bit of
> information because last-modified timestamps of the copied files are
> reset by copying them between disks.
>
>
> For 1, maybe the outer script could detect that ra_svn had a failure
> and stop there. Does win-tests.py emit an exit code != 0 if there is a
> test failure? That would make it easy. Otherwise the outer script
> would have to parse the test summary output.
>
> Another option is to let every separate test run (ra_local, ra_svn,
> ra_serf) use a distinct target test directory. But if you're running
> them on a RAM disk, theoretically you might need three times the
> storage (hm, maybe not, because --cleanup ensures that successful test
> data is cleaned up, so as long as you don't run the three ways in
> parallel, it should be fine). I guess I will do that already, and
> adjust my script accordingly.
>
>
> Addressing 2 seems harder. Can the second test execution, on
> encountering stale test data, put that data aside instead of
> overwriting it? Or maybe every test execution can use a unique naming
> pattern (with a timestamp or a pid) so it doesn't overwrite previous
> data? Both approaches would leak data from failed test runs of course,
> but that's more or less the point. OTOH, you don't know that stale
> test data is from a previous failed run, or from a successful run that
> did not use --cleanup.
>
>
> And 3, well, I already reboot as little as possible, so this is more
> just a point of attention.
>
>
> Maybe one way to address all three at the same time: after a failed
> test run, automatically zip the test data and copy it to a safe
> location (e.g. the user's home dir).
>
> Thoughts?
>
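
For cause 1, the "stop at the first failure, one directory per layer"
idea could be sketched roughly as below. This is only an illustration,
not code from the test suite: run_layer is a hypothetical callable
standing in for whatever actually launches the tests, and the sketch
assumes it returns a nonzero exit code on failure, which is exactly the
open question about win-tests.py above.

```python
import os


def run_layers(layers, base_dir, run_layer):
    """Run each layer in its own directory; stop at the first failure.

    run_layer(layer, test_dir) is a hypothetical callable that starts the
    test run and returns its exit code (assumed nonzero on failure).
    Returns (failed_layer, test_dir), or None if every layer passed.
    """
    for layer in layers:
        # One directory per layer, so a later run cannot overwrite
        # the leftovers of an earlier, failed one.
        test_dir = os.path.join(base_dir, layer)
        os.makedirs(test_dir, exist_ok=True)
        if run_layer(layer, test_dir) != 0:
            # Stop here: the remaining layers are not run at all.
            return layer, test_dir
    return None
```

Called with ["ra_local", "ra_svn", "ra_serf"], a failure in ra_svn
would leave its directory untouched and ra_serf would never start.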

I haven't thought too deeply about it but I think we should
be able to extend the current repo / w/c cleanup infrastructure
to copy the data away upon test failure.
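
As a rough illustration of "copying the data away", and not a
description of the existing cleanup code: the sketch below zips a test
directory into a safe location under a name that includes a timestamp
and the pid, so a later run does not overwrite it. The function name
and the test_dir, safe_dir and label parameters are all made up for the
example, and safe_dir is assumed to lie outside test_dir.

```python
import os
import shutil
import time


def preserve_test_data(test_dir, safe_dir, label):
    """Zip test_dir into safe_dir and return the path of the archive.

    Hypothetical helper: the archive name combines label, a timestamp
    and the current pid so that separate runs get separate archives.
    """
    os.makedirs(safe_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    base_name = os.path.join(
        safe_dir, "%s-%s-%d" % (label, stamp, os.getpid()))
    # make_archive appends ".zip" and stores paths relative to test_dir.
    return shutil.make_archive(base_name, "zip", root_dir=test_dir)
```

Two calls with the same label within the same second and from the same
process would produce the same name, so a real version would need a
finer-grained suffix if that can happen.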

-- Stefan^2.
Received on 2015-03-20 20:15:08 CET
