Re: FSFS "read length line" repository corruption -- still not fixed in 1.4.2

From: Malcolm Rowe <malcolm-svn-dev_at_farside.org.uk>
Date: 2007-02-14 12:02:09 CET

Hi Gergely,

On Tue, Jan 30, 2007 at 12:12:23PM +0100, Gergely Kis wrote:
> This is the famous "read length line" error, which can be fixed (in some
> situations) by John Szakmeister's also famous fsfsverify.py.
>
> The problem is that because of the nature of the bug (for a detailed
> description please look at http://svn.haxx.se/dev/archive-2006-02/0473.shtml
> ) it is not easy for the developers to reproduce the issue.
>
> Now with the release of the SVN 1.4.2 the developers believed that this
> issue was fixed.
>
> Unfortunately I can confirm, that this is not the case.
>

Thanks for the report. We did fix one way of causing this problem in 1.4.2,
but we were never able to prove that the bug we fixed was the one behind
the reports we had been getting (it produced exactly the same results,
so we suspected that it was).

Could you provide a little more detail about the corruption you're seeing?
The error you're getting indicates some form of corruption, though it
might not be of the same type as the one we fixed. Roughly how frequently
are you seeing this problem?

Is there any chance you can provide a corrupt revision file? (privately,
if necessary). Alternatively, if you're able to break apart the
revision file itself, can you verify whether you're seeing exactly the
same corruption as I described in the above email?

Could you also check to see whether you've any reports of an error message
starting "Cannot write to the prototype revision file of transaction..."?
It might appear in the Apache log, or it might have just been sent to
the client. If so, that should indicate that you have tripped over
the problem we _did_ fix in 1.4.2. If you're hitting it frequently,
there may still be a race condition allowing the problem to occur.
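One way to run that check is to scan the log for the message prefix. The
following is only a minimal sketch: the marker string is the prefix quoted
above, while the function names and the example log path are illustrative
assumptions, not part of Subversion or Apache.

```python
import sys

# Prefix quoted in the message above; the rest of the line names the transaction.
MARKER = "Cannot write to the prototype revision file of transaction"


def find_marker_lines(lines, marker=MARKER):
    """Return (line_number, text) pairs for every line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]


def scan_log(path):
    """Scan one log file (path supplied by the caller) for the marker."""
    # errors="replace" so stray non-UTF-8 bytes in a log do not abort the scan.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_marker_lines(handle)


if __name__ == "__main__":
    # Hypothetical usage: python scan_log.py /path/to/apache/error.log
    for log_path in sys.argv[1:]:
        for number, text in scan_log(log_path):
            print(f"{log_path}:{number}: {text}")
```

If the message only ever reached the client, it will not show up in the
server log, so an empty result here is not conclusive.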

Have you any way to verify your (real, not Subversion) filesystem or
physical disks? Some of the problems we've seen in the past were traced
to failing disks.

> I used custom built packages because:
> - I read that the APR 0.9.6, which is included with apache 2.0.54 does not
> report errors on buffered streams, and that this behaviour was fixed in
> 0.9.7 and later. It was suggested in the list archives that this could
> increase the possibility of this bug.

That was an initial theory from when we were wondering whether an
unreported disk-write failure could cause this problem. Using the newer
APR certainly won't hurt, but the reproduction for this problem in 1.4.0
and below was based upon an undetected API violation (presumably in
mod_dav_svn; we're still not sure exactly how it occurs).

> - libdb4.4 includes those new APIs, which enable Subversion to automatically
> recover BDB repositories from an inconsistent state.
>

But you're not using BDB repositories, so this shouldn't make any
difference?

> I think I have done everything to minimise the risks, but I would like to make
> sure that:
> -this issue receives greater publicity (the Subversion Development Team
> should release an advisory about this issue with suggested workarounds)
> -it gets fixed in the foreseeable future.
>

We'd certainly like to fix it. Unfortunately, the facts are:

- This is a problem that's been reported a _very_ small number of times
(at most 10-15 users over the lifetime of FSFS).

- Some of those problems were definitively traced to failing hard drives
or filesystem corruption.

- We finally managed to find a way to cause the same sort of corruption
in 1.4.0, and we fixed that in 1.4.2. We still don't really know
whether that's the cause of the problems that have been reported.

- None of the developers or users has ever been able to reproduce the
problem on-demand.

- We have very little information about the problem other than that it
seems to be connected to mod_dav_svn (though it's possible that that's
just due to the relative popularity of the ra_dav access method).

So there really isn't a great deal of information that we could put in
an advisory. We're certainly not pretending that FSFS (or BDB for that
matter) is completely free of bugs, but the small chance of hitting this
problem - and the limited amount we can actually say about it - suggests
that issuing an advisory would probably do more harm than good.

> I would also ask the SVN developers about the suggested course of action.
>
> Right now I see two alternatives:
> 1. convert all repositories to BDB and deal with the issues of BDB (library
> upgrades, possible corruptions at file system full, backup)

I'm nowhere near as familiar with the BDB-based filesystem implementation,
but I know that it's significantly improved over what we shipped in,
say, 1.1.x, so you could consider it, sure.

> 2. install a post-commit hook script to check each revision after the commit
> and send a mail to the admins / who checked in that the revision is corrupt.
>

That's probably an easier option - you only need to verify the single
revision that's been committed, and it can be done asynchronously in a
post-commit hook.
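As a rough illustration of that second option, here is a minimal sketch of
a script that a post-commit hook could call with the repository path and
revision number it receives. It assumes an svnadmin whose "verify"
subcommand accepts -r; if yours does not, reading the revision back with
"svnadmin dump --incremental -r REV" and discarding the output is one
possible substitute. The function names and the report() placeholder are
hypothetical.

```python
#!/usr/bin/env python3
import subprocess
import sys


def build_verify_command(repos, rev, svnadmin="svnadmin"):
    """Build a command that reads back exactly one revision.

    Assumption: this svnadmin accepts -r on 'verify'.
    """
    return [svnadmin, "verify", "-q", "-r", str(rev), repos]


def run_check(command):
    """Run the command and return (ok, stderr_text)."""
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.returncode == 0, result.stderr


def report(repos, rev, details):
    # Placeholder: replace with mail delivery to the admins and the committer.
    sys.stderr.write(f"r{rev} in {repos} failed verification:\n{details}\n")


def main(repos, rev):
    ok, details = run_check(build_verify_command(repos, rev))
    if not ok:
        report(repos, rev, details)
    return ok


if __name__ == "__main__" and len(sys.argv) == 3:
    # A post-commit hook is invoked with REPOS-PATH and REV as its arguments.
    main(sys.argv[1], sys.argv[2])
```

As written, the script blocks until the check finishes. To keep the
committing client from waiting, the hook itself would need to start it in
the background with its output redirected.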

Regards,
Malcolm

Received on Wed Feb 14 12:02:28 2007
