Fixing FSFS 'Corrupt node-revision' and 'Corrupt representation' errors

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Wed, 06 Oct 2010 11:21:44 +0100

We found some corruption in an FSFS repository we were using at work. I
have written a script (attached) to fix most but not all of it.

WHAT WERE THE SYMPTOMS?
-----------------------

The version of mod_dav_svn being used was 1.6.9.

A user got an error trying to commit one particular file, and also when
attempting to check out a fresh WC. I don't have details of these.

Then 'svnadmin verify' was run on the repo, and revealed several corrupt
revisions, with the following three kinds of error:

* svnadmin: Corrupt node-revision '5-12980.0.r12980/5571'
  svnadmin: Found malformed header in revision file

* svnadmin: Corrupt representation '13001 1496 2082 16645 [...]'
  svnadmin: Malformed representation header

* svnadmin: Reading one svndiff window read beyond the end of the
  representation

There were dozens of the first kind, a few of the second kind and one or
two of the third kind.
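
For illustration only (this is not taken from the attached script), the
three kinds of error can be told apart by matching the message text. The
function name 'classify' and the labels it returns are made up for this
sketch; the patterns are based solely on the messages quoted above.

```python
import re

# Patterns built from the three messages quoted in this post.
_NODE_REV = re.compile(r"Corrupt node-revision '([^']+)'")
_REP = re.compile(r"Corrupt representation '([^']+)'")
_SVNDIFF = re.compile(
    r"Reading one svndiff window read beyond the end of the\s+representation"
)


def classify(message):
    """Return (kind, detail) for one block of 'svnadmin verify' output.

    'kind' is a hypothetical label; 'detail' is the quoted id or
    representation string when the message carries one, else None.
    """
    m = _NODE_REV.search(message)
    if m:
        return ("node-revision", m.group(1))
    m = _REP.search(message)
    if m:
        return ("representation", m.group(1))
    if _SVNDIFF.search(message):
        return ("svndiff-window", None)
    return ("unknown", None)
```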

The corrupt revisions were spread over a period of a few weeks, with no
corrupt revisions before that or after that. We know of nothing special
about that time period.

ANALYSIS
--------

I used both plain text searching and John Szakmeister's 'fsfsverify.py'
to help analyze the revision files. Here are just the brief results of
what I found.

Most of the 'Corrupt node-revision' errors were due to the byte-offset
part of the node-rev id being wrong. This error occurred with many
different node-rev ids. A corrupt revision contained from one to
several ids with wrong byte-offsets. Each particular node-rev id
appeared in several different revisions after the one in which it was
created, and it appeared correctly in some of them and wrongly in
others, with no discernible pattern. Every time it appeared wrongly, it
had the same wrong value, so there were only two variants of each
node-rev id: the right one and the wrong one. The byte-offset was
always fairly close to the correct value, but off by about 5 to 500
bytes. The wrong byte-offset did not point to any special place in the
target revision file, such as the start or end of a data blob, so
svnadmin reported 'Found malformed header'.
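
To make the "right id versus wrong id" comparison concrete, here is a
minimal Python sketch that splits a node-rev id into its parts and
compares two ids while ignoring the byte-offset. It assumes the id has
the shape NODE-ID.COPY-ID.rREV/OFFSET, as in the example above; the
names 'NodeRevId', 'parse_node_rev_id' and 'same_ignoring_offset' are
made up for illustration and are not from the attached script.

```python
import re
from typing import NamedTuple


class NodeRevId(NamedTuple):
    node_id: str   # e.g. '5-12980'
    copy_id: str   # e.g. '0'
    rev: int       # revision in which the node-rev was created
    offset: int    # byte-offset of the node-rev in that revision file


# Assumed layout: NODE-ID.COPY-ID.rREV/OFFSET
_ID_RE = re.compile(r"^([^.]+)\.([^.]+)\.r(\d+)/(\d+)$")


def parse_node_rev_id(text):
    m = _ID_RE.match(text)
    if m is None:
        raise ValueError("not a node-rev id: %r" % text)
    return NodeRevId(m.group(1), m.group(2), int(m.group(3)), int(m.group(4)))


def same_ignoring_offset(a, b):
    """True if two ids differ at most in their byte-offset part."""
    return parse_node_rev_id(a)[:3] == parse_node_rev_id(b)[:3]
```

With this, an id and its wrong-offset variant compare as the same
node-rev, whereas two ids with different node-id components do not.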

One or two 'Corrupt node-revision' errors were wrong in another way. A
directory entry reference to a subdirectory named 'X' (not its real
name) had the exact value 'dir 6-12953.0.r12953/30623'. Exactly one of
the node-revs created in r12953 was named 'X', and it was a directory at
the right path, and its node-rev id was '0-12953.0.r12953/30403'.
Therefore I concluded that that is the correct replacement. Note that
both the node-id component and the byte-offset part were wrong.

The 'Corrupt representation' errors were also due to a byte-offset being
wrong. The second number, '1496' in the above example, is supposed to
be the byte-offset in the revision file. Like the node-rev byte
offsets, these were typically off by a small amount.
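
As a rough sketch, the leading fields of the representation string from
the error message can be pulled apart like this. Only the meaning of
the second field (the byte-offset) is stated above; reading the first,
third and fourth fields as revision, stored size and expanded size is an
assumption based on the usual FSFS layout, and the names 'RepRef' and
'parse_rep_ref' are hypothetical.

```python
from typing import NamedTuple


class RepRef(NamedTuple):
    rev: int            # assumed: revision file holding the representation
    offset: int         # byte-offset in that revision file (the suspect field)
    size: int           # assumed: stored length of the representation data
    expanded_size: int  # assumed: length after expansion


def parse_rep_ref(text):
    """Parse the first four numeric fields; anything after them is ignored."""
    fields = text.split()
    if len(fields) < 4:
        raise ValueError("too few fields in representation: %r" % text)
    rev, offset, size, expanded_size = (int(f) for f in fields[:4])
    return RepRef(rev, offset, size, expanded_size)
```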

I did not investigate or fix the 'Reading one svndiff window ...' error.

THE SCRIPT TO FIX THE ERRORS
----------------------------

Usage:
  ./fix-repo REPO-DIR START-REVNUM

Files (attached, separately and as .tgz):
  fix-repo # shell script, iterates over rev numbers; calls ...
  fixer/fix-rev.py # finds and fixes errors, using ...
  fixer/find_good_id.py # looks up a node-rev id, ignoring offset
  fixer/__init__.py # empty file, defines this as a Python module

When the script sees a 'Corrupt node-revision' error message, it looks
up the node-rev id ignoring its offset part. If found, it substitutes
the correct full id wherever it occurs in the revision file. It expects
this change to result in a checksum error being reported next, and so it
substitutes the calculated checksum as reported in the error message.
(In fact, it assumes that any checksum error being reported should be
simply corrected in this way.)
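
The lookup and substitution steps might look roughly like the following
Python sketch. It is not the attached find_good_id.py. It assumes that
each node-rev in the target revision file starts with a header line
'id: <node-rev-id>' and that the offset in a correct id equals the byte
position of that line; the function names are made up. The checksum
step is not shown.

```python
import re
from typing import List


def find_good_ids(rev_data: bytes, bad_id: str) -> List[str]:
    """Find ids in the target revision file matching bad_id except for offset.

    Only 'id:' lines whose stated offset equals their actual byte
    position are accepted (an assumption about the file layout).
    """
    prefix = bad_id.rsplit("/", 1)[0]
    pattern = re.compile(
        rb"^id: " + re.escape(prefix.encode()) + rb"/(\d+)$", re.MULTILINE
    )
    good = []
    for m in pattern.finditer(rev_data):
        if int(m.group(1)) == m.start():
            good.append("%s/%d" % (prefix, m.start()))
    return good


def substitute_id(data: bytes, bad_id: str, good_id: str) -> bytes:
    """Crude text replacement of bad_id by good_id in a referencing file.

    Refuses to change the length of the data, and (like any plain text
    search) can produce false matches inside user data.
    """
    if len(bad_id) != len(good_id):
        raise ValueError("ids differ in length; refusing to substitute")
    return data.replace(bad_id.encode(), good_id.encode())
```

Requiring exactly one result from the lookup before substituting would
be a sensible guard against picking the wrong node-rev.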

For the second type of 'Corrupt node-revision' error, I could not find a
simple rule to determine when a node-rev id was wrong in this way so I
hard-coded that one specific substitution into the script.

When the script sees a 'Corrupt representation' error, it searches for
all representations in the target revision and, if exactly one of them
has the expected length, it substitutes the offset of this one.
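
A minimal Python sketch of that search, not the code in fix-rev.py. It
assumes that a representation starts with a header line ('PLAIN', or
'DELTA' optionally followed by three numbers), that the referenced
offset points at this header, and that the data of the stated length is
immediately followed by an 'ENDREP' line. The function names are
hypothetical.

```python
import re
from typing import List, Optional

# Assumed header forms: 'PLAIN', 'DELTA', or 'DELTA <rev> <offset> <size>'.
_REP_HEADER = re.compile(rb"^(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n", re.MULTILINE)


def find_rep_offsets(rev_data: bytes, size: int) -> List[int]:
    """Offsets of all representations whose data is exactly 'size' bytes."""
    offsets = []
    for m in _REP_HEADER.finditer(rev_data):
        body_end = m.end() + size
        # The data must be followed directly by the trailer line.
        if rev_data[body_end:body_end + 7] == b"ENDREP\n":
            offsets.append(m.start())
    return offsets


def pick_rep_offset(rev_data: bytes, size: int) -> Optional[int]:
    """Return the replacement offset only if exactly one candidate exists."""
    candidates = find_rep_offsets(rev_data, size)
    return candidates[0] if len(candidates) == 1 else None
```

Returning nothing when there are zero or several candidates mirrors the
"exactly one of them" condition; it does not verify the checksum.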

LIMITATIONS & IMPROVEMENTS
--------------------------

The script's algorithm is crude and could do with improvement in several
respects if it is to be used more widely.

It doesn't respect checksums. When fixing a node-rev id, it should
update only the corresponding checksum rather than assuming that any
reported checksum error is the sole result of this fix. When fixing a
representation offset, it should ensure the rep that it finds is in fact
the right one, probably by checking the checksum.

Detecting and fixing the second type of 'Corrupt node-revision' error
could probably be automated.

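It doesn't replace a wrong byte-offset if the correct byte-offset has a
different number of digits. I didn't encounter a need for this. This
would be very difficult in the general case, because changing the length
of the text would shift everything after it in the revision file and so
invalidate other byte-offsets that point into that file. It might be
possible to cope with a length reduction by padding with leading zeros,
or some other trick.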

It uses simple text search and replace, whereas it should parse the
revision file to avoid the possibility of false matches of metadata
within user data sections.

The script is currently split into several short files and would be
better as a single script. Or it could perhaps be incorporated into
'fsfsverify.py' or something else.

CONCLUSION
----------

I hope this analysis and script will be useful to other people. I have
heard a few reports now of this kind of corruption, and we still have no
handle on how it happens.

Please let me know any thoughts or questions, the results of any use you
make of it, or anything I can do to help.

- Julian

application/x-shellscript attachment: fix-repo

text/x-python attachment: __init__.py

text/x-python attachment: find_good_id.py

text/x-python attachment: fix-rev.py

application/x-compressed-tar attachment: fix-fsfs-corruption-1.tgz

Received on 2010-10-06 12:22:30 CEST
