[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Error: Found malformed header in revision file

From: Malcolm Rowe <malcolm-svn-dev_at_farside.org.uk>
Date: 2007-02-28 07:03:49 CET

Hi nextgens,

Couple of questions for you below.

On Sat, Feb 24, 2007 at 10:31:25PM +0000, NextGen$ wrote:
> Today I have experienced a weird problem with subversion... I googled
> a bit for it and found
> http://svn.haxx.se/users/archive-2004-12/0453.shtml ... So I joined
> the irc channel where I have been told to mail this list.
>
> Here is what has occured : On the client side I have tried to commit
> svnadmin verify /path/to/repo says:
> [...]
> * Verified revision 11911.
> svnadmin: Found malformed header in revision file
>

So I took a look at this problem on IRC right after nextgens posted this
message.

Interestingly, it's not the 'normal' FSFS corruption problem
('underlapping' repeated writes while writing a delta rep), though it
does involve data duplication. The problem is at the first noderev:

DELTA...
ENDREP
DELTA...
ENDREP
id: 7x2.0.r11912/988
type: file
pred: 7x2.0.r11909/178
count: 55
text: 11912 831 132 42239 29931bab8843c518b26d9ebef4c9a131
cpath: /trunk/freenet/src/freenet/clients/http/StatisticsToadlet.java
copyroot: 0 /

id: 7x2.0.r11912/988
type: file
pred: 7x2.0.r11909/178
count: 55
text: 11912 831 132 42239 29931bab8843c518b26d9ebef4c9a131
cpath: /trunk/freenet/src/freenet/clients/http/StatisticsToadlet.java
copyroot: 0 /

PLAIN...

As you can see, the noderev's been duplicated. The effect of this is to
break all the subsequently recorded offsets for the following noderevs
and the trailer line. Removing one of those duplicates with dd fixed
the problem.

So here's the question - how on earth did that happen? We only ever
write these out to the revision file once, in fs_fs.c:write_final_rev(),
and we get the offset to use just before we write each one.

Okay, so one explanation of why the file offset is wrong might be
because, for buffered files, APR doesn't call lseek() to get the current
file offset, because it can't. It uses it's own concept of the
location, so if that gets out of sync, we'd see what we have here.

How could that have happened? I'm not entirely sure. The sequence of
events here is:

1. We get the offset for the noderev, which has the side effect of
   flushing the APR file buffer to disk. We know this offset is
   correct because the offset for the noderev is correct.
2. We write out the noderev using a series of apr_file_write()'s.
   This puts all the data into the APR file buffer (it's smaller than 4K,
   the APR buffer size).
3. We get the offset for the directory rep (PLAIN), which flushes the
   noderev to disk. We know this offset is incorrect because of the
   offset reported in the text part of the directory noderev (not shown).
   [Hmm, actually I guess we don't know that for sure, but I'm going
   with that theory for the moment.]

I'm pretty sure we're not trying to write out the noderev twice (i.e.
calling write_final_rev() twice for the same file entry), since we
unconditionally update the offset in the node id, and so the two
noderevs above would have different offsets.

So I'm thinking that somewhere in #3, the APR file buffer is written
twice. I can think of only a few ways to achieve that effect, and none
of them seem particularly likely:

a. There's a journalling filesystem in use, and it's decided to replay
   the write of the noderev. I don't think this is how journalling
   filesystems actually work, but I might be wrong.
   [nextgens: What OS and filesystem are you using for the repository?]

b. There's a bug in APR's apr_file_flush() that's causing the same
   buffer to be flushed to disk more than once.

c. There's a bug in APR's apr_file_seek()/setptr() that's causing the same
   buffer to be flushed to disk more than once.
   [nextgens: Can you tell us what version of APR you're using? (as
              precisely as you can)]

d. Something Else.

For b., I can't really see where a bug would be, but I wouldn't mind
someone else taking a look once nextgens confirms which version of APR to
look at. I assume that nowadays we can rely on write(2) not returning
EINTR after a partial write - anyone know if this isn't the case?

For c., I again can't see anything obvious, but the implementation is
non-trivial. Now I know that APR 0.9.7 fixed a problem whereby errors
encountered while flushing during seek weren't reported, but that's why
we raised the minimum APR requirement to 0.9.7... and in any case, I
can't see how that could have this effect.

Regards,
Malcolm

  • application/pgp-signature attachment: stored
Received on Wed Feb 28 07:04:07 2007

This is an archived mail posted to the Subversion Dev mailing list.