[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: FSFS Issue...

From: John Szakmeister <john_at_szakmeister.net>
Date: 2005-12-03 09:58:51 CET

On Friday 02 December 2005 12:39, Malcolm Rowe wrote:
> Hi John,
>
> I've been looking at this issue for the past few days.
>
> I really can't see what's going wrong. I've audited the FSFS write
> code, and it looks fine. Critically, I can't see how to drive
> libsvn_fs_fs/fs_fs.c in such a way as to get the rev files we're seeing.
>
> The rev file is only written once (when the proto-rev file is moved into
> place), and the proto-rev file is only ever appended to (and as a stream:
> so there's no possibility of parts of it being overwritten).

You found the very same things that I did when I audited the code. I don't
see how it could have happened either, at least not on the surface (i.e., in
what we're doing).

> It's possible that there's a problem re-reading and writing the noderev
> hash, but if there is, I can't see it (and it would have to affect
> transaction noderev's only). Similarly, there could possibly be a
> problem in the code that rewrites a mutable (rev -1) noderev hash to
> point it to the current revision number - I've not checked that, but it
> doesn't seem particularly likely.

/me agrees.

> However, there _must_ be something wrong in the FSFS code, because I can't
> explain how the 'text' line is written out with the wrong rep-offset.
> That offset is only set in rep_write_get_baton(), just before we write
> out the DELTA line (and we're not getting a repeated DELTA).

Yep.

> I've looked at pool management, and I can't see anything wrong in
> that area. I've not yet run valgrind over it though, so I could have
> easily missed something. (Could someone point me towards instructions
> for enabling pool debugging please?)

Philip has pointed me to --enable-pool-debug for APR in the past. Admittedly,
I haven't had the time lately to go and do it though.

> However, the numbers don't really look random enough for this to be
> simple memory corruption: as you point out, the 'text' offset always
> points inside the delta rep (whereas I'd expect it to point into infinity
> if it was just a random overwrite).

Right.

> I think there must (also?) be something wrong in the svndiff code, since,
> as you pointed out, the windows are repeating. This is a dump from your
> r94 revision:
[snip lots of window output]

It looks like we're both decoding the same thing. :-)

>
> (For this file, if the expanded size is correct, I'd expect 0x62 full size
> (target size 0x19000) windows, and one smaller window).
>
> The first 0x60 windows look completely regular (I assume the source window
> dropping to zero is because the source stream emptied; and I assume the
> large data sizes in the final windows are because the source window was
> not useful for a diff in those cases).
>
> Window 0x61 appears completely broken: it's moved backward in the source
> stream, it's got a zero target length, a zero instruction size, and an
> enormous data size. My hacky program doesn't think there's enough data
> left in the rep to get to the next window. (Could you double-check
> my decoding?). Possibly worth noting that the 'text' offset points
> inside window 0x60, which is the last one that looks good, so window
> 0x61 might be just junk.
>
> It's also interesting that if we subtract the 'text' size (1347483) from
> the ENDREP offset (2596801), we get 1249318, which is 19 bytes after the
> 'text' offset (1249299), which is exactly the correct number of bytes
> for the DELTA line, implying that b->rep_offset and b->delta_start _both_
> had to be wrong, but wrong in the correct way, if you see what I mean.

Yep. I actually noted this on the dev@ list a while back. This particular
file was extremely weird. You can also find a large repeated block of data
in the svndiff too. Here's the gory details:
  http://svn.haxx.se/dev/archive-2005-09/0409.shtml

Note the size of the repeated block. Does your head hurt yet? :-)

> Of course, the only DELTA line (and svndiff0 header) appears at offset
> zero.

Yep. In fact, it's the only modified file in the commit. Makes it more
interested as to why this error occurred. There wasn't a great deal of
files, and it wasn't a large file either. :-/

>
> Other stuff:
> I've taken a look at the Red Hat packages of Subversion 1.2.1, and they
> look fine - there are build patches, but nothing that looks relevant.

Good to know.

> APR 0.9.x looks fine as well (there are some recent bugfixes, but nothing
> that looks significant, whatever version people were using). I haven't
> checked Red Hat's APR package though, so they might potentially have
> local patches in it.
>
> The APR file-handling code looks solid: the buffering is a bit complex,
> but it doesn't look like it should be causing a problem (it's mostly
> thin wrappers around open/read/etc). The only bugfix in recent memory
> won't affect us.

I'm wondering if we even need buffering when writing the rev file? It
appeared that everything was being processed sequentially, so there's no real
benefit to the buffering.

> I wondered whether this might be caused by a disk-write error somehow
> (since you mentioned failing hardware in some cases): if a transient
> error caused a failure at transaction commit time, I wondered if a
> higher-level function could get an error and try to re-run the commit.
> However, even if that was the case, I can't explain the behaviour we're
> seeing now (and I think that disk-write errors probably would just kill
> the commit completely).

/me agrees. I think we'd die, but even if we didn't, I'd expect things to be
corrupted in a very different way. :-/

>
> In short, I'm a bit stumped. I think we need more real data (the output
> from your tool is useful, but it only goes so far). What I'd really
> like is a complete un-recovered repository, and possibly an uncorrupted
> copy of the corrupt revision's file if possible. And a pony.

I'd like that as well. I can work with one guy to get you access to a
complete, un-recovered repository, but he doesn't have an uncorrupted copy of
the file in question. :-( I'm not sure it's worth the trouble unless we can
tickle the failure ourselves, and I don't believe we'll be able to do that
without an uncorrupted copy of the file.

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Dec 3 10:00:44 2005

This is an archived mail posted to the Subversion Dev mailing list.