On Friday 02 December 2005 12:39, Malcolm Rowe wrote:
> Hi John,
> I've been looking at this issue for the past few days.
> I really can't see what's going wrong. I've audited the FSFS write
> code, and it looks fine. Critically, I can't see how to drive
> libsvn_fs_fs/fs_fs.c in such a way as to get the rev files we're seeing.
> The rev file is only written once (when the proto-rev file is moved into
> place), and the proto-rev file is only ever appended to (and as a stream:
> so there's no possibility of parts of it being overwritten).
You found the very same things that I did when I audited the code. I don't
see how it could have happened either, at least not on the surface (i.e., in
what we're doing).
> It's possible that there's a problem re-reading and writing the noderev
> hash, but if there is, I can't see it (and it would have to affect
> transaction noderev's only). Similarly, there could possibly be a
> problem in the code that rewrites a mutable (rev -1) noderev hash to
> point it to the current revision number - I've not checked that, but it
> doesn't seem particularly likely.
> However, there _must_ be something wrong in the FSFS code, because I can't
> explain how the 'text' line is written out with the wrong rep-offset.
> That offset is only set in rep_write_get_baton(), just before we write
> out the DELTA line (and we're not getting a repeated DELTA).
> I've looked at pool management, and I can't see anything wrong in
> that area. I've not yet run valgrind over it though, so I could have
> easily missed something. (Could someone point me towards instructions
> for enabling pool debugging please?)
Philip has pointed me to --enable-pool-debug for APR in the past. Admittedly,
I haven't had the time lately to go and do it though.
> However, the numbers don't really look random enough for this to be
> simple memory corruption: as you point out, the 'text' offset always
> points inside the delta rep (whereas I'd expect it to point into infinity
> if it was just a random overwrite).
> I think there must (also?) be something wrong in the svndiff code, since,
> as you pointed out, the windows are repeating. This is a dump from your
> r94 revision:
[snip lots of window output]
It looks like we're both decoding the same thing. :-)
> (For this file, if the expanded size is correct, I'd expect 0x62 full size
> (target size 0x19000) windows, and one smaller window).
> The first 0x60 windows look completely regular (I assume the source window
> dropping to zero is because the source stream emptied; and I assume the
> large data sizes in the final windows are because the source window was
> not useful for a diff in those cases).
> Window 0x61 appears completely broken: it's moved backward in the source
> stream, it's got a zero target length, a zero instruction size, and an
> enormous data size. My hacky program doesn't think there's enough data
> left in the rep to get to the next window. (Could you double-check
> my decoding?). Possibly worth noting that the 'text' offset points
> inside window 0x60, which is the last one that looks good, so window
> 0x61 might be just junk.
> It's also interesting that if we subtract the 'text' size (1347483) from
> the ENDREP offset (2596801), we get 1249318, which is 19 bytes after the
> 'text' offset (1249299), which is exactly the correct number of bytes
> for the DELTA line, implying that b->rep_offset and b->delta_start _both_
> had to be wrong, but wrong in the correct way, if you see what I mean.
Yep. I actually noted this on the dev@ list a while back. This particular
file was extremely weird. You can also find a large repeated block of data
in the svndiff too. Here's the gory details:
Note the size of the repeated block. Does your head hurt yet? :-)
> Of course, the only DELTA line (and svndiff0 header) appears at offset
Yep. In fact, it's the only modified file in the commit. Makes it more
interested as to why this error occurred. There wasn't a great deal of
files, and it wasn't a large file either. :-/
> Other stuff:
> I've taken a look at the Red Hat packages of Subversion 1.2.1, and they
> look fine - there are build patches, but nothing that looks relevant.
Good to know.
> APR 0.9.x looks fine as well (there are some recent bugfixes, but nothing
> that looks significant, whatever version people were using). I haven't
> checked Red Hat's APR package though, so they might potentially have
> local patches in it.
> The APR file-handling code looks solid: the buffering is a bit complex,
> but it doesn't look like it should be causing a problem (it's mostly
> thin wrappers around open/read/etc). The only bugfix in recent memory
> won't affect us.
I'm wondering if we even need buffering when writing the rev file? It
appeared that everything was being processed sequentially, so there's no real
benefit to the buffering.
> I wondered whether this might be caused by a disk-write error somehow
> (since you mentioned failing hardware in some cases): if a transient
> error caused a failure at transaction commit time, I wondered if a
> higher-level function could get an error and try to re-run the commit.
> However, even if that was the case, I can't explain the behaviour we're
> seeing now (and I think that disk-write errors probably would just kill
> the commit completely).
/me agrees. I think we'd die, but even if we didn't, I'd expect things to be
corrupted in a very different way. :-/
> In short, I'm a bit stumped. I think we need more real data (the output
> from your tool is useful, but it only goes so far). What I'd really
> like is a complete un-recovered repository, and possibly an uncorrupted
> copy of the corrupt revision's file if possible. And a pony.
I'd like that as well. I can work with one guy to get you access to a
complete, un-recovered repository, but he doesn't have an uncorrupted copy of
the file in question. :-( I'm not sure it's worth the trouble unless we can
tickle the failure ourselves, and I don't believe we'll be able to do that
without an uncorrupted copy of the file.
To unsubscribe, e-mail: email@example.com
For additional commands, e-mail: firstname.lastname@example.org
Received on Sat Dec 3 10:00:44 2005