On Sunday 27 November 2005 18:11, Malcolm Rowe wrote:
> On Tue, Nov 22, 2005 at 06:33:34PM -0500, John Szakmeister wrote:
> > [lots of words about a bug in FSFS]
> Ok, there's quite a lot of information there.
> I've a couple of thoughts, some of which are probably obvious:
> * The fact that a lot of the problems were on Subversion 1.2.1, Redhat,
> and mod_dav_svn may be more reflective of the usage of those things in
> general, rather than being an indicator of any possible correlation with
> this problem.
> - Just to eliminate the possibility, have you any idea whether the
> RedHat/Fedora Subversion packages contain any local patches?
I have no idea... I don't use either one of those distributions, and I haven't
had any time to go and chase down what they're doing. :-(
> * I'm not sure I fully understand the problem yet, but I can't see how
> failing hardware could necessarily have caused us to fail in the way I
> think you're indicating.
/me nods. I absolutely agree, which is why I think we have an issue in the
FSFS code. :-)
> * There don't appear to have been any major changes to FSFS since 1.2.0,
> so this bug, if it is in Subversion itself, probably still exists.
There was only one that I can remember, and that was a dir-caching bug. IIRC,
it only applied to directories that were deleted, so I can't really see how
that would have anything to do with this particular situation.
> * You mentioned that you managed to get hold of the repository in some
> cases. Did you try re-running the transaction to see if the problem
> was reproducible or not? (in the cases where you were able to restore
> the file's contents, naturally).
I only had the repository, not the original data. In fact, in *every* case,
they found out they had an issue long after the fact. It crept up when the
skip delta algorithm finally got around to using the revision containing the
corrupted file. :-( So no one had the original data around that caused the
problem. Otherwise, I would have definitely tried to replay the transaction.
> * You mentioned that it was only delta representations for binary files
> that were affected. Were they particularly large? Were the original
> files compressed? (If they were compressed, the self-compressed deltas
> would tend to be close to the size of the file, I suspect - and depending
> upon the change, regular deltas might be as well)
Lots of questions there. No, not all of the files were particularly large.
In one case, the entire commit was about 2.5MB, which consisted solely of
the single file that was self compressed and corrupted (it was a MS Word
file). In another case, one of the corrupted commits was 36MB in size. The
corrupted file was supposed to have a delta running 2.6MB in length, and the
expanded length of the file was more on the order of 4.6MB. Again, it was
self compressed. In another revision of the same repository, the commit was
only 27MB in size. This time the rep was deltified against a previous
version. The affected file was supposed to have an svndiff running about
2.7MB in length and have an expanded size of 27.5MB.
Actually, the delta algorithms we've been using are binary delta algorithms,
so they do pretty well if the file wasn't compressed to begin with.
> - I wonder if the fact that it was binary files was just due to the
> fact that, proportionally, they'd be more likely to take up most of
> the space in a rev file.
Could be. But I've got nothing to suggest that it's solely triggered by
binary files, so I'm keeping an open mind about the problem. :-) But that
was the trend that I saw.
> To the specific problem:
> > * In every case there was an extra block of data present in the svndiff.
> > In one case, it appeared that the extra data was actually a repeat of a
> > block elsewhere in the stream.
> > * In every case the actual svndiff contents were fine (there were no bad
> > instructions). The windows themselves seemed to be complete.
> > * In every case, all other offsets within the file pointed exactly where
> > they should (meaning that somehow the data was there when we wrote the
> > revision out).
> > * In one case, I was actually able to recover the contents of the file
> > completely (the very start of the svndiff stream was there).
> I'm not _quite_ sure I get exactly what the problem was. When you say
> that there was an extra block of data in the svndiff, do you mean in the
> svndiff itself, or in the representation in the rev file? In other words,
> was it the DELTA-ENDREP that was corrupt (containing a valid svndiff
> and something else), or was it the svndiff itself that was corrupt
> (containing garbage after the used 'new data', or similar).
> Where was the extra block of data? Before, after, or inside the correct
> data? Did it overwrite any valid data, or was it just 'extra'? (I
> guess if you couldn't recover the representation in all the cases,
> it was an overwrite?)
Good questions. They're actually really hard to answer though. There was
only one instance in which the entire svndiff was found. In this case, the
extra block of data appeared before the real svndiff. In the other cases,
there were more windows than expected given the size of the delta in the text
rep field. As for whether it overwrote any data, I can't say. :-( I also
can't speak as to whether the extra block came before, after, or in the
middle of the svndiff window. What I can say is that the extra data all
seemed to end on proper window boundaries.
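The "more windows than expected" check can be sketched roughly as follows. This is a minimal illustration, not the tool mentioned in this thread. It assumes an svndiff version 0 stream (the "SVN\0" header followed by windows, each starting with five variable-length integers), and it assumes the length recorded in the text rep field counts bytes from the start of that stream. The names `read_varint`, `walk_windows`, and `windows_beyond` are hypothetical.

```python
def read_varint(buf, pos):
    """Decode one svndiff variable-length integer starting at buf[pos].

    Each byte carries 7 bits, most significant group first; the high bit
    is set on every byte except the last one.
    """
    value = 0
    while True:
        byte = buf[pos]  # raises IndexError if the input is truncated
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def walk_windows(buf):
    """Return (offset, size) for every window in an svndiff0 stream."""
    if buf[:4] != b"SVN\x00":
        raise ValueError("not an svndiff version 0 stream")
    pos = 4
    windows = []
    while pos < len(buf):
        start = pos
        header = []
        for _ in range(5):
            value, pos = read_varint(buf, pos)
            header.append(value)
        # Header fields: source view offset, source view length,
        # target view length, instructions length, new data length.
        _sview_off, _sview_len, _tview_len, ins_len, new_len = header
        pos += ins_len + new_len
        if pos > len(buf):
            raise ValueError("window at offset %d runs past the end" % start)
        windows.append((start, pos - start))
    return windows


def windows_beyond(buf, declared_len):
    """Windows that start at or after the length the text rep declared."""
    return [w for w in walk_windows(buf) if w[0] >= declared_len]
```

If `windows_beyond` returns anything for a representation, the stream holds more complete windows than its recorded length accounts for, which matches the symptom described above.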
> In one case, the extra data was a repeat (of what kind of data?).
> What was it in the other cases?
The case where there was repeated data, it was one of the sections of a DLL
that was repeated (debugging symbol section IIRC). I actually have the file
around still, but I'm too tired and too lazy to look right now. :-)
In the other cases, I didn't really find a correlation with any of the data
that was a part of the file, or the commit. It seemed... more random.
> In addition to the extra data, what other problems did you see? I think
> you mentioned that the node-rev had a 'text' <offset> that pointed in
> the wrong place? What did it point to, if anything?
I've been tying both problems together. In every case where this extra block
of data was present, the text rep pointed past what appeared to be the start
of the svndiff. Often it pointed to some place inside of one of the
svndiff windows. I'm fairly certain the amount of displacement was directly
proportional to the size of the extra block, but I haven't had time nor
sufficient access to broken repositories to prove that theory.
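For anyone who wants to measure that displacement on a damaged rev file, here is a rough sketch. It assumes a representation begins with a header line of the form "PLAIN", "DELTA", or "DELTA <rev> <offset> <length>", and that the offset in a node-rev's text field should land exactly on that header. The function name `classify_offset` is hypothetical, and the backward search is only a heuristic: the bytes "DELTA\n" can also occur by chance inside binary data.

```python
import re

# Representation header lines, as assumed in the note above.
REP_HEADER = re.compile(rb"(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n")


def classify_offset(rev, offset):
    """Say where a text rep offset lands inside the bytes of a rev file.

    Returns ("header", 0) when the offset sits on a representation header,
    ("displaced", n) when the nearest preceding header starts n bytes
    earlier, and ("no-header", None) when no header precedes the offset.
    """
    if REP_HEADER.match(rev, offset):
        return ("header", 0)
    last = None
    # Only consider headers that end before the offset.
    for match in REP_HEADER.finditer(rev, 0, offset):
        last = match
    if last is None:
        return ("no-header", None)
    return ("displaced", offset - last.start())
```

Comparing the reported displacement against the size of the extra block in each broken revision would be one way to test the proportionality theory, given access to enough samples.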
> I don't understand 'all other offsets pointed where they should' -
> what other offsets are you referring to? (that would indicate that the
> data was written correctly originally).
What I meant was that every other node rep in the file had valid offsets for
their text and prop reps. All this really means is that this extra block of
data was present during the write. Otherwise, we couldn't account for the
extra svndiff data that's present (all the other offsets would have been
written as if the data wasn't there).
> I've spent a while tonight taking a look at the FSFS write process --
> and it looks pretty straightforward. Particularly, I can't see how
> 'DELTA' can be immediately followed by anything other than the start of
> an svndiff stream, nor how the offsets in 'text:' can be anything other
> than to what we've already written.
Yep, I've been through it several times myself. I don't see it either. :-(
> The only thing I can really think of is that we're either corrupting one
> of the structures in memory - a pool lifetime issue, maybe? - or that
> we're corrupting the file when we rewrite it, at transaction commit time.
> Neither looks particularly likely.
I agree. The problem right now is reproducibility. I've tried everything I
can think of. I've even run the stress test for days to no avail.
> > I have notes if you want to see them. :-)
> Ok, yes. It might be a waste of time (we might not be able to work out
> what broke), but then again, it might not.
There's probably still some stuff that's only in my head, but I'll give you
what I have. I also have some logs from a tool that I wrote to help fix
these issues when I didn't have access to the repository. It basically dumps
all the data structures in the rev file. You should be able to at least
"see" which file was affected. Unfortunately, I was only allowed to keep a
corrupted rev file from one person. I had to remove another guy's
repository from my machine after I fixed it. :-( That's actually been a real
problem in trying to diagnose this issue. I only had access to the rev file
in one case. I got access to the entire repository in another. And I got
nothing in the remaining two. I wrote my little tool to help locate the
issue and walked them through the repair process via email. :-(
I'm out of steam tonight, but I'll roll everything I can up and post it
somewhere you can get to. Expect to see a private mail with that URL some
time tomorrow morning. It'll be good to have another set of eyes looking at
this.
Received on Mon Nov 28 02:47:06 2005