On Mon, Jun 05, 2000 at 12:57:59PM -0400, Jonathan S. Shapiro wrote:
> > > We have a significant disagreement here. In my opinion, the archive file
> > > format is the business of the archive, and it's none of the user's
> > > business whether that format is binary, ASCII, or Morse code.
> > I agree with the first half of that sentence but not the second. It
> > is the user's business. Why? Most obviously, because the repository
> > *will* get corrupted. Not because of filesystem or disk problems, but
> > because of bugs in the VC system. It is infinitely easier to debug a
> > text file.
> First, let me say that I agree with you -- the VC system definitely will
> corrupt things on occasion, and recoverability is a serious concern. I'm not
> altogether convinced that text formats make recoverability inherently
> easier. I think that it depends a lot on the complexity of the file format.
> In the DCMS case the file format is simple enough that I'm not much
> concerned about software-generated corruption errors in the metadata, and I
> don't really expect that binary vs. text will have much of an impact on
> errors in the content.
> That said, let me add that I went for binary format for exactly the reason
> that you prefer ascii: concern about corruption. My experience suggests that
> the software is only the *second* most likely source of corruption.
> Corruption most commonly comes from having users who think they know what is
> going on edit things that they shouldn't touch. Binary formats tend to
> discourage this, and eliminate the need to deal with unpredictable string
> length management in the lexer (which in turn eliminates a major class of
> bugs). Also, binary formats eliminate the need for all of the programs to be
> neutral to line-end conventions. Finally, binary formats eliminate the need
> for a layer of canonicalization rules that must be applied to generate
> reproducible cryptographic hashes.
All these are fair points. I will say that you seem to be using a stricter
definition of 'text' than I am. When I say 'text' I mean simply 7-bit ASCII,
no control characters. In particular, it would be perfectly fine to insist
on Unix-style line endings, or design the file such that you always know
how long the strings are -- I agree that's a major class of bugs it would be
nice to avoid. It doesn't have to be particularly human-readable either.
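To make that concrete, here is a rough sketch of one way such a format could look: each field is written as a decimal length, a colon, the 7-bit ASCII payload, and a single LF. The layout and the names encode_record and decode_record are invented for illustration; they are not taken from DCMS or any existing tool.

```python
def encode_record(fields):
    """Encode a list of str fields as length-prefixed 7-bit ASCII lines.

    Hypothetical layout: <decimal length>:<payload>LF for each field.
    """
    out = []
    for field in fields:
        # Raises UnicodeEncodeError if the field is not 7-bit ASCII.
        data = field.encode("ascii")
        if any(b < 0x20 or b == 0x7F for b in data):
            raise ValueError("control characters are not allowed")
        out.append(b"%d:%s\n" % (len(data), data))
    return b"".join(out)


def decode_record(blob):
    """Decode the output of encode_record back into a list of str fields."""
    fields = []
    pos = 0
    while pos < len(blob):
        # The reader always knows the string length before reading it.
        colon = blob.index(b":", pos)
        length = int(blob[pos:colon].decode("ascii"))
        start = colon + 1
        end = start + length
        # Insist on a Unix-style line ending after every field.
        if blob[end:end + 1] != b"\n":
            raise ValueError("field at offset %d is not terminated by LF" % pos)
        fields.append(blob[start:end].decode("ascii"))
        pos = end + 1
    return fields
```

Because the reader never scans for a terminator inside the payload, the lexer has no open-ended string handling, yet the file can still be paged through with ordinary text tools.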
I'm glad to hear you agree with me that strong checksums are necessary. I'd
settle for CRC or adler32, but SHA sure would be nice.
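For comparison, all three kinds of checksum are easy to compute with the Python standard library. This is only a sketch: reading "SHA" as SHA-1 is my assumption, and the helper names checksums and verify_sha1 are made up for the example.

```python
import hashlib
import zlib


def checksums(payload):
    """Return CRC-32, Adler-32 and SHA-1 of a bytes payload as hex strings."""
    return {
        # Mask to 32 bits so the result is an unsigned value everywhere.
        "crc32": "%08x" % (zlib.crc32(payload) & 0xFFFFFFFF),
        "adler32": "%08x" % (zlib.adler32(payload) & 0xFFFFFFFF),
        # Assumption: "SHA" is taken to mean SHA-1 here.
        "sha1": hashlib.sha1(payload).hexdigest(),
    }


def verify_sha1(payload, expected_hex):
    """Return True if the payload still matches the recorded SHA-1 digest."""
    return hashlib.sha1(payload).hexdigest() == expected_hex.lower()
```

CRC-32 and Adler-32 only catch accidental damage, while a cryptographic hash also makes it impractical to produce a different payload with the same digest, which is why it is the nicer option for an archive.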
The main reason I'm leery of binary files is that when they get mangled, the
dump utility usually can't cope any better than the real programs can, and
you wind up grovelling through hex dumps. But with a sufficiently robust
file format it might be possible to avoid that...
Incidentally, where can I find more information about your DCMS?
> > A secondary but still important reason is that a text file can be
> > embedded in email with no special care...
> I've seen enough mailers mangle enough text in enough creative ways that I
> would file this claim under "plausible, but falsified by bitter experience."
> Independent of that, I question the wisdom of using email as a transport for
> this sort of thing -- large emails don't survive modem line loss rates, and
> small emails don't lend themselves to ensuring consistency across
> collections of changes.
I'm mainly concerned with being able to mail a short (less than a few hundred
lines, no major file juggling) patch to a client, in a format that they can
inspect visually before they do anything with it. Context diffs do this
reasonably well; whatever we come up with should be no worse.
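As a baseline for "no worse than a context diff", here is a small sketch that produces one with Python's difflib. The function name make_context_diff and the ".orig" suffix on the old file name are just conventions chosen for the example.

```python
import difflib


def make_context_diff(old_text, new_text, name="file.txt"):
    """Return a context diff between two versions of a text file as one string."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.context_diff(
        old_lines,
        new_lines,
        fromfile=name + ".orig",  # hypothetical label for the old version
        tofile=name,
        n=3,  # lines of context around each change
    )
    return "".join(diff)


if __name__ == "__main__":
    print(make_context_diff("a\nb\nc\n", "a\nB\nc\n"), end="")
```

The output is plain text that a recipient can read in a mail client before applying it, with changed lines marked by a leading "!".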
> Please understand: I don't dispute the merit of a text representation. I've
> just made a different engineering choice and I thought it might have value
> to articulate some of why I did this.
I do understand. Thanks for explaining so thoroughly.
Received on Sat Oct 21 14:36:05 2006