> > We have a significant disagreement here. In my opinion, the archive file
> > format is the business of the archive, and it's none of the user's
> > business whether that format is binary, ascii, or morse code.
> I agree with the first half of that sentence but not the second. It
> is the user's business. Why? Most obviously, because the repository
> *will* get corrupted. Not because of filesystem or disk problems, but
> because of bugs in the VC system. It is infinitely easier to debug a
> text file.
First, let me say that I agree with you -- the VC system definitely will
corrupt things on occasion, and recoverability is a serious concern. I'm not
altogether convinced that text formats make recoverability inherently
easier. I think that it depends a lot on the complexity of the file format.
In the DCMS case the file format is simple enough that I'm not much
concerned about software-generated corruption errors in the metadata, and I
don't really expect that binary vs. text will have much of an impact on
errors in the content.
That said, let me add that I went for binary format for exactly the reason
that you prefer ascii: concern about corruption. My experience suggests that
the software is only the *second* most likely source of corruption.
Corruption most commonly comes from having users who think they know what is
going on edit things that they shouldn't touch. Binary formats tend to
discourage this, and eliminate the need to deal with unpredictable string
length management in the lexer (which in turn eliminates a major class of
bugs). Also, binary formats eliminate the need for all of the programs to be
neutral to line-end conventions. Finally, binary formats eliminate the need
for a layer of canonicalization rules that must be applied to generate
reproducible cryptographic hashes.
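To make the canonicalization point concrete, here is a small sketch (in Python, with an invented two-field record purely for illustration) of why text serialization needs a normalization pass before hashing, while a fixed binary layout does not:

```python
import hashlib
import struct

# Hypothetical metadata record; the field names are illustrative only.
# author = "shap", revision = 42

# Two text serializations of the *same* logical object hash differently
# unless a canonicalization layer normalizes line endings, whitespace, etc.
text_a = "author: shap\nrevision: 42\n"
text_b = "author: shap\r\nrevision: 42\r\n"   # identical data, CRLF line ends
assert hashlib.sha1(text_a.encode()).hexdigest() != \
       hashlib.sha1(text_b.encode()).hexdigest()

# A fixed binary layout admits exactly one byte sequence per object,
# so hashing the stored bytes is reproducible by construction.
binary_a = struct.pack("!4sI", b"shap", 42)
binary_b = struct.pack("!4sI", b"shap", 42)
assert hashlib.sha1(binary_a).hexdigest() == hashlib.sha1(binary_b).hexdigest()
```

The point is not that text cannot be hashed reliably, only that doing so forces an extra canonicalization layer whose rules every implementation must get byte-identical.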
In the end, I'm less concerned about the file damage being recoverable than
I am about the ability to detect corruption in the file. I'm more concerned
about having sufficient redundancy in the repository metadata that it can be
successfully reconstructed when things like directories have been lost.
> A secondary but still important reason is that a text file can be
> embedded in email with no special care...
I've seen enough mailers mangle enough text in enough creative ways that I
would file this claim under "plausible, but falsified by bitter experience."
Independent of that, I question the wisdom of using email as a transport for
this sort of thing -- large emails don't survive modem line loss rates, and
small emails don't lend themselves to ensuring consistency across
collections of changes. While there is a need for human-readable patches, I
think that it's possible to build better tools than the ones we are using
today. I'm trying to do so, and we'll just have to see how it works out in
practice. I may be quite wrong about whether the tools I have in mind are
"better" in the field.
> You're confusing treating all files the same with storing the metadata
> in binary. It is easy to treat all files the same and still store the
> metadata as text.
Depends on how the metadata is stored. Some of the metadata should be stored
in the repository with the object itself, for exactly the reasons of
recoverability that you mention. In that case, its format tends to be
dictated by the object format.
I suppose the bottom line is that I have more confidence in my ability to
get serialization and deserialization right in binary form, where I can
semi-rely on the fact that the user hasn't dicked with things. Also, I'm
relying on cryptographic hashes for naming, and I really want to avoid the
need to read an ascii form, canonicalize the resulting object structure, and
then compute a hash on something that isn't the persistent representation.
Please understand: I don't dispute the merit of a text representation. I've
just made a different engineering choice and I thought it might have value
to articulate some of why I did this.
Received on Sat Oct 21 14:36:05 2006