
svndiff and XML

From: Greg Hudson <ghudson_at_mit.edu>
Date: 2000-10-11 04:01:06 CEST

(Thanks for the apr tip, Greg. Worked fine. Perhaps autogen.sh
should mention that at the end.)

(I decided not to remove the pool argument from the read and write
functions from svn_io.h after all. It's kind of convenient to have a
pool around for temporary allocations if your baton argument is
something simple like an apr_file_t *. If someone else wants to rip
them out, that's okay with me.)

Okay, the point. Guess what I found out you can't put in an XML
document? That's right, the byte 0. A byte commonly used in the
svndiff format (and the vcdiff format, if it matters). Embedding a
raw 0 byte yields the expat error "not well-formed" and embedding a
"&#0;" yields the error "reference to invalid character number". And
expat is correct according to the XML spec. (Have I ever mentioned
that I think XML is totally inappropriate for application protocols?
Well, that decision was made before I was here to get us into a
protracted argument about it, which is possibly for the best.)
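
A minimal sketch of the two failures, using Python's xml.parsers.expat
binding to expat. The helper name expat_error and the <delta> element
are made up for illustration, and the exact line/column suffix in the
messages may vary:

```python
import xml.parsers.expat


def expat_error(document: bytes) -> str:
    """Parse `document` with expat; return the error text, or "" on success."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return ""


# A raw 0 byte in character data (hypothetical <delta> element).
raw_nul = expat_error(b"<delta>\x00</delta>")

# The same byte written as a numeric character reference.
char_ref = expat_error(b"<delta>&#0;</delta>")

print(raw_nul)
print(char_ref)
```

Both documents are rejected, so neither spelling of the byte can carry
svndiff data through the parser.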

Also, data with the high bit set will be interpreted as UTF-8 stuff. I
hadn't thought about this before. That may become an issue for
filenames, and perhaps escape_string() should turn high-bit data into
character refs or into appropriate UTF-8 stuff. Or maybe we should
declare that filenames are UTF-8 strings and just verify that they
contain proper UTF-8 data. I'm not sure. But we can ignore that
issue for Milestone 1.
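
The two filename options can be sketched roughly as follows. Both
function names are hypothetical, and the character-reference variant
assumes the bytes are Latin-1, which is only one possible reading of
"high-bit data". It also skips the quoting of "<" and "&" that a real
escape_string() would still need:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Option 1: declare filenames to be UTF-8 and just verify them."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def escape_high_bit(name: bytes) -> str:
    """Option 2: turn each high-bit byte into a character reference.

    Assumption: bytes are Latin-1, so byte value == code point.
    """
    return "".join(
        chr(byte) if byte < 0x80 else "&#%d;" % byte
        for byte in name
    )


print(is_valid_utf8(b"caf\xc3\xa9"))   # UTF-8 encoded e-acute
print(is_valid_utf8(b"caf\xe9"))       # bare Latin-1 byte, not valid UTF-8
print(escape_high_bit(b"caf\xe9"))
```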

Anyway, we have to encode our svndiff data in order to put it in XML
deltas. The two reasonable options I can see are quoted-printable and
base64. A comparison:

        * base64
          - Output expansion is always close to 33%.
          - Output doesn't include any XML-objectionable characters,
            so no need for an escape_string() pass afterwards.
          - We would no longer be sensitive to cosmetic whitespace,
            since a base64 decoder can skip it.
          - Output is totally unreadable to humans.

        * quoted-printable
          - Output expansion is 0-200% depending on input. I would
            guess about 80% for svndiff data.
          - XML-objectionable characters can be quoted if we choose to
            in our implementation.
          - Text parts of input (the svndiff "new data" and header
            come to mind) remain readable in output.
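
A rough way to check the expansion figures above, using Python's base64
and quopri modules as stand-ins for whatever encoder we end up writing.
The two samples are made up and are not real svndiff data, so the
quoted-printable numbers only bracket the guess above:

```python
import base64
import quopri


def expansion_percent(original: bytes, encoded: bytes) -> float:
    """Size growth of `encoded` relative to `original`, in percent."""
    return 100.0 * (len(encoded) - len(original)) / len(original)


def compare(sample: bytes) -> tuple:
    """Return (base64 expansion, quoted-printable expansion) for `sample`."""
    b64 = base64.b64encode(sample)
    qp = quopri.encodestring(sample)
    return expansion_percent(sample, b64), expansion_percent(sample, qp)


# Made-up samples: one mostly text, one covering every byte value.
text_like = b"new data line\n" * 20
binary_like = bytes(range(256))

for label, sample in (("text-like", text_like), ("binary-like", binary_like)):
    b64_pct, qp_pct = compare(sample)
    print("%-12s base64 %+.0f%%  quoted-printable %+.0f%%"
          % (label, b64_pct, qp_pct))

# base64 output never contains XML-objectionable bytes ...
encoded = base64.b64encode(binary_like)
objectionable = set(b"<>&\"'\x00") & set(encoded)

# ... and this decoder skips cosmetic whitespace in its input.
decoded = base64.b64decode(b"AAEC\n  Aw==")
```

base64 stays near 33% for both samples, while quoted-printable swings
from almost nothing on text to well past base64 on binary input.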

Speak up if you care. I think I will start work on a base64 encoder
this evening, but much of the work would carry over to
quoted-printable if people think that's best.
Received on Sat Oct 21 14:36:10 2006

This is an archived mail posted to the Subversion Dev mailing list.
