
Re: Poll: do we really need newline conversion?

From: Zack Weinberg <zack_at_codesourcery.com>
Date: 2001-12-11 22:34:17 CET

On Tue, Dec 11, 2001 at 08:52:53PM -0000, Barry Scott wrote:
> Yes you do: mandatory feature.
>
> I work between Unix and Win32. Tools on Unix choke on CR/LF
> * /bin/sh
> * g++
> try:
> #define fred \
> 12
> and see a syntax error.

Fixed in GCC 3.0, but it remains a valid point.

> What is the current state of play in the Mac world? They use CR and
> choked on LF or CRLF text files in the past.

OS X uses the Unix convention.

I thought about it a bit more:

1. A file may be text in the sense of being a stream of characters in
   a standard encoding, with 0x0A and/or 0x0D bytes as newline
   indicators, without having a MIME category of "text". For
   instance, RFC 3023 specifies "application/xml" if the XML source is
   not intended to be human-readable. Therefore, being "text" or
   "binary" needs to be an orthogonal property to the MIME type.

2. Newline conversion is a special case of general character encoding
   conversion. Consider a document which is written in a combination
   of English and Russian. Some of the collaborators on this document
   have Unicode-capable editors, and the initial revision was checked
   in encoded in UTF-8. However, there are some writers who cannot or
   will not abandon KOI8-R. SVN must convert the file on checkout or
   they can't even read it.

   It is probably appropriate to convert back before generating any
   diffs or checking in, because repository operations become
   difficult if the checked-in file's encoding isn't consistent across
   revisions. (But I can see the data integrity issue arguing against
   that.)

   We don't need this generality for 1.0 but it would be good if the
   scheme we eventually settle on could be extended to support the
   general case later.
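
To make point 2 concrete, here is a minimal Python sketch of the
UTF-8 / KOI8-R round trip and of the data integrity concern. The
function names (reencode, survives_round_trip) are hypothetical and
are not part of Subversion; the sketch only assumes Python's standard
codecs.

```python
def reencode(data: bytes, src_charset: str, dst_charset: str) -> bytes:
    """Decode bytes from one charset and re-encode them in another.

    Raises UnicodeEncodeError if a character has no representation in
    dst_charset, which is the data integrity risk mentioned above.
    """
    return data.decode(src_charset).encode(dst_charset)


def survives_round_trip(data: bytes, repo_charset: str, wc_charset: str) -> bool:
    """Report whether repo -> working copy -> repo reproduces the bytes."""
    try:
        working = reencode(data, repo_charset, wc_charset)
    except UnicodeEncodeError:
        return False
    return reencode(working, wc_charset, repo_charset) == data


# Sample inputs (illustrative only).
mixed = "Hello, мир\n".encode("utf-8")      # English + Russian
unmappable = "Hello, 中\n".encode("utf-8")   # character with no KOI8-R form
```

A document containing only English and Russian should pass the
round-trip check, while one containing a character outside KOI8-R
cannot be converted at all, so a real implementation would need to
decide what to do in that case.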

So here's my suggestion. Associate with each file two properties.
The "svn:charset" property works like the charset parameter on a
Content-Type header (see RFC 2046). The "svn:line-ending" property
says what line ending convention is used (LF, CRLF, CR)[1]. Both of
these properties indicate what the file's natural format within the
repository is. It's an error to have one but not the other. The
absence of both properties indicates a binary file.
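
The rule in the paragraph above could be checked roughly as follows.
This is only a sketch: the property names come from the proposal, but
the classify function and the plain dict of properties are
assumptions made for illustration, not an existing Subversion API.

```python
# Line ending conventions named in the proposal.
LINE_ENDINGS = {"LF", "CRLF", "CR"}


def classify(props: dict) -> str:
    """Return "text" or "binary" for a file's property dict.

    Raises ValueError if only one of the two properties is present,
    or if the line-ending value is not one of LF, CRLF, CR.
    """
    has_charset = "svn:charset" in props
    has_eol = "svn:line-ending" in props
    if has_charset != has_eol:
        raise ValueError("svn:charset and svn:line-ending must be set together")
    if not has_charset:
        return "binary"  # neither property present
    if props["svn:line-ending"] not in LINE_ENDINGS:
        raise ValueError("unknown line ending: %r" % props["svn:line-ending"])
    return "text"
```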

Normally, Subversion just stores these properties; it doesn't do
anything special with them. However, users may specify a (charset,
line-ending) pair when they check out a working copy. If they do,
then every file whose charset and line-ending properties differ from
that pair is converted to it on checkout, and converted back to its
official pair before checkin or diff generation.
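
A rough Python sketch of that conversion and back-conversion. The
convert function and the (charset, line-ending) tuples are
hypothetical, and the sketch assumes each file consistently uses its
declared line ending; a file with mixed line endings might not
round-trip.

```python
from typing import Tuple

# Map the proposal's line-ending names to the characters they denote.
EOL = {"LF": "\n", "CRLF": "\r\n", "CR": "\r"}

Format = Tuple[str, str]  # (charset, line-ending), e.g. ("utf-8", "LF")


def convert(data: bytes, src: Format, dst: Format) -> bytes:
    """Convert file contents from one (charset, line-ending) pair to another."""
    if src == dst:
        return data  # nothing to do, leave the bytes untouched
    src_charset, src_eol = src
    dst_charset, dst_eol = dst
    text = data.decode(src_charset)
    text = text.replace(EOL[src_eol], EOL[dst_eol])
    return text.encode(dst_charset)


repo_format = ("utf-8", "LF")        # the file's official pair
wc_format = ("koi8-r", "CRLF")       # the pair requested at checkout

repo_bytes = "привет\nмир\n".encode("utf-8")
wc_bytes = convert(repo_bytes, repo_format, wc_format)       # on checkout
restored = convert(wc_bytes, wc_format, repo_format)         # before checkin/diff
```

Converting back before diffing means the repository only ever sees
the official pair, whatever the working copy uses.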

Poke holes, anyone?

zw

[1] LFCR is a theoretical possibility, handled by gcc3 because we
heard some reports of its showing up in real life, but Subversion
probably needn't bother worrying about it.

Received on Sat Oct 21 14:36:52 2006

This is an archived mail posted to the Subversion Dev mailing list.
