On Sunday 27 June 2004 20:03, Greg Hudson wrote:
> Marcus Sundman wrote:
> > Most content management (CM) systems already convert line breaks in
> > text files between the unix, mac and windows types. Text encoding is
> > even more important since it has far reaching consequences (if you
> > have written "foo" you do not want it to say "bar" when it's in
> > production code) and errors can be very hard to detect. Despite this
> > fact many CM systems seem to ignore the issue completely [...]
>
> Ben wrote:
> > [A description of Subversion's current features in this area,
> > confirming that Subversion punts on character encodings of file
> > contents]
>
> There are a few things going on here:
>
>   * Traditionally, version control systems have been used to manage
>     source code, which is traditionally written in ASCII.  Localized
>     user messages typically come from PO files or the
>     language-specific equivalent, which are stored in UTF-8 or in
>     mixed charsets, not in the local charset.
The actual source code might be written in 7-bit US-ASCII, but string 
constants, comments, annotations etc. often are not.
>   * Although many development teams mix Windows and Mac or Unix
>     machines, I bet not so many mix people using different charset
>     encodings.
You'd certainly lose that bet. Although Windows-1252 and ISO-8859-1 are 
quite similar, they are not the same. E.g., the byte used for an em dash in 
Windows-1252 maps to a control character in ISO-8859-1. And then when the 
Windows users use command-line utilities they are suddenly using cp850, 
which is completely different. (You can of course run "chcp 1252", but that 
leads to a plethora of other problems.)
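A minimal Python sketch of that mismatch (the byte values below are facts 
about the code pages themselves, nothing Subversion-specific):

```python
# The em dash shows how "similar" single-byte encodings diverge.
dash = "\u2014"  # EM DASH

# In Windows-1252 the em dash is byte 0x97 ...
print(dash.encode("cp1252"))  # b'\x97'

# ... but in ISO-8859-1 byte 0x97 is a C1 control character, not a dash.
print(repr(b"\x97".decode("iso-8859-1")))  # '\x97' (a control character)

# And cp850, the usual DOS command-line code page, has no em dash at all.
try:
    dash.encode("cp850")
except UnicodeEncodeError:
    print("cp850 cannot represent an em dash")
```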
Then we have many people working in different languages who want to use 
UTF-8 or a similar multi-byte encoding, which isn't well supported on all 
platforms.
All in all, many people are NOT using the same character encoding, even 
though they might think they are.
> At first glance, it would be consistent with Subversion's current
> feature offerings, and not a tremendous amount of code, to add a
> feature where you can set svn:encoding to "native" or to an LC_CHARSET
> value, and Subversion would transcode the file's contents from UTF-8
> to the stated encoding after newline and keyword translation. 
I think the correct place for the encoding is the charset parameter of 
"svn:mime-type" (see RFC 2045), and then you'd need a boolean property 
"svn:auto-filter" which indicates whether to transcode (including 
converting line breaks) on input/output or not.
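As a sketch of what that would look like: an RFC 2045 media type with a 
charset parameter can be parsed by any MIME-aware library; here using 
Python's stdlib (the property value is illustrative, and "svn:auto-filter" 
is only the hypothetical property proposed above, not an existing one):

```python
from email.message import EmailMessage

# Parse an RFC 2045 Content-Type value of the kind a client might
# store in the svn:mime-type property.
msg = EmailMessage()
msg["Content-Type"] = "text/plain; charset=ISO-8859-1"

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # iso-8859-1
```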
> The moment I use a character which
> can't be represented in your encoding, you can no longer check out
> that file properly.
That is already the case. There are workarounds, though. E.g., you could 
replace each unsupported character with "{uN}", where N is the Unicode 
code point.
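That workaround can be prototyped with a custom codec error handler in 
Python; note the handler name "u-escape" and the "{uN}" syntax are just 
this thread's suggestion, not any standard:

```python
import codecs

def u_escape(err):
    # Replace each character the target charset cannot represent with
    # "{uN}", where N is its Unicode code point.
    repl = "".join("{u%d}" % ord(ch)
                   for ch in err.object[err.start:err.end])
    return (repl, err.end)

codecs.register_error("u-escape", u_escape)

print("caf\u00e9 \u2014 na\u00efve".encode("ascii", "u-escape"))
# b'caf{u233} {u8212} na{u239}ve'
```

Decoding such an escape back to the original character would of course 
need a matching inverse pass on checkout.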
> > Therefore we are faced with three options:
> > A) Get all systems to standardize on one encoding and one type of
> > line breaks.
> >
> > I think the first two options are out of the question [...]
>
> While standardizing on one type of line breaks is likely to remain
> painful for a long time, standardizing on one encoding (specifically,
> UTF-8) seems like a great idea.  Why do you say it's out of the
> question?
Great idea, yes, but still painful. Good luck trying to get UTF-8 to work 
properly on Windows, or even on Linux. Of course this would be best, but I 
just don't see it as realistic within a reasonable time frame.
Then, once everyone has switched to UTF-8, we get the "line break issue" 
fixed for free, since there is only one proper line break character in 
Unicode, namely U+2028 (LINE SEPARATOR).
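For what it's worth, Unicode-aware line splitting already recognizes 
U+2028 as a line boundary; e.g. Python's str.splitlines:

```python
# str.splitlines follows the Unicode line-boundary rules, so U+2028
# (LINE SEPARATOR) splits lines just like "\n" does.
text = "first line\u2028second line"
print(text.splitlines())  # ['first line', 'second line']
```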
- Marcus Sundman
Received on Thu Jul  1 17:39:42 2004