On Sunday 27 June 2004 20:03, Greg Hudson wrote:
> Marcus Sundman wrote:
> > Most content management (CM) systems already convert line breaks in
> > text files between the unix, mac and windows types. Text encoding is
> > even more important since it has far reaching consequences (if you
> > have written "foo" you do not want it to say "bar" when it's in
> > production code) and errors can be very hard to detect. Despite this
> > fact many CM systems seem to ignore the issue completely [...]
>
> Ben wrote:
> > [A description of Subversion's current features in this area,
> > confirming that Subversion punts on character encodings of file
> > contents]
>
> There are a few things going on here:
>
>   * Traditionally, version control systems have been used to manage
>     source code, which is traditionally written in ASCII.  Localized
>     user messages typically come from PO files or the
>     language-specific equivalent, which are stored in UTF-8 or in
>     mixed charsets, not in the local charset.
The actual source code might be written in 7-bit US-ASCII, but string 
constants, comments, annotations etc. often are not.
>   * Although many development teams mix Windows and Mac or Unix
>     machines, I bet not so many mix people using different charset
>     encodings.
You'd certainly lose that bet. Although Windows-1252 and ISO-8859-1 are 
quite similar, they are not the same. E.g., the byte used for an em dash in 
Windows-1252 maps to a control character in ISO-8859-1. And then when the 
Windows users use command-line utilities they are suddenly using cp850, 
which is completely different. (You can of course run "chcp 1252", but that 
leads to a plethora of other problems.)
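A minimal Python sketch of that mismatch (the byte values below are facts 
about the code pages themselves, nothing Subversion-specific):

```python
# The em dash shows how "similar" single-byte encodings diverge.
dash = "\u2014"  # EM DASH

# In Windows-1252 the em dash is byte 0x97 ...
print(dash.encode("cp1252"))  # b'\x97'

# ... but in ISO-8859-1 byte 0x97 is a C1 control character, not a dash.
print(repr(b"\x97".decode("iso-8859-1")))  # '\x97' (a control character)

# And cp850, the usual DOS command-line code page, has no em dash at all.
try:
    dash.encode("cp850")
except UnicodeEncodeError:
    print("cp850 cannot represent an em dash")
```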
Then we have many people working in different languages who want to use 
UTF-8 or a similar multi-byte encoding, which isn't well supported on all 
platforms.
All in all, many people are NOT using the same character encoding, even 
though they might think they are.
> At first glance, it would be consistent with Subversion's current
> feature offerings, and not a tremendous amount of code, to add a
> feature where you can set svn:encoding to "native" or to an LC_CHARSET
> value, and Subversion would transcode the file's contents from UTF-8
> to the stated encoding after newline and keyword translation. 
I think the correct place for the encoding is the charset parameter of 
"svn:mime-type" (see RFC 2045), and then you'd need a boolean property 
"svn:auto-filter" which indicates whether to transcode (including 
converting line breaks) on input/output or not.
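As a sketch of what that would look like: an RFC 2045 media type with a 
charset parameter can be parsed by any MIME-aware library; here using 
Python's stdlib (the property value is illustrative, and "svn:auto-filter" 
is only the hypothetical property proposed above, not an existing one):

```python
from email.message import EmailMessage

# Parse an RFC 2045 Content-Type value of the kind a client might
# store in the svn:mime-type property.
msg = EmailMessage()
msg["Content-Type"] = "text/plain; charset=ISO-8859-1"

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # iso-8859-1
```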
> The moment I use a character which
> can't be represented in your encoding, you can no longer check out
> that file properly.
That is already the case. There are workarounds, though. E.g., you could 
replace each unsupported character with "{uN}", where N is the Unicode 
code point.
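That workaround can be prototyped with a custom codec error handler in 
Python; note the handler name "u-escape" and the "{uN}" syntax are just 
this thread's suggestion, not any standard:

```python
import codecs

def u_escape(err):
    # Replace each character the target charset cannot represent with
    # "{uN}", where N is its Unicode code point.
    repl = "".join("{u%d}" % ord(ch)
                   for ch in err.object[err.start:err.end])
    return (repl, err.end)

codecs.register_error("u-escape", u_escape)

print("caf\u00e9 \u2014 na\u00efve".encode("ascii", "u-escape"))
# b'caf{u233} {u8212} na{u239}ve'
```

Decoding such an escape back to the original character would of course 
need a matching inverse pass on checkout.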
> > Therefore we are faced with three options:
> > A) Get all systems to standardize on one encoding and one type of
> > line breaks.
> >
> > I think the first two options are out of the question [...]
>
> While standardizing on one type of line breaks is likely to remain
> painful for a long time, standardizing on one encoding (specifically,
> UTF-8) seems like a great idea.  Why do you say it's out of the
> question?
Great idea, yes, but still painful. Good luck trying to get UTF-8 to work 
properly on Windows, or even on Linux. Of course this would be best, but I 
just don't see it as realistic within a reasonable time frame.
Then, once everyone has switched to UTF-8, we get the "line break issue" 
fixed for free, since there is only one proper line break character in 
Unicode, namely U+2028 (LINE SEPARATOR).
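For what it's worth, Unicode-aware line splitting already recognizes 
U+2028 as a line boundary; e.g. Python's str.splitlines:

```python
# str.splitlines follows the Unicode line-boundary rules, so U+2028
# (LINE SEPARATOR) splits lines just like "\n" does.
text = "first line\u2028second line"
print(text.splitlines())  # ['first line', 'second line']
```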
- Marcus Sundman
Received on Thu Jul  1 17:39:42 2004