
Re: How does subversion handle encodings?

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2004-06-27 19:03:03 CEST

Marcus Sundman wrote:
> Most content management (CM) systems already convert line breaks in
> text files between the unix, mac and windows types. Text encoding is
> even more important since it has far reaching consequences (if you
> have written "foo" you do not want it to say "bar" when it's in
> production code) and errors can be very hard to detect. Despite this
> fact many CM systems seem to ignore the issue completely [...]

Ben wrote:
> [A description of Subversion's current features in this area,
> confirming that Subversion punts on character encodings of file
> contents]

There are a few things going on here:

  * Version control systems have traditionally been used to manage
    source code, which is usually written in ASCII. Localized user
    messages typically come from PO files or the language-specific
    equivalent, which are stored in UTF-8 or in mixed charsets, not
    in the local charset.

  * Although many development teams mix Windows and Mac or Unix
    machines, I'd bet far fewer mix people who use different
    character encodings.

  * A version control system is supposed to be about versioning, not
    so much about file interchange. (Perhaps a "CM system" is also
    about file interchange, but Subversion isn't a CM system.) Adding
    just the newline translation functions was bothersome to many
    Subversion developers.

At first glance, it would be consistent with Subversion's current
feature offerings, and not a tremendous amount of code, to add a
feature where you can set svn:encoding to "native" or to an LC_CHARSET
value, and Subversion would transcode the file's contents from UTF-8
to the stated encoding after newline and keyword translation. (This
would make it extra-important to fix the way "svn diff" works, so that
it translates the text-base to wc format and diffs against the wc
file, instead of detranslating the wc file to text-base format and
diffing against the text base.)
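
To make the proposed ordering concrete, here is a minimal sketch in
Python of what such a translation might look like. This is not
Subversion code: the function names and parameters are hypothetical,
keyword expansion is left out, and the text base is assumed to be
UTF-8 with LF line endings.

```python
def translate_for_working_copy(text_base: bytes, eol: str, encoding: str) -> bytes:
    """Hypothetical checkout-side translation: text base -> working-copy format.

    Assumes the text base is UTF-8 with LF newlines.
    """
    text = text_base.decode("utf-8")
    # Step 1: newline translation (a stand-in for svn:eol-style handling).
    text = text.replace("\n", eol)
    # Keyword expansion would happen here; it is omitted from this sketch.
    # Step 2: transcode from UTF-8 to the requested encoding.
    return text.encode(encoding)


def detranslate_for_repository(wc_file: bytes, eol: str, encoding: str) -> bytes:
    """Hypothetical commit-side inverse: working-copy format -> text base."""
    text = wc_file.decode(encoding)
    text = text.replace(eol, "\n")
    return text.encode("utf-8")
```

In this sketch, translating and then detranslating gives back the
original bytes only as long as every character in the file exists in
the target encoding.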

I'd worry that such a feature would wind up being more trouble to
users than it was worth, though. The moment I use a character which
can't be represented in your encoding, you can no longer check out
that file properly.
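
A small Python sketch of that failure mode, assuming the repository
stores UTF-8 and the client asks for a single-byte encoding (the
helper name is made up for illustration):

```python
def can_check_out(text_base: bytes, encoding: str) -> bool:
    """Return True if every character of the UTF-8 text base can be
    represented in the requested working-copy encoding."""
    text = text_base.decode("utf-8")
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # At least one character has no representation in `encoding`.
        return False
    return True


# Accented Latin letters fit in ISO-8859-1 ...
fits = can_check_out("caf\u00e9".encode("utf-8"), "latin-1")
# ... but the euro sign (U+20AC) does not, so a single commit containing
# it would make the file impossible to transcode for a Latin-1 user.
breaks = can_check_out("price: 5 \u20ac".encode("utf-8"), "latin-1")
```

Here `fits` is true and `breaks` is false; the same euro sign would
be fine for a client using windows-1252, so whether a checkout works
would depend on who is asking.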

Marcus wrote:
> Therefore we are faced with three options:
> A) Get all systems to standardize on one encoding and one type of
> line breaks.

> I think the first two options are out of the question [...]

While standardizing on one type of line breaks is likely to remain
painful for a long time, standardizing on one encoding (specifically,
UTF-8) seems like a great idea. Why do you say it's out of the
question?

Received on Mon Jun 28 20:00:08 2004

This is an archived mail posted to the Subversion Users mailing list.
