Re: [PATCH] $LastChangedDate$ encoding

From: Vincent Lefevre <vincent+svn_at_vinc17.org>
Date: 2006-05-07 19:42:24 CEST

On 2006-05-07 09:07:14 -0500, Peter Samuelson wrote:
> [Vincent Lefevre]
> > > The encoding should be consistent with filenames, which are also
> > > specific to a WC.
> >
> > There's absolutely no reason why they should be the same.
>
> I gave you a reason earlier. There are many situations where you want
> to embed a filename inside a file. (I said this in the context of XML
> files, but that's by no means the only example.)

You can still include UTF-8 encoded filenames. Anyway, how filenames
are encoded on disk is OS / file system specific (e.g., under
Mac OS X / HFS+, it *must* be UTF-8). However, I don't think the
encoding used in the file contents should be OS-specific.

> > * Using UTF-8 (current behavior):
> > + Pros: fixed encoding; no loss; compatible with file formats
> > based on UTF-8, which are common (UTF-8 is more or less the
> > default encoding nowadays).
> > + Cons: may be incompatible with some documents.
>
> Also may be incompatible with user expectations.

This is true for *any* choice, so I didn't bother to mention it.

> I daresay it is very common to use Subversion in an environment where
> either you're only a single user, or all users have the same locale
> settings.

This may happen, but the opposite is much more common. All those
Subversion repositories publicly available on the Internet can be
accessed by anyone, with all the possible locales...

Even within a single country, several encodings can be in use. For
instance, in France, the most common are ISO-8859-1, ISO-8859-15 and
UTF-8 (more and more users are switching to UTF-8, but many of them
are still using one of the first two).
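
To make the difference concrete, here is a small Python sketch (not
taken from Subversion; the sample word and the helper name
decodes_as_utf8 are only illustrative assumptions). It encodes a French
month name, as might appear in a localized date, in the three encodings
mentioned above and checks whether each byte string is valid UTF-8.

```python
# Illustrative sketch only: the sample word and helper name are assumptions.
word = "août"  # French for "August"; contains the non-ASCII letter U+00FB

latin1 = word.encode("iso-8859-1")
latin9 = word.encode("iso-8859-15")
utf8 = word.encode("utf-8")

# The euro sign is one of the few characters where the two ISO encodings
# differ: it exists in ISO-8859-15 but cannot be encoded in ISO-8859-1.
euro_latin9 = "€".encode("iso-8859-15")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for label, data in (("ISO-8859-1", latin1),
                    ("ISO-8859-15", latin9),
                    ("UTF-8", utf8)):
    print(label, data, "valid UTF-8:", decodes_as_utf8(data))
```

The two ISO encodings give the same single byte for this letter, while
UTF-8 uses two bytes, so a tool expecting one encoding will misread or
reject text written in the other.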

> Subversion localises everything very well - users never have
> to know or care that it is thinking in UTF-8 under the hood. The only
> instance I know where it does not do this is in keyword expansions.

This is not true. With publicly available repositories, files may be
in various encodings (mainly ISO-8859-1 and UTF-8), and Subversion
won't convert their contents into the locale encoding.

> Also, you seem to assume that the common case is files with a
> well-defined encoding, like XML documents. I doubt that. I guess it
> is more common to use Subversion to store text documents and program
> source code, not XML. And program source code rarely has a
> well-defined encoding; typically users write their comments in the same
> encoding they are using for the rest of their computing.

This may be true when a user writes a file only for personal use, but
for sources that are shared amongst many users (such as all free
software), it no longer holds.

> (Side note: if it were really true that "UTF-8 is more or less the
> default encoding nowadays", then this whole question would be a
> non-issue, as users would all be using UTF-8 for LC_CTYPE.)

No. LC_CTYPE may have some value, but files may have other encodings.
This is particularly true with XML, but can occur with other files
(e.g. ChangeLogs in Debian are encoded in UTF-8). Again, remember
that files may be shared, in particular when using Subversion.

> > * Using the encoding specified by the locales:
> > + Pros: compatible with tools that don't understand encodings
> > different from the one specified by the locales.
>
> Which is to say, most tools with most file formats.

I'd say the opposite. But this really depends on what tools people
are using.

> At least on my Unix box, very few tools I use automatically recode
> file content when outputting to my terminal. I can only think of
> vorbiscomment and iconv.

I can cite: emacs, mutt, various web browsers, screen.

> (And vorbiscomment doesn't count - I don't think you can put
> keywords into ogg vorbis files, since they expand to variable
> lengths.)

No problem. From the Subversion book:

  Subversion 1.2 introduced a new variant of the keyword syntax which
  brought additional, useful—though perhaps atypical—functionality.
  You can now tell Subversion to maintain a fixed length (in terms of
  the number of bytes consumed) for the substituted keyword. By using
  a double-colon (::) after the keyword name, followed by a number of
  space characters, you define that fixed width. When Subversion goes
  to substitute your keyword for the keyword and its value, it will
  essentially replace only those space characters, leaving the overall
  width of the keyword field unchanged. If the substituted value is
  shorter than the defined field width, there will be extra padding
  characters (spaces) at the end of the substituted field; if it is
  too long, it is truncated with a special hash (#) character just
  before the final dollar sign terminator.
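
The behaviour described in that excerpt can be modelled roughly as
follows. This is a simplified Python sketch written for illustration,
not Subversion's actual code: the function name expand_fixed_width, the
regular expression, and the exact placement of the leading space are my
own assumptions. It works on bytes, since the book defines the width in
bytes.

```python
import re

# Hypothetical pattern: "$Name::" followed by one or more spaces and "$".
FIELD = re.compile(rb"\$([A-Za-z]+)::( +)\$")


def expand_fixed_width(text: bytes, values: dict) -> bytes:
    """Fill fixed-width keyword fields without changing the line length.

    Simplified model (an assumption, not Subversion's implementation):
    the field holds one space, then the value, then space padding; a
    value that does not fit is cut and ends with '#'.
    """
    def substitute(match):
        name = match.group(1)
        width = len(match.group(2))  # bytes available between "::" and "$"
        value = values.get(name)
        if value is None or width < 2:
            return match.group(0)  # unknown keyword or no room: leave as-is
        body = b" " + value
        if len(body) <= width:
            body = body.ljust(width)          # pad with spaces
        else:
            body = body[: width - 1] + b"#"   # truncate, mark with '#'
        return b"$" + name + b"::" + body + b"$"

    return FIELD.sub(substitute, text)


print(expand_fixed_width(b"$Rev::     $", {b"Rev": b"13"}))
print(expand_fixed_width(b"$Author::      $", {b"Author": b"jrandom"}))
```

Note that in this sketch, truncating by bytes could cut a multi-byte
UTF-8 character in half; whether the real implementation guards against
that is not something the sketch tells us.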

-- 
Vincent Lefèvre <vincent_at_vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA

Received on Sun May 7 19:42:53 2006