
Re: svn checkout - special characters in file name are not encoding properly

From: Vincent Lefevre <vincent-svn_at_vinc17.net>
Date: Wed, 11 Aug 2010 00:31:48 +0200

On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
> On Tue, Aug 10, 2010 at 07:44:35PM +0200, Vincent Lefevre wrote:
> > This is easy (at least from the specification point of view): once the
> > encoding has been determined[*], typically at checkout time, store the
> > encoding in the WC metadata (with the current WC layout, that would be
> > some file under the .svn directory), so that the next time the svn
> > client is used for this WC, the same encoding will be used, avoiding
> > inconsistencies (such as currently obtained by two "svn up" under two
> > different locales).
> I doubt this can be made to work properly. A feature like that is just
> asking people to shoot themselves in the foot.

I don't see any problem with it. If you want another method, then fine,
but in any case, a command like "svn up" should not fail just because
it is executed under locales unexpected by the client.

> People simply should not mix character sets like that in their
> working copies.

It seems that you didn't understand what I proposed. My proposal is
just to *avoid* mixing character sets in filenames (contrary to what
svn currently does), i.e. to use a single character set, defined at
checkout time (for instance).
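
A minimal sketch of that idea, in Python for brevity. The metadata file name `.svn/filename-encoding` and both helper functions are hypothetical, invented here for illustration; Subversion has no such file or API. The point is only that the encoding is written once at checkout and read back on later commands, whatever the current locale is.

```python
import locale
from pathlib import Path

# Hypothetical metadata file; Subversion does not define it.
ENCODING_FILE = Path(".svn") / "filename-encoding"


def record_encoding(wc_root, encoding=None):
    """At checkout time: remember the filename encoding chosen for this WC."""
    if encoding is None:
        # Fall back to the encoding of the locale in effect at checkout.
        encoding = locale.getpreferredencoding(False)
    path = Path(wc_root) / ENCODING_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(encoding + "\n", encoding="ascii")
    return encoding


def wc_encoding(wc_root):
    """On later commands: reuse the recorded encoding, ignoring the locale."""
    path = Path(wc_root) / ENCODING_FILE
    if path.exists():
        return path.read_text(encoding="ascii").strip()
    # No record (e.g. an old working copy): behave as today.
    return locale.getpreferredencoding(False)
```

With something like this, an "svn up" run under LC_ALL=POSIX in a working copy checked out under a UTF-8 locale would still use UTF-8 for filenames.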

> There should be a project-wide convention about the encoding used for
> filenames, and everyone should be using that encoding

For the repository, of course, but it is already the case: UTF-8.
For working copies, if a single encoding must be defined, it should
be UTF-8 too, in particular to be sure to be able to represent all
the filenames that can occur.

> (unless there
> really is a project-specific need to have filenames in multiple encodings
> for some reason, but that's really rare -- and whoever does this should be
> smart enough to deal with the consequences).
> Right now, if the filename cannot be represented in the current locale,
> you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"

which is bad and prevents users from writing POSIX-conforming scripts
using svn, i.e. under the POSIX locale (except on systems where the
POSIX locale uses UTF-8, but I don't know of any).

> The native encoding is determined by the locale, but that does not matter.
> The point is that, wherever encoding configuration happens to come from,
> if the configured encoding cannot represent the character string stored
> as UTF-8 in the repository, what is Subversion supposed to do? It cannot
> really do anything with a filename it cannot represent in the character
> set configured by the user, other than throwing an error.

For filenames stored on disk, they (all of them) can be encoded using
UTF-8. Remember, filenames on a POSIX system are just sequences of
bytes. For what is output to the terminal, non-representable
characters can be displayed by a replacement character such as "?".
This can still be better than an error.
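
To illustrate the distinction between the on-disk name and the displayed name, here is a small Python sketch. The function names are made up for this example and the terminal encoding defaults to ASCII as a stand-in for the POSIX locale; this is not how svn is implemented.

```python
def disk_name(name: str) -> bytes:
    """Bytes to store on disk: UTF-8 can encode every filename."""
    return name.encode("utf-8")


def display_name(name: str, terminal_encoding: str = "ascii") -> str:
    """Text to show on the terminal: lossy, but never an error."""
    # errors="replace" substitutes "?" for each unrepresentable character.
    raw = name.encode(terminal_encoding, errors="replace")
    return raw.decode(terminal_encoding)


name = "Lef\u00e8vre.txt"          # contains U+00E8 (e with grave accent)
on_disk = disk_name(name)          # b"Lef\xc3\xa8vre.txt"
shown = display_name(name)         # "Lef?vre.txt"
```

Only the displayed string is lossy; the bytes on disk still round-trip to the exact UTF-8 name received from the server.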

> The filename conversion to UTF-8 and back must not be lossy. Because
> to uniquely identify a file the client needs to send the same UTF-8 byte
> sequence it got from the server back to the server. And it needs to keep
> doing so for backwards compatibility. This is biting us on Mac OS X by the
> way, because some characters have multiple representations in UTF-8,
> see http://subversion.tigris.org/issues/show_bug.cgi?id=2464

This problem is due to the fact that Subversion doesn't enforce a
canonical representation (either NFC or NFD).
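
For example, the following Python sketch shows the same visible name with two different UTF-8 byte sequences, and how normalizing to one form makes them compare equal. The `canonical` helper is hypothetical and only wraps the standard `unicodedata.normalize`; which form a client should enforce is left open here.

```python
import unicodedata

NFC_NAME = "\u00e9.txt"    # precomposed: U+00E9
NFD_NAME = "e\u0301.txt"   # decomposed: "e" + U+0301 (combining acute accent)

# Same name to a human reader, different bytes on the wire:
#   NFC_NAME.encode("utf-8") == b"\xc3\xa9.txt"
#   NFD_NAME.encode("utf-8") == b"e\xcc\x81.txt"


def canonical(name: str, form: str = "NFC") -> str:
    """Return the name in a single canonical form (NFC or NFD)."""
    return unicodedata.normalize(form, name)


same_after_normalizing = canonical(NFC_NAME) == canonical(NFD_NAME)
```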

Anyway, there would still be problems with case-insensitive filesystems,
for instance.

> > [*] There are several ways to do that, such as:
> > 1. Use a charset specified by the user in the svn config file.
> That provides no advantage over checking the current locale.

The advantage is that the user doesn't need to remember to use a UTF-8
based locale for the checkout. This would also allow the user to run
checkouts from portable POSIX scripts (i.e. with LC_ALL=POSIX).

> > 2. Use the current locale.
> That's what's being done. But we're not writing the information down in the
> working copy meta data, and doing so is quite pointless as described above.

It's not pointless, or at least, something else needs to be done.
Currently "svn up" fails to work, and that's a problem.

Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)
Received on 2010-08-11 00:32:28 CEST

This is an archived mail posted to the Subversion Users mailing list.
