Re: svn checkout - special characters in file name are not encoding properly

From: Stefan Sperling <stsp_at_elego.de>
Date: Tue, 10 Aug 2010 20:59:00 +0200

On Tue, Aug 10, 2010 at 07:44:35PM +0200, Vincent Lefevre wrote:
> On 2010-08-10 17:42:57 +0200, Stefan Sperling wrote:
> > There are extensions in some systems like Linux, where filename encoding
> > can be specified at mount time and a process can query this information.
> > But the actual encoding of filenames might still differ (e.g. due to user
> > error). But more importantly since there is no common standard I don't
> > see how you'd solve this problem in a portable way.
>
> This is easy (at least from the specification point of view): once the
> encoding has been determined[*], typically at checkout time, store the
> encoding in the WC metadata (with the current WC layout, that would be
> some file under the .svn directory), so that the next time the svn
> client is used for this WC, the same encoding will be used, avoiding
> inconsistencies (such as currently obtained by two "svn up" under two
> different locales).

I doubt this can be made to work properly. A feature like that is just
asking people to shoot themselves in the foot.

People simply should not mix character sets like that in their working copies.
There should be a project-wide convention about the encoding used for
filenames, and everyone should be using that encoding (unless there
really is a project-specific need to have filenames in multiple encodings
for some reason, but that's really rare -- and whoever does this should be
smart enough to deal with the consequences).

Right now, if the filename cannot be represented in the current locale,
you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"

The native encoding is determined by the locale, but that does not matter.
The point is that, wherever encoding configuration happens to come from,
if the configured encoding cannot represent the character string stored
as UTF-8 in the repository, what is Subversion supposed to do? It cannot
really do anything with a filename it cannot represent in the character
set configured by the user, other than throwing an error.
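
As a rough illustration of that dead end, here is a minimal Python sketch. It
is not Subversion's actual code; the helper name to_native and the sample
filename are made up for the example. A UTF-8 name either converts to the
configured encoding without loss, or the conversion has to fail:

```python
def to_native(utf8_name: bytes, native_encoding: str) -> bytes:
    """Convert a repository filename (UTF-8 bytes) to the native encoding.

    Hypothetical helper for illustration only. Raises ValueError when the
    name cannot be represented in the target encoding.
    """
    text = utf8_name.decode("utf-8")
    try:
        # Strict encoding: no replacement characters, so nothing is lost.
        return text.encode(native_encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"Can't convert {text!r} from UTF-8 to {native_encoding}"
        ) from exc


# Sample filename as it might be stored in the repository (assumed).
repo_name = "r\u00e9sum\u00e9.txt".encode("utf-8")

# Representable in ISO-8859-1: the conversion succeeds and round-trips
# back to the exact UTF-8 bytes the server sent.
latin1 = to_native(repo_name, "iso-8859-1")
round_trip_ok = latin1.decode("iso-8859-1").encode("utf-8") == repo_name

# Not representable in ASCII (e.g. the C locale): raising is the only
# option that does not silently change the name.
try:
    to_native(repo_name, "ascii")
    ascii_failed = False
except ValueError:
    ascii_failed = True
```

Substituting a placeholder such as '?' instead of raising would let the
checkout proceed, but the client could then no longer reconstruct the
original UTF-8 name from what is on disk.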

The filename conversion to UTF-8 and back must not be lossy, because
to uniquely identify a file the client needs to send the same UTF-8 byte
sequence it got from the server back to the server. And it needs to keep
doing so for backwards compatibility. This is biting us on Mac OS X, by the
way, because some characters have multiple representations in Unicode
(precomposed vs. decomposed), and these yield different UTF-8 byte sequences;
see http://subversion.tigris.org/issues/show_bug.cgi?id=2464
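
The multiple-representation problem can be sketched in Python as follows.
This only demonstrates the Unicode behaviour; the choice of "é" is an
arbitrary example and the variable names are illustrative:

```python
import unicodedata

# The same visible character "é" in its two canonical forms.
composed = unicodedata.normalize("NFC", "\u00e9")    # U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # U+0065 U+0301

composed_bytes = composed.encode("utf-8")      # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x81'

# Equal as text after normalization, but different as byte sequences.
same_text = unicodedata.normalize("NFC", decomposed) == composed
same_bytes = composed_bytes == decomposed_bytes
```

Here same_text is True while same_bytes is False. A filesystem that hands
back the decomposed form for a name that was committed in composed form
therefore gives the client a byte sequence that no longer matches what the
server has.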

> [*] There are several ways to do that, such as:
> 1. Use a charset specified by the user in the svn config file.

That provides no advantage over checking the current locale.

> 2. Use the current locale.

That's what's being done. But we're not writing the information down in the
working copy metadata, and doing so would be quite pointless, as described above.

Stefan
Received on 2010-08-10 20:59:41 CEST

This is an archived mail posted to the Subversion Users mailing list.
