
Re: Evil UTF-8 Character in filename in repo causing issues on my wc

From: Stefan Sperling <stsp_at_elego.de>
Date: Wed, 22 Jun 2011 16:28:31 +0200

On Wed, Jun 22, 2011 at 03:42:42PM +0200, Vincent Lefevre wrote:
> On 2011-06-15 12:29:37 +0200, Stefan Sperling wrote:
> > Unicode, and its quirk of allowing the *same* character to be encoded
> > in *different* ways, came much later.
> >
> > I think it is unfortunate that Apple broke with the concept that a
> > filename is just a string of bytes.
>
> It's also unfortunate that Subversion breaks this concept too. :)
>
> I mean: do a checkout of a repository containing non-ASCII characters
> under Linux. Then change the locales (e.g. ISO-8859-1 -> UTF-8). Do
> an update. And see the errors...

I don't agree. This is a different problem.

Subversion internally converts path names from the native encoding into
UTF-8 before sending them to the repository, because paths are stored as
UTF-8 there. This way, all encodings used on client systems can be represented
in the repository. It works fine with client systems that do not support
UTF-8 natively at all, as long as they use some encoding that iconv
understands. And this is all happening *within* the application.
The rules that svn uses to create filenames are clear and consistent.
They require users not to flip locales willy-nilly, but that's the
tradeoff with relying on the locale. Locales only support one encoding
at a time.
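
To make the locale tradeoff concrete, here is a rough Python sketch. It is
not Subversion's actual code; it only mimics the idea of writing a name to
disk in the native encoding and later reading it back under a different one.
The file name is made up for illustration.

```python
# Illustrative sketch only; Subversion itself is written in C and relies on
# the platform's conversion routines for this.
name = "caf\u00e9.txt"  # hypothetical file name as stored (UTF-8) in the repository

# Checkout under an ISO-8859-1 locale: the name is converted to the
# native encoding before it is written to disk.
on_disk = name.encode("iso-8859-1")

# Reading the name back under the same locale recovers it.
same_locale = on_disk.decode("iso-8859-1")

# After flipping the locale to UTF-8, the very same bytes are no longer
# valid input for the conversion.
try:
    on_disk.decode("utf-8")
    flipped_ok = True
except UnicodeDecodeError:
    flipped_ok = False
```

The bytes on disk never changed; only the rule used to interpret them did,
and that change was made by the user, not behind the application's back.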

What Apple does is transform the byte sequence behind the application's back.
So the application cannot rely on its *own* rules for creating filenames
when it runs again and reads the names back from disk, because the OS is
interfering with these rules.
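
A small Python sketch of what that looks like from the application's side.
The call to unicodedata.normalize() is only a stand-in for the filesystem
(an assumption for illustration; as far as I know HFS+ uses its own variant
of decomposition, not exactly NFD), and the file name is made up.

```python
import unicodedata

# Name as the application creates it: precomposed "e with acute" (U+00E9).
written = "caf\u00e9.txt"
written_bytes = written.encode("utf-8")

# Stand-in for the OS storing the name in decomposed form
# (assumption: plain NFD is used here as an approximation).
read_back = unicodedata.normalize("NFD", written)
read_back_bytes = read_back.encode("utf-8")

# The application now sees a different byte sequence than the one it wrote.
names_match = (written_bytes == read_back_bytes)
```

Here written_bytes contains the two-byte sequence C3 A9 for the accented
letter, while read_back_bytes contains a plain "e" followed by CC 81, so a
byte-wise comparison of the two names fails.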

> > When they made this decision they probably considered that it might break
> > applications and decided that the applications would have to adjust.
>
> One problem is that different applications encode accented characters
> (typed on the keyboard) differently: some of them use NFC, others use
> NFD. If they aren't regarded as equivalent, you get problems. And
> since Unicode doesn't standardize which one to use, one cannot blame
> the applications.

Yes, I fully agree here.
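
As a sketch of what "regarded as equivalent" could mean in practice, the
following Python snippet compares two names after normalizing both to the
same form. same_name() is a hypothetical helper written for this example,
not something Subversion provides.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: treat names as equal if their NFC forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_name = "caf\u00e9.txt"   # one application emits the precomposed form
nfd_name = "cafe\u0301.txt"  # another emits "e" plus a combining accent

plain_equal = (nfc_name == nfd_name)              # code point comparison
normalized_equal = same_name(nfc_name, nfd_name)  # comparison after NFC
```

A plain comparison treats the two names as different, while the normalized
comparison treats them as the same file name.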
Received on 2011-06-22 16:29:14 CEST

This is an archived mail posted to the Subversion Users mailing list.