
Re: UTF-8 conversion error

From: Ryan Schmidt <subversion-2006Q1_at_ryandesign.com>
Date: 2006-04-22 14:14:37 CEST

On Apr 21, 2006, at 21:03, Aaron Montgomery wrote:

>> I can now check out the file from the subversion repository, but
>> when I run "svn status" in the directory where the files resides I
>> get:
>> ? Wien 05_12_19 - ÖVG.doc
>> ! Wien 05_12_19 - ÖVG.doc
>> The file is reported as both "not under version control" and
>> "missing". How can that be?
> I work on a text editor for Mac OS X and I know that we've had
> problems because of the way the system handles decomposed vs. non-
> decomposed Unicode characters. It is possible that SVN is expecting
> to find a decomposed Ö and you've got a non-decomposed Ö in the
> name of the file sitting in the directory. I'm not sure the best
> way to handle this. Mac OS is not very well behaved since its
> decision on how to represent UTF-8 is not the standard one (I think
> the standard says that you should use the shortest encoding and Mac
> OS X prefers to always use decomposed characters, but I'm not
> really sure). Possibly setting everything to ISO-8859 might solve
> this problem.

Yes, we had an extensive thread on this problem in December:


To summarize:

* Accented and umlauted characters have multiple valid
representations in UTF-8: "composed" (for example LATIN CAPITAL
LETTER O WITH DIAERESIS (U+00D6)) and "decomposed" (for example LATIN
CAPITAL LETTER O (U+004F) followed by COMBINING DIAERESIS (U+0308)).

* The Mac's usual HFS+ filesystem canonicalizes UTF-8 strings to the
decomposed form.

* The usual Windows and Linux filesystems, and the Subversion
filesystem, do not canonicalize, meaning, infuriatingly, you can have
two distinct files in these filesystems that both display as, for
example, "Wien 05_12_19 - ÖVG.doc": one spelled with the composed Ö
and one with the decomposed Ö.

* It seems that if you create such a filename on Windows or Linux,
you end up with the composed form.
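To make the two forms concrete, here is a small Python sketch (an illustration only, not something from the thread). It uses the standard unicodedata module; NFC and NFD are the Unicode normalization forms for "composed" and "decomposed". Treating NFD as what HFS+ does is an approximation, since HFS+ is understood to apply its own variant of decomposition.

```python
import unicodedata

composed = "\u00d6"      # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"   # LATIN CAPITAL LETTER O + COMBINING DIAERESIS


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


# Both strings render as the same glyph but are different sequences.
print(utf8_hex(composed))      # c3 96
print(utf8_hex(decomposed))    # 4f cc 88
print(composed == decomposed)  # False

# Normalizing converts one form into the other.
to_composed = unicodedata.normalize("NFC", decomposed)
to_decomposed = unicodedata.normalize("NFD", composed)
print(to_composed == composed)      # True
print(to_decomposed == decomposed)  # True
```

A byte-for-byte comparison therefore treats the two spellings as different names, even though a terminal or file browser usually draws them identically.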

The upshot of all this is that if you create a filename with such
characters on Linux or Windows and commit it to a Subversion
repository, you cannot use that file if you check out the working
copy on Mac OS X. And that bites.

The proof is in the following pudding:

On the Linux machine (Subversion 1.2.3 client and server):

        linux$ mkdir blöd
        linux$ svn import blöd https://server/repo/bl%f6d -m ""
        Committed revision 1.

On the Mac[1] (Subversion 1.3.1 client connecting to Linux 1.2.3 server):

        mac$ svn co https://server/repo
        A repo/blöd
        Checked out revision 1.
        mac$ svn st repo
        ? blo¨d
        ! blöd

Note that in my terminal it's even shown that way: the file with
decomposed characters (the way HFS+ canonicalized it) is unversioned,
and the file with composed characters (the one Subversion was
expecting) is missing.
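The same pair of status lines can be reproduced with a toy model in Python. This is only a sketch of the suspected mechanism, not Subversion's actual code: naive_status is a hypothetical helper that compares the recorded names with the names on disk exactly, and NFD stands in for the decomposition HFS+ is assumed to apply.

```python
import unicodedata


def naive_status(recorded_names, names_on_disk):
    """Compare names without any normalization (hypothetical helper).

    Names found only on disk get "?", names only in the record get "!".
    """
    recorded = set(recorded_names)
    on_disk = set(names_on_disk)
    lines = [("?", name) for name in sorted(on_disk - recorded)]
    lines += [("!", name) for name in sorted(recorded - on_disk)]
    return lines


# The name as committed from Linux: composed o-with-diaeresis.
recorded = ["bl\u00f6d"]

# Assumption: the Mac filesystem stores the decomposed spelling.
on_disk = [unicodedata.normalize("NFD", name) for name in recorded]

result = naive_status(recorded, on_disk)
for flag, name in result:
    print(flag, name)
```

With one recorded name and one directory entry, the model reports one unversioned item and one missing item, matching the shape of the "svn st" output above.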

My suggestion would be that Subversion should

* permit only a single form of a filename in the repository, possibly
canonicalized using stringprep, and

* for operations like "svn status", use stringprep to canonicalize
filenames provided by the client filesystem before comparing them to
the (already-stringprepped?) filenames in the files within the .svn
directories.
Balázs Szabó asked if this could be opened as a bug:


...but nobody answered this question and I cannot see such a bug
filed. I'll ask it again: anybody have any objection to this being
finally filed as a bug?

[1] That was with $LANG set to en_US.UTF-8 on the Mac. With $LANG set
to en_US.ISO8859-1, which is what I usually use, I can't check it out
at all:

        mac$ svn co https://server/repo
        svn: Can't check path 'repo/blöd': Invalid argument

Separate bug?
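One possible explanation for the footnote, offered as a guess rather than a diagnosis: under an ISO-8859-1 locale the client would presumably convert "ö" to the single byte 0xF6 before handing the path to the operating system, and that byte sequence is not valid UTF-8. Whether this is what actually produces "Invalid argument" on HFS+ is an assumption. The Python sketch below only shows the encoding mismatch itself.

```python
name = "bl\u00f6d"  # composed o-with-diaeresis, as stored in the repository

latin1_bytes = name.encode("iso-8859-1")  # what an ISO-8859-1 locale yields
utf8_bytes = name.encode("utf-8")         # what a UTF-8 locale yields


def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes, is_valid_utf8(latin1_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

If the filesystem insists on UTF-8 path names, the Latin-1 bytes would be rejected before any composed versus decomposed comparison could happen, which would make this a separate problem from the one above.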

Received on Sat Apr 22 14:16:09 2006

This is an archived mail posted to the Subversion Users mailing list.