On 2005.12.06., at 16:23, Paul Koning wrote:
> Without that setting (on a US Mac) things are utterly broken -- I get
> errors about inability to convert string from UTF-8 to native
> With that setting, a checkout or update with a non-English letter in
> the filename succeeds. However, past that point things are still
> badly broken, as Balázs mentioned.
Exactly. I set the environment variables according to the locale
> Test case:
> 1. On Windows, create a file á.txt (a with accent.txt). I used
> TortoiseSVN for that, though I assume that isn't critical. Commit
> 2. svn update on the Mac. The update reports that it added that new
> file, and things look reasonable. The message shows the name
> (Curiously enough, "ls" butchers the name. Bad Mac...)
> 3. Do "svn status". File á.txt is shown with status "?".
> 4. Edit á.txt. File á.txt is now shown twice, once with status ?,
> once with status M.
Yes, that is correct!
> I don't read UTF-8 coding all that well, but it looks to me like
> .svn/entries has á.txt listed with the accented a in its combined
> (0x00E1) form. And, judging from the butchered output that Mac ls
> gives me, Balázs is correct in saying that HFS+ uses the separated
> ("a" then the accent) form.
> The problem here is that both are valid.
> The point I was making earlier with my reference to the Stringprep RFC
> and a canonical encoding of UTF-8 strings is that you have to map all
> these various equivalent UTF-8 strings to a single encoding before you
> compare them.
Yes, the point is right.
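A minimal sketch of that idea in Python, assuming NFC as the single
canonical form (NFD would serve just as well). The names canonical() and
match_names() are hypothetical, made up for this illustration; this is
not how Subversion is actually implemented.

```python
import unicodedata

def canonical(name):
    # Map every canonically equivalent spelling of a name to one form (NFC here).
    return unicodedata.normalize("NFC", name)

def match_names(entries, on_disk):
    """Compare names from the entries file with names the filesystem returns.

    Returns (tracked, unversioned, missing) as sorted lists of canonical names.
    """
    versioned = {canonical(n) for n in entries}
    present = {canonical(n) for n in on_disk}
    tracked = sorted(versioned & present)
    unversioned = sorted(present - versioned)   # would be reported as "?"
    missing = sorted(versioned - present)       # would be reported as "!"
    return tracked, unversioned, missing

# entries holds the composed form, the filesystem hands back the decomposed form
tracked, unversioned, missing = match_names(["\u00e1.txt"], ["a\u0301.txt"])
print(tracked, unversioned, missing)
```

With a raw string comparison the same two names would end up as one
missing and one unversioned file, which matches the symptom in the test
case above.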
> If you do that, then "svn status" would no longer be
> confused because it would recognize the file name as given in
> "entries" and the file name returned by the HFS+ file system as
> equivalent strings, even though their raw encoding is not the same.
> Interestingly enough, things seem ok in the other direction. If I add
> a file ü.txt on the Mac, commit it, update on the Windows side, change
> it, commit that, all is well. In this case, the separated form of the
> name encoding appears in the repository and Windows doesn't appear to
> have any objection to that.
> But if I then rename the file on Windows and commit that rename,
> things are broken again for the same reason as before.
I figured out why: Windows does not care which form (composed or
decomposed) you use in filenames, while Mac OS X converts them to a
canonical form (in this case the decomposed form), no matter which form
you used before.
So on Windows, two "á" characters count as different characters if they
are encoded differently, while on OS X they are the same (on HFS+ at
least, which is now the default filesystem of OS X).
So when you check out a file whose name uses the composed form, the
filesystem converts it to the decomposed form, and SVN thinks that the
original file was removed and a new one was created.
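To make the difference concrete, here is a small Python sketch (my own
illustration, not taken from SVN or Apple) that builds both forms of
"á.txt" and compares them. It uses the standard unicodedata module; the
decomposition HFS+ applies may differ in details from standard NFD, so
treat NFD here as an approximation.

```python
import unicodedata

composed = "\u00e1.txt"       # U+00E1, the precomposed "a with acute"
decomposed = "a\u0301.txt"    # "a" followed by U+0301 COMBINING ACUTE ACCENT

# The two names look the same but are different strings and different bytes.
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa1.txt'
print(decomposed.encode("utf-8"))    # b'a\xcc\x81.txt'

# After normalizing both to the same form they compare equal.
same = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
print(same)                          # True
```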
You can try the following as well:
In an empty directory (on OSX), try:
svn add á.txt
The result will be a "? á.txt" or something like that.
But if you try:
svn add *
The result will be fine!
This test shows that the Hungarian keyboard layout of Mac OS X
produces the composed form of the character, while it is not stored in
I did some research:
"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."
The technote at
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
describes the algorithm used for composing and decomposing filenames to
and from UTF-8. It is good reading for developers.
I am now sure that this is basically a compatibility problem between
SVN and OSX.
Is there any developer here who can comment on this? Is it easy to fix?
And, first of all, is it going to be fixed? :-)
Balázs Szabó (dLux)
Received on Tue Dec 6 21:03:54 2005