Re: Encoding problems in subversion under Mac OS X (HFS+)

From: Paul Koning <pkoning_at_equallogic.com>
Date: 2005-12-06 16:23:43 CET

>>>>> "Dave" == Dave Camp <dcamp@mac.com> writes:

 Dave> I'm not a shell guru by any means, but I'm wondering if you set
 Dave> the wrong environment variable. I'm using the following in tcsh
 Dave> and I can checkin/checkout files with non-ASCII chars just
 Dave> fine.

 Dave> setenv LANG en_US.UTF-8

 Dave> I assume that for bash, setenv becomes export.

Correct.

Without that setting (on a US Mac) things are utterly broken -- I get
errors about inability to convert string from UTF-8 to native
encoding.

With that setting, a checkout or update with a non-English letter in
the filename succeeds. However, past that point things are still
badly broken, as Balázs mentioned.

Test case:

1. On Windows, create a file á.txt (a with accent.txt). I used
   TortoiseSVN for that, though I assume that isn't critical. Commit
   it.

2. svn update on the Mac. The update reports that it added that new
   file, and things look reasonable. The message shows the name
   correctly.
   (Curiously enough, "ls" butchers the name. Bad Mac...)

3. Do "svn status". File á.txt is shown with status "?".

4. Edit á.txt. File á.txt is now shown twice, once with status ?,
   once with status M.

I don't read UTF-8 coding all that well, but it looks to me like
.svn/entries has á.txt listed with the accented a in its combined
(precomposed, U+00E1) form. And, judging from the butchered output
that Mac ls gives me, Balázs is correct in saying that HFS+ uses the
separated (decomposed: "a" followed by a combining accent) form.
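To make the two forms concrete, here is a small Python sketch (Python is my choice for illustration; it has nothing to do with Subversion's own code). It builds the name both ways and prints the UTF-8 bytes of each, which is roughly what you would see when comparing .svn/entries against a directory listing on HFS+.

```python
import unicodedata

# "á" as a single precomposed code point, U+00E1.
composed = "\u00e1.txt"

# The same name decomposed into "a" + U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", composed)

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)          # b'\xc3\xa1.txt'
print(decomposed_bytes)        # b'a\xcc\x81.txt'
print(composed == decomposed)  # False: a raw comparison sees two names
```

Both byte strings are valid UTF-8 and display as the same name, but a byte-for-byte comparison treats them as different.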

The problem here is that both are valid encodings of the same name.
In fact, for some languages things get messier yet: if you have
several diacritical marks on a letter, as happens all the time in
Vietnamese, those marks can generally occur in any order.
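A sketch of the Vietnamese case, again in Python purely for illustration. The letter ệ can be written as one code point or as "e" plus two combining marks in either order. As far as I understand the Unicode rules, the reordering is only equivalent when the marks have different combining classes, as a mark below and a mark above the letter do.

```python
import unicodedata

spellings = [
    "\u1ec7",         # precomposed: e with circumflex and dot below
    "e\u0323\u0302",  # e + combining dot below + combining circumflex
    "e\u0302\u0323",  # e + combining circumflex + combining dot below
]

# Compared raw, these are three different strings.
distinct_raw = len(set(spellings))

# After normalization to NFC they collapse to a single string.
distinct_nfc = len({unicodedata.normalize("NFC", s) for s in spellings})

print(distinct_raw, distinct_nfc)  # 3 1
```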

The point I was making earlier with my reference to the Stringprep RFC
and a canonical encoding of UTF-8 strings is that you have to map all
these various equivalent UTF-8 strings to a single normalized form
before you compare them. If you did that, then "svn status" would no
longer be confused, because it would recognize the file name as given
in "entries" and the file name returned by the HFS+ file system as
equivalent strings, even though their raw encoding is not the same.
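Here is a minimal sketch of that idea in Python. It is not Subversion's code: the function names `canonical` and `unversioned` are made up for this example, and plain NFC normalization stands in for Stringprep, which does more than this (it builds on NFKC and adds mapping and prohibition tables).

```python
import unicodedata

def canonical(name):
    """Map a file name to one normalized form (NFC) for comparison only."""
    return unicodedata.normalize("NFC", name)

def unversioned(entries, on_disk):
    """Return the on-disk names that match no entry after normalization."""
    known = {canonical(n) for n in entries}
    return [n for n in on_disk if canonical(n) not in known]

entries = ["\u00e1.txt"]                # precomposed, as in .svn/entries
on_disk = ["a\u0301.txt", "notes.txt"]  # decomposed, as HFS+ returns it

# Raw comparison: the accented file looks unknown (status "?").
naive = [n for n in on_disk if n not in entries]
print(naive)

# Normalized comparison: only the genuinely unversioned file remains.
print(unversioned(entries, on_disk))
```

Note that the names themselves are left untouched; only the comparison keys are normalized, so the file system still gets back exactly the string it handed out.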

I looked briefly at the code, but it wasn't obvious to me where this
sort of thing would have to be inserted.

Interestingly enough, things seem ok in the other direction. If I add
a file ü.txt on the Mac, commit it, update on the Windows side, change
it, commit that, all is well. In this case, the separated form of the
name encoding appears in the repository and Windows doesn't appear to
have any objection to that.

But if I then rename the file on Windows and commit that rename,
things are broken again for the same reason as before. The other
oddity is that "svn update" on the Mac end adds the new file but
doesn't remove the old file (previous name) from the working
directory. Separate bug?

            paul

Received on Tue Dec 6 16:32:18 2005

This is an archived mail posted to the Subversion Users mailing list.
