Re: Encoding problems in subversion under Mac OS X (HFS+)

From: Paul Koning <pkoning_at_equallogic.com>
Date: 2005-12-03 18:01:56 CET

>>>>> "Balázs" == Balázs Szab <Bal> writes:

 Balázs> Hi, I have problems using Subversion on OSX (10.4.3). I have
 Balázs> tried a few different versions and the problem is always the
 Balázs> same.

 Balázs> I have checked out a repository, which I created on Linux,
 Balázs> and it contained filenames like "statisztikák.sxc"

 Balázs> I set up the environment before I did anything:

 Balázs> export LC_CTYPE="hu_HU.UTF-8"

 Balázs> The checkout worked fine, but right after the checkout, I had
 Balázs> the following output for svn status (SVN 1.3RC4, but the
 Balázs> results are similar with 1.2.3 as well):

 Balázs> ? statisztikák.sxc
 Balázs> ! statisztikák.sxc

 Balázs> The problem may be that (as I read elsewhere) HFS+ stores
 Balázs> filenames in decomposed form, and since "á" has different
 Balázs> UTF-8 encodings in composed and decomposed form, SVN thinks
 Balázs> the file on disk differs from the one just checked out...

That sounds plausible. This problem can appear any time you deal with
strings that go beyond plain ASCII text -- accented letters, for example.

There's a standard solution designed in the IETF called Stringprep
(it's an RFC, I don't have the number handy). Basically it involves
translating the string into a single "canonical" form, so that no
matter which representation you start with (composed or decomposed,
for instance), after Stringprep there is only one possible outcome.
Think of it as the Unicode analog of case-insensitive comparison.

So in order to compare Unicode strings, you first run both through
Stringprep, and after that you compare them. That way, two strings
that are the same to the user will also be the same to the program,
and any irrelevant transformations done in storing the strings (like
the HFS+ one) will not confuse things.
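
To make the normalize-then-compare idea concrete, here is a minimal
sketch in Python. It is not a full Stringprep profile and not what
Subversion does; it only applies Unicode normalization (NFC) from the
standard unicodedata module before comparing. The helper name
same_name is made up for illustration, and the two strings are my
reconstruction of the composed and decomposed spellings of the
filename from the report.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "á" as a single code point (U+00E1), as the Linux side likely stored it.
composed = "statisztik\u00e1k.sxc"
# "a" followed by a combining acute accent (U+0301), the decomposed form.
decomposed = "statisztika\u0301k.sxc"

# A naive comparison sees two different strings...
print(composed == decomposed)
# ...with different UTF-8 byte sequences on disk.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
# After normalization the two names compare equal.
print(same_name(composed, decomposed))
```

NFC is used here only because it is the simplest choice for an
equality test; normalizing both sides to NFD would work just as well,
as long as both strings go through the same form.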


Received on Sat Dec 3 18:04:38 2005

