Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Marc Haisenko <haisenko_at_comdasys.com>
Date: 2007-07-17 16:41:52 CEST

On Tuesday 17 July 2007 16:06, you wrote:
> Can somebody give a brief overview of this issue from square one,
> aimed at folks who aren't incredibly familiar with the details of
> Unicode? If we're proposing adding a new dependency it would be nice
> if more people understood the issue. (I gather it relates to the
> question of whether u-with-umlaut is stored as a thingy that says "u
> with umlaut" vs a thingy that says "the next letter has an umlaut"
> followed by a "u".)
>
> --dave

Yes, that is essentially it, except that the order is the other way round: the
plain letter comes first and the umlaut mark follows it. I'll start from the
beginning (but beware: I'm by no means an expert; I hope I get it right).

Suppose you want to encode the character "ü" (umlaut-u). Unicode offers two
ways to do it. The "composed" form is a single code point: umlaut-u (U+00FC).
The "decomposed" form is two code points: a plain Latin "u" (U+0075) followed
by a combining diaeresis (U+0308), which means "add an umlaut to the preceding
character". A normal strcmp will say that the two representations are
different because they are two different byte sequences.
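
To make the difference concrete, here is a small sketch. Python and its
standard unicodedata module are my own choice for illustration (nothing in
this thread uses them), and the variable names are made up. It shows the
UTF-8 bytes of both forms and that a byte-wise comparison, which is what
strcmp does, treats them as different:

```python
import unicodedata

# Composed form: one code point, U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
composed = "\u00fc"
# Decomposed form: plain "u" (U+0075) followed by U+0308 COMBINING DIAERESIS.
decomposed = "u\u0308"

# UTF-8 encodings of the two forms.
composed_bytes = composed.encode("utf-8")      # 2 bytes: C3 BC
decomposed_bytes = decomposed.encode("utf-8")  # 3 bytes: 75 CC 88

# A byte-wise comparison (the moral equivalent of strcmp) sees two
# different strings, even though both render as the same character.
bytes_equal = composed_bytes == decomposed_bytes

print(composed_bytes.hex(" "), "vs", decomposed_bytes.hex(" "))
print("byte-wise equal:", bytes_equal)
```

Both strings display as "ü", yet the comparison reports that they differ.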

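Both forms have their advantages and disadvantages. Windows usually stores
Unicode filenames in "composed" form, while Mac OS X stores filenames in
"decomposed" form. This leads to problems as soon as the same filename travels
between the two: the names look identical but do not compare equal byte for
byte.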

There are various solutions to this problem, but they all more or less require
some "real" Unicode handling: either use a string comparison that doesn't
complain when one string is composed and the other decomposed, or normalize
each and every filename to an agreed-on representation. As far as I can see
the latter is probably less error-prone (filenames enter the system at only a
few well-known points) and requires fewer code changes. This is especially
true if the composed representation is chosen, as that already works on
Windows and Linux, so "only" the Mac OS X client would need to normalize
filenames as well.
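
A minimal sketch of the normalization approach, again in Python purely for
illustration. The helper names normalize_filename and same_filename are
hypothetical, not existing Subversion functions, and I am assuming that the
standard NFC form is what "composed" would mean in practice:

```python
import unicodedata


def normalize_filename(name: str) -> str:
    """Return the composed (NFC) form of a filename (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


def same_filename(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to NFC."""
    return normalize_filename(a) == normalize_filename(b)


# The same visible name as a composed and a decomposed string.
composed_name = "M\u00fcnchen.txt"
decomposed_name = "Mu\u0308nchen.txt"

# A plain comparison fails; comparing the normalized forms succeeds.
plain_equal = composed_name == decomposed_name
normalized_equal = same_filename(composed_name, decomposed_name)

print("plain comparison:     ", plain_equal)
print("normalized comparison:", normalized_equal)
```

The point of normalizing at the few places where filenames enter the system
is that everything downstream can keep using ordinary byte-wise comparison.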

Marc

-- 
Marc Haisenko
Comdasys AG
Rüdesheimer Straße 7
D-80686 München
Tel:   +49 (0)89 - 548 433 321
e-mail: haisenko@comdasys.com
http://www.comdasys.com

Received on Tue Jul 17 16:41:05 2007
