
Re: Standardizing on UTF8 internally isn't enough

From: steveking <tortoisesvn_at_gmail.com>
Date: 2007-07-17 22:55:25 CEST

Erik Huelsmann wrote:
> This is the mail about the second issue, which actually is a Unicode
> issue but happens to show up on Macs more than elsewhere.
>
> Why isn't it enough to standardize on UTF-8? Doesn't UTF-8 have 1
> byte sequence per Unicode codepoint? Surely it does, but Unicode itself
> has 2 ways to write some characters. Take for example u-umlaut.
> There's a Unicode codepoint which says 'u-with-umlaut'. But Unicode
> also has a combining codepoint which means
> 'previous-character-has-umlaut'. Now, the u-with-umlaut can be
> written in 2 ways:
>
> 1) 'u-with-umlaut' codepoint (1 codepoint)
> 2) 'u', 'previous-char-has-umlaut' codepoints (2 codepoints)
>
> (1) and (2) mean the same, but lead to different byte sequences in
> *all* Unicode encodings and UTF-8 is no exception.
>
> Conclusion: even though we standardized on Unicode to eliminate
> character encoding problems, we now have new character encoding
> problems.
>
> (1) is known as composed (or pre-composed) and (2) is known as decomposed.
>
> Why is this problem mainly showing up on the Mac? Well, Linux and
> Windows have standardized (without enforcing it) on (1), which is known
> under the acronym NFC (Normalization Form C, canonical composition),
> and the Mac has standardized on (2), NFD. The Mac, however, enforces
> this convention, meaning that when you create a file with a name in NFC
> form and then read back the directory entries, you get a file name
> which looks exactly the same but has a different byte sequence (NFD).
>
> That last bit utterly confuses Subversion, because it has a list of
> files to expect in its admin data, which now doesn't correspond
> anymore with the names it gets back from the OS.
>
>
> Because we defined we'd be using UTF-8, but didn't define we'd be
> using NFC or NFD, we now have a problem: we defined where we wanted
> the door in our house, but not what its size would be :-)
>
> We are not the only project with this problem however (and note that
> it's not the Mac which causes this problem, using Unicode is): IBM
> created the ICU project (http://icu-project.org/) to address all kinds
> of i18n problems including Unicode normalization, collation etc.
>
> This problem could be solved by adding the ICU lib as a dependency
> and changing all path comparisons to use the ICU
> normalization-form-agnostic comparison routines.
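
To make the composed/decomposed difference concrete, here is a minimal
sketch. It uses Python's standard unicodedata module rather than ICU, so
it only illustrates the idea of a normalization-agnostic comparison; the
helper name same_name is made up for this example and is not an ICU or
Subversion function.

```python
import unicodedata

# The same visible character, written in the two ways described above.
composed = "\u00fc"     # U+00FC LATIN SMALL LETTER U WITH DIAERESIS (NFC)
decomposed = "u\u0308"  # 'u' followed by U+0308 COMBINING DIAERESIS (NFD)

# The UTF-8 byte sequences differ even though both render as u-umlaut.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed_bytes, decomposed_bytes)
    print(composed == decomposed)           # plain comparison
    print(same_name(composed, decomposed))  # normalization-agnostic comparison
```

A plain string (or byte) comparison treats the two spellings as different
names, while the normalizing comparison treats them as equal.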

I don't think that ICU would solve all problems here, because the
problem is clearly OS dependent. For example, "NTFS does no Unicode
normalization at all" (1), so it is possible to have two files with the
same name (but different composition). If you enforced either the
composed or the decomposed form, you'd simply have shifted the problem
from the Mac to Windows.

I would suggest either working around this issue in apr (or apr-iconv),
or handling Macs separately in the Subversion code. Sure, they will then
have (almost) the same problem as Windows has now (can't check out
because of two files with the same name but different case - on the Mac
that would be "can't check out because of two files with the same name
but different composition").
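
As a rough illustration of that failure mode, the sketch below groups a
list of names by their NFC form and reports the groups that would end up
as one name on a normalizing filesystem. It is written in Python with
the standard unicodedata module; the function name
find_normalization_collisions is invented for this example and does not
exist in Subversion or apr.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Return groups of distinct names that share the same NFC form.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Only groups with more than one distinct spelling are collisions.
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    # Two spellings of the same visible name, plus an unrelated file.
    listing = ["f\u00fcr.txt", "fu\u0308r.txt", "README"]
    for group in find_normalization_collisions(listing):
        print([name.encode("utf-8") for name in group])
```

On a filesystem that does no normalization both spellings can coexist,
which is exactly the case a normalizing checkout target cannot represent.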

Also interesting to read would be (2).

Stefan

(1) http://blogs.msdn.com/michkap/archive/2006/09/10/748699.aspx
     (in a comment below the blog post)
(2) http://blogs.msdn.com/michkap/archive/2006/06/05/617447.aspx

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net
Received on Tue Jul 17 22:54:41 2007

This is an archived mail posted to the Subversion Dev mailing list.
