Re: Let's discuss about unicode compositions for filenames!

From: Hiroaki Nakamura <hnakamur_at_gmail.com>
Date: Wed, 1 Feb 2012 00:11:26 +0900

* HFS+ is the default file system on Mac OS X, so we must support it.
  Forcing users to reformat their HDD and use another file system is not an
  option. It is much worse than merely upgrading Subversion working copies.

* As for similarity to case sensitivity, there is a critical difference:
  case is preserved on all of Linux, Mac, and Windows.

  According to http://en.wikipedia.org/wiki/Filename
  case sensitivity on Windows NTFS and Mac OS X HFS+ is optional, but
  disabled by default. So we cannot have two filenames that differ only
  in case (again, it is not realistic for users to reformat their HDD
  and turn on the option).

  So the behavior is very natural. If we have "readme" on Windows, then we
  have "readme" on Mac too. We cannot have "readme" and "README" at the same
  time, but that sounds normal to users in both camps.

* "a" and "A": different characters, different looks.
               Both are easy to type in. Both are widely used.
  NFC and NFD: the same abstract characters, almost the same looks (*1)
               NFC is easy to type in. NFD is hard to type in (*2)
               On Windows, NFC is widely used; NFD is almost never used.
               On Mac, NFD is used only as the internal encoding of HFS+.
               Everything else is NFC.
  (*1) They look the same in Explorer, but different in Command Prompt.
       Actually a Japanese NFD filename looks very weird in Command Prompt:
       there is too much space between the base character and the
       combining character. See the attached screenshot.
  (*2) I don't know of a way to type in NFD with a Japanese IME.

  http://unicode.org/reports/tr15/
  > The Unicode Standard defines two equivalences between characters:
  > canonical equivalence and compatibility equivalence. Canonical
  > equivalence is a fundamental equivalency between characters or
  > sequences of characters that represent the same abstract character,
  > and when correctly displayed should always have the same visual
  > appearance and behavior.
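
  To make the NFC/NFD difference concrete, here is a minimal sketch in
  Python using the standard unicodedata module. The choice of the
  hiragana letter GA as the sample character is mine, just for
  illustration. It shows that the two forms are different code point
  sequences, yet normalize to the same string.

```python
import unicodedata

# Precomposed form of HIRAGANA LETTER GA: a single code point, U+304C.
nfc = unicodedata.normalize("NFC", "\u304c")
# Decomposed form: HIRAGANA LETTER KA (U+304B) followed by
# COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (U+3099).
nfd = unicodedata.normalize("NFD", "\u304c")

# Print the code points of each form.
print([hex(ord(c)) for c in nfc])
print([hex(ord(c)) for c in nfd])

# A plain string comparison treats the two forms as different names.
print(nfc == nfd)
# After normalizing both to one form, they compare equal.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

  A plain comparison like the first one is what makes the two forms look
  like two different filenames to software, even though users see the
  same character.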

* As for NFC/NFD, Windows NTFS can have both the NFC and NFD forms of
  the same filename.
  However, we don't actually do that, because it leads to confusion.
  Different cases look different to our eyes, but NFC/NFD differences
  are hard to detect. They look the same to casual users.
  Since it is very rarely necessary to have both the NFC and NFD forms
  of a filename, we can just treat it as an error and let users manually
  rename first and try again.
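
  The "treat it as an error" idea could be sketched like this. It is only
  an illustration, not Subversion code: the function name
  find_normalization_collisions and the sample names are hypothetical.
  It groups names by their NFC form and reports any group with more than
  one variant.

```python
import unicodedata
from collections import defaultdict

def find_normalization_collisions(names):
    """Return groups of names that differ as code point sequences
    but share the same NFC form (hypothetical helper)."""
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    # Only groups with two or more distinct variants are collisions.
    return [sorted(variants) for variants in groups.values()
            if len(variants) > 1]

# Sample directory listing: an ASCII name plus the NFC and NFD
# spellings of the same Japanese filename.
names = ["readme.txt", "\u304c.txt", "\u304b\u3099.txt"]
collisions = find_normalization_collisions(names)

for group in collisions:
    print("conflicting names:", [n.encode("unicode_escape") for n in group])
```

  A client could refuse to add or check out the names in such a group
  and ask the user to rename one of them first.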

* Mac OS X HFS+ can store only NFD filenames. So if we use a fictitious
  example in analogy to case differences, it goes something like this:

  Here we suppose NFC is lower case, and NFD is upper case.
  Windows and Linux can have both forms, like "readme" and "CHANGES".
  However, usually we use only lower case, like "readme" and "changes",
  because it is just easy to type in lower case. We can type in upper
  case, but we need a very special skill to do that. Maybe casual users
  cannot use the SHIFT or CAPS key :-)

  Also, casual users don't bother to type in upper case, because
  "readme" and "README" look exactly the same to us. (Of course they do
  not in reality, but they do in this fictitious example.)

  If we create and check in "readme" on Windows, and then check out "readme"
  on Mac, it becomes "README". That is OK for us, because it is normal:
  on Mac we always create filenames like "README", and we see "readme" and
  "README" as the same thing, so it doesn't matter.

  If we create and check in "CHANGES" on Mac, then we check out "CHANGES"
  on Windows. It looks almost the same as "changes", but it looks a bit
  weird and feels unusual.
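
  Without the analogy, the round trip can be sketched as follows. This
  is a simplified model under stated assumptions: store_on_hfs_plus is a
  hypothetical stand-in that applies plain NFD, whereas real HFS+ uses
  its own variant of decomposition, and same_file is a hypothetical
  comparison helper.

```python
import unicodedata

def store_on_hfs_plus(name):
    # Assumption: model HFS+ as plain NFD normalization. The real file
    # system applies its own variant of Unicode decomposition.
    return unicodedata.normalize("NFD", name)

def same_file(a, b):
    # Compare two names after bringing both to a single form (NFC here).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Name as checked in from Windows, in NFC.
repo_name = "\u304c.txt"
# Name as a directory listing on Mac would return it in this model.
disk_name = store_on_hfs_plus(repo_name)

# A code-point-by-code-point comparison sees two different names.
naive_match = repo_name == disk_name
# A normalization-aware comparison sees the same file.
normalized_match = same_file(repo_name, disk_name)

print(naive_match, normalized_match)
```

  In this model the name on disk no longer matches the name in the
  repository unless the comparison normalizes both sides first.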

--
)Hiroaki Nakamura) hnakamur_at_gmail.com

NFD_looks_weird_on_Command_Prompt.png
Received on 2012-01-31 16:12:02 CET

This is an archived mail posted to the Subversion Dev mailing list.