Re: Let's discuss about unicode compositions for filenames!

From: Stefan Sperling <stsp_at_elego.de>
Date: Mon, 30 Jan 2012 13:30:09 +0100

On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
> Hi folks!
>
> I read the note about unicode compositions for filenames
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
> and would like to drive the discussion.

Hi,

I am very happy to hear that you want to work towards getting this
problem fixed. Thank you for your help!

I've just re-read the unicode-composition-for-filenames notes.
I think they are a bit outdated. For instance, they still talk about
the 1.6 working copy format. They also don't clearly explain the problems
with backwards compatibility we're facing here.

We won't be able to apply your patch as it is. The problem is that
it can break operation for some existing repositories and working
copies.

Generally, I think that writing code that implements a solution for
this problem is not hard, no matter what the solution is.
The real challenge lies in finding a solution that is backwards
compatible with existing repositories and working copies.

I will explain what I mean by giving examples below.
But first, let's recap the basic problem, if only so others can more
easily follow this discussion.

As you know, in Unicode, some characters can be represented in two distinct
ways: pre-composed form (NFC) and de-composed form (NFD).
For instance, the letter ä (a umlaut) can be represented by Unicode
code point 0x00E4 ( ä ), which is the pre-composed form, or by code
point 0x0061 ( a ) followed by code point 0x0308 ( ̈ ), which is the
de-composed form.
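
To make the two representations concrete, here is a minimal Python sketch
using the standard unicodedata module. It is only an illustration, not
Subversion code, and the variable names are my own.

```python
import unicodedata

composed = "\u00e4"       # pre-composed: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

# Both strings render as the same glyph but are different code point sequences.
assert composed != decomposed

# Normalisation converts one form into the other.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The UTF-8 byte strings differ as well, which is what a byte-wise
# filename comparison actually sees.
composed_utf8 = composed.encode("utf-8")      # b'\xc3\xa4'
decomposed_utf8 = decomposed.encode("utf-8")  # b'a\xcc\x88'
```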

This is a basic property of Unicode. It simply contains both ways of
representing these characters in its character tables.
I.e. any byte-string representation of Unicode, be it UTF-8 or UTF-16,
must also be able to represent both ways of encoding such characters.
So when filenames are given in Unicode, a filename may contain any
combination of NFC and NFD characters.

Because Subversion never normalises filenames to one form or the other,
the space of all possible filenames in a Subversion repository or working
copy contains a large amount of redundancy. There are many filenames which
look the same to the user but differ in terms of the Unicode code points
used to represent them.

For instance, imagine a filename containing 3 "a umlaut" characters
and otherwise only characters from the ASCII set.
There are 8 (2^3) different ways of representing this filename in Unicode,
and hence 8 different UTF-8 byte strings which can be used in the repository
or working copy to represent what is, from the user's point of view,
the same filename.
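
The count can be checked with a short Python sketch. The helper
all_spellings is hypothetical and deliberately simplified: it assumes each
character has at most two spellings (its NFC form and its NFD form), which
holds for "a umlaut" but not for every Unicode character.

```python
import itertools
import unicodedata

def all_spellings(name):
    """Return every mix of composed/decomposed spellings of name's characters."""
    choices = []
    for ch in unicodedata.normalize("NFC", name):
        nfd = unicodedata.normalize("NFD", ch)
        # A character with a distinct decomposition contributes two options.
        choices.append((ch,) if nfd == ch else (ch, nfd))
    return {"".join(parts) for parts in itertools.product(*choices)}

# A filename with 3 "a umlaut" characters and otherwise only ASCII.
spellings = all_spellings("\u00e4\u00e4\u00e4.txt")
utf8_forms = {s.encode("utf-8") for s in spellings}

# All spellings collapse to a single name once normalised.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
```

Here spellings and utf8_forms each hold 8 entries, while nfc_forms holds
exactly one.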

The problem we have on Mac OS X is that when we write any of these
8 different byte strings to the filesystem to name the file, and later
read the filename back from the filesystem (e.g. by opening the parent
directory and asking for a list of files it contains), we will always
receive the name with all "a umlaut" characters expanded to de-composed
form.

Now, in the working copy meta data (.svn/wc.db) we can use any of the 8 forms
of the filename. If we don't use NFD for all characters in the filename,
the filename read from disk may fail to match any name stored in meta data.
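
The mismatch can be simulated without a Mac. In the sketch below,
read_back_from_disk is a hypothetical stand-in for a directory listing on
Mac OS X, approximated here by plain NFD normalisation (the real
filesystem uses its own variant of the decomposition rules).

```python
import unicodedata

def read_back_from_disk(name):
    """Hypothetical stand-in for the name a Mac OS X directory listing returns."""
    return unicodedata.normalize("NFD", name)

def same_name(a, b):
    """Compare two names after normalising both to one form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

stored_nfc = "B\u00e4r.txt"     # pre-composed spelling recorded in meta data
stored_nfd = "Ba\u0308r.txt"    # de-composed spelling recorded in meta data

on_disk = read_back_from_disk(stored_nfc)

nfc_exact_match = on_disk == stored_nfc      # False: code points differ
nfd_exact_match = on_disk == stored_nfd      # True: same code points
normalised_match = same_name(on_disk, stored_nfc)  # True after normalising
```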

Let's simplify the discussion a bit by assuming only two possible ways
of encoding a filename: One with all characters normalised to NFC, and
one with all characters normalised to NFD. We don't really need to
consider the mixed forms for the purpose of this discussion (though it
helps to keep in mind that they exist).

So let's talk about what would happen if we applied your patch.

Let's say I have a working copy which contains filenames normalised
to NFD, as is the case on Mac OS X. The server gets upgraded to a new
release of Subversion which contains your patch. This means the server
will now send all paths as NFC. Let's say there are changes made to a
file which has 3 "a umlaut" characters in its name. When I run 'svn update'
my client will try to find the NFC form of the name in its meta-data,
and fail to locate it because the file was stored as NFD.

So this means your patch will break compatibility with the working copy.
Therefore, we would need to provide an upgrade path for those working
copies. E.g. 'svn upgrade' could be modified to normalise all filenames
stored in the DB to NFC. Problem solved.

But now comes the next problem. Given a filename in NFC which we read from
meta data, how can we locate the corresponding on-disk file if its form
is not NFC? We could of course rename the on-disk file. Except this
won't work on Mac OS X unless we decide to use NFD encoding. So we could
decide to also use NFD everywhere -- but this would break as soon as
some other operating system decides to normalise to NFC, so it's not a
good solution. We could also open the parent directory, read all the
filenames within it, normalise them all, and then search the resulting
list. This works, except if a name exists twice, once in NFC form and once
in NFD form. We'd somehow have to solve the name collision in the
filesystem.
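
A rough Python sketch of the directory-scan approach follows. The names
match_names and find_on_disk are hypothetical; the sketch only shows where
the collision case surfaces and makes no attempt to resolve it.

```python
import os
import unicodedata

def match_names(disk_names, wanted_name):
    """Return every on-disk name whose NFC form equals that of wanted_name."""
    wanted = unicodedata.normalize("NFC", wanted_name)
    return [n for n in disk_names if unicodedata.normalize("NFC", n) == wanted]

def find_on_disk(parent_dir, wanted_name):
    """Scan parent_dir for wanted_name, ignoring the normalisation form."""
    matches = match_names(os.listdir(parent_dir), wanted_name)
    if len(matches) > 1:
        # The same visible name exists in more than one form.
        raise LookupError(f"ambiguous name {wanted_name!r}: {matches!r}")
    return matches[0] if matches else None

# Simulated directory listing: the file is stored in de-composed form.
listing = ["README", "Ba\u0308r.txt"]
unique = match_names(listing, "B\u00e4r.txt")

# The same listing with an additional pre-composed twin is ambiguous.
ambiguous = match_names(listing + ["B\u00e4r.txt"], "B\u00e4r.txt")
```

Note that the scan costs one directory read per lookup, which is another
thing a real implementation would have to weigh.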

But well, let's assume we had a way of storing NFC in meta-data and not
caring about the on-disk form. Now things get even more complicated.

My friend is not willing to upgrade to a new client version yet, which
is fine because all 1.x releases of Subversion clients are supposed
to be compatible with all 1.y releases of Subversion servers. He should
not have to upgrade his client just because the server was upgraded.

In his working copy, the file name is also in NFD form. When he
talks to the server, the server provides the name in NFC. Because he
is using the old client the client has no way of knowing how to map
the NFC name to its local NFD file. So we've broken backwards
compatibility for my friend.

But it gets worse. Recall the filesystem name collision problem
mentioned above. This problem can also happen in the repository
filesystem! For instance, assume that in the repository there already
exist two filenames, one NFD, the other NFC, but they both are actually
the same name. This currently works fine, except on Mac OS X.
What should be done now when the server is upgraded to normalise all paths
to NFC? How can we still access content of the file which has the name
in NFD form? Should one of the files be renamed in the HEAD revision?
Or all historic revisions? Or removed from history? How do we help users
carrying out such upgrades, without breaking existing working copies used
by older clients which do not know anything about the NFC/NFD problem?
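
Detecting such collisions is the easy part. As a sketch, assuming we
already have a plain list of repository paths (the example paths and the
helper find_collisions below are made up for illustration), grouping by
NFC form is enough to find them. Deciding what to do with them is the
open question.

```python
import unicodedata
from collections import defaultdict

def find_collisions(paths):
    """Group paths by NFC form; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].add(path)
    return {nfc: sorted(names) for nfc, names in groups.items()
            if len(names) > 1}

# Hypothetical repository contents.
repo_paths = [
    "trunk/B\u00e4r.txt",     # NFC spelling
    "trunk/Ba\u0308r.txt",    # NFD spelling of the same visible name
    "trunk/README",
]
collisions = find_collisions(repo_paths)
```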

These are the questions which we'll need to answer to solve this issue.
I honestly do not have good answers. I hope that you will find ways of
solving these problems.

There may even be more problems hidden here which I haven't thought of yet.
It will be quite hard to thoroughly make sure that no unforeseen problems
will arise when this issue gets fixed one way or another. A good solution
needs to be carefully planned, implemented, and thoroughly tested.

I think the following caveats would be acceptable if they help
with fixing the issue:

- An upgrade path which optionally requires people to check all
   working copies out again, when either the server or the client is upgraded.
   Note again, this must be _optional_. Only people affected by the issue
   should have to make this choice, e.g. by changing configuration
   parameters from the default settings. By default, existing working
   copies should keep working after upgrading the client or server.
   Because imagine what would happen if an upgrade of the server broke
   many working copies checked out from a hosting service such as
   sourceforge.net -- not good.

- An upgrade path which requires everyone to run 'svn upgrade' on their
   working copies in order to use the new client version, but not the
   new server version.

- An upgrade path which requires people to dump/load their existing
   repositories in order to get rid of the problem. Existing
   repositories which are left alone should keep working as they do
   today, with problems on Mac OS X clients but no problems on other
   clients (anything else would cause too much breakage and confusion).
   E.g. this step could normalise all paths in all revisions. But keep in
   mind the problem of name collisions which can happen when the same name
   exists as both NFC and NFD. Something needs to happen in this case to
   resolve the problem, ideally giving users a choice about how to proceed.

As you can see, there is a lot of complexity involved in fixing this
issue. I hope you aren't discouraged by this. Someone will need to
explore the details of these problems to fix this issue. I am not convinced
that it is impossible to fix. We'll need to be very careful about backwards
compatibility when making decisions. But there might be ways to achieve a
satisfying solution nonetheless.
Received on 2012-01-30 13:30:49 CET
