Re: Let's discuss about unicode compositions for filenames!

From: Thomas Åkesson <thomas_at_akesson.cc>
Date: Thu, 9 Feb 2012 02:57:58 +0100

I have been interested in this issue for a couple of years and I remember it was discussed briefly at Subconf in Germany a couple of years ago.

Branching the thread here because I'd like to propose a different approach than Hiroaki. This proposition is not very different from the note "unicode-composition-for-filenames" or what Peter S, Neels and others suggested, perhaps just combining 2 changes slightly differently.

This is based on my limited understanding of WC-NG, please correct me if I make incorrect assumptions.

- The server will still accept both NFC and NFD, but it will no longer accept collisions. This is enforced by normalising to NFD before uniqueness checks during add operations (yes, this might be more expensive). There will be no unified normalisation; the Subversion server will work like most filesystems and return what was given to it.

- The WC currently has a column containing the name as stored on the server, I assume. This column will be kept, and an additional column will be added that contains the name in normalised form. This form will be NFD on all platforms, unless a platform is found that normalises to NFC. The new column will be used on Mac OS X to identify files, and on all platforms to ensure normalised uniqueness.
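
A minimal sketch of the uniqueness check in the first point, written in Python for brevity rather than Subversion's C. The `Directory` class and its methods are hypothetical and only illustrate the idea: names are stored and returned exactly as given, while an NFD-normalised key is used to reject collisions on add.

```python
import unicodedata


class Directory:
    """Hypothetical directory node: keeps names as given, keyed by NFD form."""

    def __init__(self):
        # Maps NFD key -> name exactly as the client supplied it.
        self._entries = {}

    def add(self, name):
        key = unicodedata.normalize("NFD", name)
        if key in self._entries:
            raise ValueError(
                "%r collides with existing entry %r" % (name, self._entries[key])
            )
        self._entries[key] = name

    def names(self):
        # Returned unchanged: no unified normalisation is imposed.
        return sorted(self._entries.values())


d = Directory()
d.add("\u00e4.txt")            # NFC form, accepted and stored as-is
try:
    d.add("a\u0308.txt")       # same visible name in NFD form, rejected
except ValueError as exc:
    print(exc)
```

The extra cost mentioned above is the one normalisation per added name; existing entries are never rewritten.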

Preliminary analysis of side-effects below. Developers who want to test both NFC and NFD names will still be able to do so, just not with both forms of the same name in the same directory.

On 30 jan 2012, at 13:30, Stefan Sperling <stsp_at_elego.de> wrote:

> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
>> Hi folks!
>> I read the note about unicode compositions for filenames
>> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
>> and would like to drive the discussion.
> Hi,
> I am very happy to hear that you want to work towards getting this
> problem fixed. Thank you for your help!
> I've just re-read the unicode-composition-for-filenames notes.
> I think they are a bit outdated. For instance, they still talk about
> the 1.6 working copy format. They also don't clearly explain the problems
> with backwards compatibility we're facing here.
> We won't be able to apply your patch as it is. The problem is that
> it can break operation for some existing repositories and working
> copies.
> Generally, I think that writing code that implements a solution for
> this problem is not hard, no matter what the solution is.
> The real challenge lies in finding a solution that is backwards
> compatible with existing repositories and working copies.
> I will explain what I mean by giving examples below.
> But first, let's recap the basic problem, if only so others can more
> easily follow this discussion.
> As you know, in Unicode, some characters can be represented in two distinct
> ways: pre-composed form (NFC) and de-composed form (NFD).
> For instance, the letter ä (a umlaut) can be represented by Unicode
> code point 0x00E4 ( ä ), which is the pre-composed form, or by code
> point 0x0061 ( a ) followed by code point 0x0308 (  ̈ ), which is the
> de-composed form.
> This is a basic property of Unicode. It simply contains both ways of
> representing these characters in its character tables.
> I.e. any byte-string representation of Unicode, be it UTF-8 or UTF-16,
> must also be able to represent both ways of encoding such characters.
> So when filenames are given in Unicode, a filename may contain any
> combination of NFC and NFD characters.
> Because Subversion never normalises filenames to one form or the other,
> the space of all possible filenames in a Subversion repository or working
> copy contains a large amount of redundancy. There are many filenames which
> look the same to the user but differ in terms of the Unicode code points
> used to represent them.
> For instance, imagine a filename containing 3 "a umlaut" characters
> and otherwise only characters from the ASCII set.
> There are 8 (2^3) different ways of representing this filename in Unicode,
> and hence 8 different UTF-8 byte strings which can be used in the repository
> or working copy to represent what is, from the user's point of view,
> the same filename.
> The problem we have on Mac OS X is that when we write any of these
> 8 different byte strings to the filesystem to name the file, and later
> read the filename back from the filesystem (e.g. by opening the parent
> directory and asking for a list of files it contains), we will always
> receive the name with all "a umlaut" characters expanded to de-composed
> form.
> Now, in the working copy meta data (.svn/wc.db) we can use any of 8 forms
> of the filename. If we don't use NFD for all characters in the filename,
> the filename read from disk may fail to match any name stored in meta data.
> Let's simplify the discussion a bit by assuming only two possible ways
> of encoding a filename: One with all characters normalised to NFC, and
> one with all characters normalised to NFD. We don't really need to
> consider the mixed forms for the purpose of this discussion (though it
> helps to keep in mind that they exist).
> So let's talk about what would happen if we applied your patch.
> Let's say I have a working copy which contains filenames normalised
> to NFD, as is the case on Mac OS X. The server gets upgraded to a new
> release of Subversion which contains your patch. This means the server
> will now send all paths as NFC. Let's say there are changes made to a
> file which has 3 "a umlaut" characters in its name. When I run 'svn update'
> my client will try to find the NFC form of the name in its meta-data,
> and fail to locate it because the file was stored as NFD.

Ok. Server will not change in this regard.
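
The mismatch described in the quoted text can be reproduced in a few lines. This is only an illustration in Python using the standard `unicodedata` module; the `variants` helper and the `.txt` suffix are made up for the example.

```python
import itertools
import unicodedata

NFC_A_UMLAUT = "\u00e4"    # pre-composed: U+00E4
NFD_A_UMLAUT = "a\u0308"   # de-composed: U+0061 followed by U+0308


def variants(count):
    """All spellings of a name with `count` a-umlauts and an ASCII suffix."""
    return [
        "".join(choice) + ".txt"
        for choice in itertools.product([NFC_A_UMLAUT, NFD_A_UMLAUT], repeat=count)
    ]


names = variants(3)

# Distinct UTF-8 byte strings for what looks like one filename.
print(len({n.encode("utf-8") for n in names}))

# All of them collapse to a single name once normalised.
print(len({unicodedata.normalize("NFD", n) for n in names}))

# A working copy holding the NFD form cannot find the NFC form by
# plain string comparison.
stored = unicodedata.normalize("NFD", names[0])
sent = unicodedata.normalize("NFC", names[0])
print(stored == sent)
```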

> So this means your patch will break compatibility with the working copy.
> Therefore, we would need to provide an upgrade path for those working
> copies. E.g. 'svn upgrade' could be modified to normalise all filenames
> stored in the DB to NFC. Problem solved.

'svn upgrade' would create and populate the new column.

> But now comes the next problem. Given a filename in NFC which we read from
> meta data, how can we locate the corresponding on-disk file if its form
> is not NFC?

Platforms known not to normalise would use the current name column. Mac and any other normalising platform would use the new column.
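
A rough sketch of that lookup, using an in-memory SQLite table from Python. The table and column names (`nodes`, `name`, `name_nfd`) and the `fs_normalises` flag are assumptions made for the example and do not reflect the real wc.db schema.

```python
import sqlite3
import unicodedata


def nfd(name):
    return unicodedata.normalize("NFD", name)


db = sqlite3.connect(":memory:")
# Hypothetical schema: `name` is the name as stored on the server,
# `name_nfd` is the additional normalised column. UNIQUE on `name_nfd`
# gives normalised uniqueness on every platform.
db.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, name_nfd TEXT UNIQUE)")


def record(name):
    db.execute("INSERT INTO nodes VALUES (?, ?)", (name, nfd(name)))


def find(on_disk_name, fs_normalises):
    """Map a name read from disk back to the name stored on the server."""
    if fs_normalises:
        # Mac OS X and any other normalising platform: match on the new column.
        query, key = "SELECT name FROM nodes WHERE name_nfd = ?", nfd(on_disk_name)
    else:
        # Platforms that return names as written: match on the current column.
        query, key = "SELECT name FROM nodes WHERE name = ?", on_disk_name
    row = db.execute(query, (key,)).fetchone()
    return row[0] if row else None


record("\u00e4.txt")                                # server name is NFC
print(find("a\u0308.txt", fs_normalises=True))      # found via name_nfd
print(find("a\u0308.txt", fs_normalises=False))     # no exact match
```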

> We could of course rename the on-disk file. Except this
> won't work on Mac OS X unless we decide to use NFD encoding. So we could
> decide to also use NFD everywhere -- but this would break as soon as
> some other operating system decides to normalise to NFC, so it's not a
> good solution. We could also open the parent directory, read all the
> filenames within it, normalise them all, and then search the resulting
> list. This works, except if a name exists twice, once in NFC form and once
> in NFD form. We'd somehow have to solve the name collision in the
> filesystem.

This way there will be no new collision issues, only the same old ones on Mac, and it will no longer be possible to create new collisions.

> But well, let's assume we had a way of storing NFC in meta-data and not
> caring about the on-disk form. Now things get even more complicated.
> My friend is not willing to upgrade to a new client version yet, which
> is fine because all 1.x releases of Subversion clients are supposed
> to be compatible with all 1.y releases of Subversion servers. He should
> not have to upgrade his client just because the server was upgraded.


> In his working copy, the file name is also in NFD form. When he
> talks to the server, the server provides the name in NFC. Because he
> is using the old client the client has no way of knowing how to map
> the NFC name to its local NFD file. So we've broken backwards
> compatibility for my friend.

No problem.

> But it gets worse. Recall the filesystem name collision problem
> mentioned above. This problem can also happen in the repository
> filesystem! For instance, assume that in the repository there already
> exist two filenames, one NFD, the other NFC, but they both are actually
> the same name. This currently works fine, except on Mac OS X.
> What should be done now when the server is upgraded to normalise all paths
> to NFC? How can we still access content of the file which has the name
> in NFD form? Should one of the files be renamed in the HEAD revision?
> Or all historic revisions? Or removed from history? How do we help users
> carrying out such upgrades, without breaking existing working copies used
> by older clients which do not know anything about the NFC/NFD problem?

This solution avoids this whole mess.

> These are the questions which we'll need to answer to solve this issue.
> I honestly do not have good answers. I hope that you will find ways of
> solving these problems.
> There may even be more problems hidden here which I haven't thought of yet.
> It will be quite hard to thoroughly make sure that no unforeseen problems
> will arise when this issue gets fixed one way or another. A good solution
> needs to be carefully planned, implemented, and thoroughly tested.
> I think the following caveats would be acceptable if they help
> with fixing the issue:
> - An upgrade path which optionally requires people to check all
> working copies out again, when either the server or the client is upgraded.
> Note again, this must be _optional_. Only people affected by the issue
> should have to make this choice, e.g. by changing configuration
> parameters from the default settings. By default, existing working
> copies should keep working after upgrading the client or server.
> Because imagine what would happen if an upgrade of the server broke
> many working copies checked out from a hosting service such as
> sourceforge.net -- not good.

No problem

> - An upgrade path which requires everyone to run 'svn upgrade' on their
> working copies in order to use the new client version, but not the
> new server version.

Yes, will be required.

> - An upgrade path which requires people to dump/load their existing
> repositories in order to get rid of the problem. Existing
> repositories which are left alone should keep working as they do
> today, with problems on Mac OS X clients but no problems on other
> clients (anything else would cause too much breakage and confusion).
> E.g. this step could normalise all paths in all revisions. But keep in
> mind the problem of name collisions which can happen when the same name
> exists as both NFC and NFD. Something needs to happen in this case to
> resolve the problem, ideally giving users a choice about how to proceed.

No need to dump/load. Just need to rename collisions in HEAD in order to get Mac clients back into the game.
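
Finding the collisions that need renaming could look roughly like this. It is a sketch only: `find_collisions` is a hypothetical helper, and the list of paths is assumed to come from something like a recursive listing of HEAD.

```python
import unicodedata
from collections import defaultdict


def find_collisions(paths):
    """Group paths that become identical once normalised to NFD."""
    groups = defaultdict(list)
    for path in paths:
        groups[unicodedata.normalize("NFD", path)].append(path)
    # Only groups with more than one spelling are collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up example paths; the first two are the same visible name.
head_paths = [
    "trunk/\u00e4.txt",      # NFC
    "trunk/a\u0308.txt",     # NFD
    "trunk/readme.txt",
]

for group in find_collisions(head_paths):
    print("rename one of:", [p.encode("utf-8") for p in group])
```

Printing the UTF-8 bytes makes the two spellings distinguishable, since they render identically in most terminals.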

> As you can see, there is a lot of complexity involved in fixing this
> issue. I hope you aren't discouraged by this. Someone will need to
> explore the details of these problems to fix this issue. I am not convinced
> that it is impossible to fix. We'll need to be very careful about backwards
> compatibility when making decisions. But there might be ways to achieve a
> satisfying solution nonetheless.

/Thomas Å.
Received on 2012-02-09 02:58:34 CET

This is an archived mail posted to the Subversion Dev mailing list.