
Re: Let's discuss about unicode compositions for filenames!

From: Hiroaki Nakamura <hnakamur_at_gmail.com>
Date: Tue, 31 Jan 2012 01:42:21 +0900

Hi,

2012/1/30 Stefan Sperling <stsp_at_elego.de>:
> On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:

> Let's say I have a working copy which contains filenames normalised
> to NFD, as is the case on Mac OS X. The server gets upgraded to a new
> release of Subversion which contains your patch. This means the server
> will now send all paths as NFC. Let's say there are changes made to a
> file which has 3 "a umlaut" characters in its name. When I run 'svn update'
> my client will try to find the NFC form of the name in its meta-data,
> and fail to locate it because the file was stored as NFD.

Well, my patch is supposed to be applied to both servers and clients.
Clients with the patched svn_path_cstring_to_utf8 in libsvn_subr/path.c will
convert NFD paths obtained from the local filesystem to NFC on the client side.
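
To illustrate the conversion, here is a minimal Python sketch. It is not the
actual C patch; to_nfc is a hypothetical helper name, and the example
filenames are made up. It only shows that the NFD and NFC spellings of
"a umlaut" differ as strings until they are normalised:

```python
import unicodedata

def to_nfc(path: str) -> str:
    """Hypothetical helper: return the path normalised to NFC."""
    return unicodedata.normalize("NFC", path)

# 'a' followed by COMBINING DIAERESIS (U+0308): the decomposed (NFD) spelling.
nfd_name = "a\u0308.txt"
# Precomposed LATIN SMALL LETTER A WITH DIAERESIS (U+00E4): the NFC spelling.
nfc_name = "\u00e4.txt"

# The two spellings are different strings before normalisation...
assert nfd_name != nfc_name
# ...and identical once the NFD one has been converted to NFC.
assert to_nfc(nfd_name) == nfc_name
```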

>
> So this means your patch will break compatibility with the working copy.
> Therefore, we would need to provide an upgrade path for those working
> copies. E.g. 'svn upgrade' could be modified to normalise all filenames
> stored in the DB to NFC. Problem solved.
>
> But now comes the next problem. Given a filename in NFC which we read from
> meta data, how can we locate the corresponding on-disk file if its form
> is not NFC? We could of course rename the on-disk file. Except this
> won't work on Mac OS X unless we decide to use NFD encoding. So we could
> decide to also use NFD everywhere -- but this would break as soon as
> some other operating system decides to normalise to NFC, so it's not a
> good solution. We could also open the parent directory, read all the
> filenames within it, normalise them all, and then search the resulting
> list. This works, except if a name exists twice, once in NFC form and once
> in NFD form. We'd somehow have to solve the name collision in the
> filesystem.

In my experiments, NFC filenames in meta-data are automatically converted
by the filesystem and saved as NFD filenames on Mac OS X. I committed NFC
filenames on Windows to my Linux server, then checked them out on Mac OS X
and found that the filenames on disk were NFD. So we can just use NFC
everywhere in Subversion.

On the client side, we must first convert NFD filenames obtained from Mac OS X
filesystems to NFC, and after that we just compare them to the NFC filenames
in meta-data.
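
The lookup could be sketched roughly as follows (Python for illustration
only; find_on_disk is a hypothetical name, and in a real client the entries
would come from reading the directory, e.g. with os.listdir, rather than from
a hard-coded list):

```python
import unicodedata

def find_on_disk(nfc_name, dir_entries):
    """Return the on-disk spelling whose NFC form equals nfc_name, else None.

    nfc_name is assumed to be the NFC name stored in meta-data;
    dir_entries are the names exactly as the filesystem reports them.
    """
    for entry in dir_entries:
        if unicodedata.normalize("NFC", entry) == nfc_name:
            return entry
    return None

# Names as a Mac OS X filesystem might report them (decomposed).
entries = ["README", "a\u0308bc.txt"]

# The meta-data holds the precomposed (NFC) name.
found = find_on_disk("\u00e4bc.txt", entries)
assert found == "a\u0308bc.txt"
```

Note that this returns the first match only, so it silently hides the case
where both an NFC and an NFD spelling exist in the same directory.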

>
> But well, let's assume we had a way of storing NFC in meta-data and not
> caring about the on-disk form. Now things get even more complicated.
>
> My friend is not willing to upgrade to a new client version yet, which
> is fine because all 1.x releases of Subversion clients are supposed
> to be compatible with all 1.y releases of Subversion servers. He should
> not have to upgrade his client just because the server was upgraded.
>
> In his working copy, the file name is also in NFD form. When he
> talks to the server, the server provides the name in NFC. Because he
> is using the old client the client has no way of knowing how to map
> the NFC name to its local NFD file. So we've broken backwards
> compatibility for my friend.

I think we cannot avoid this. So this patch is for 2.x, which may
break backward compatibility.

>
> But it gets worse. Recall the filesystem name collision problem
> mentioned above. This problem can also happen in the repository
> filesystem! For instance, assume that in the repository there already
> exist two filenames, one NFD, the other NFC, but they both are actually
> the same name. This currently works fine, except on Mac OS X.
> What should be done now when the server is upgraded to normalise all paths
> to NFC? How can we still access content of the file which has the name
> in NFD form? Should one of the files be renamed in the HEAD revision?
> Or all historic revisions? Or removed from history? How do we help users
> carrying out such upgrades, without breaking existing working copies used
> by older clients which do not know anything about the NFC/NFD problem?

If we have two files with the same filename, one in NFC and the other in NFD,
normalizing all paths to NFC is a real headache. The only thing we can do is
keep one of the two files and throw the other away.

In reality, I think this is a rare case. If we find such a collision when
upgrading a repository, we should stop and provide a way for users to choose
which one to keep.
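
Detecting such collisions before an upgrade could look roughly like this
(a Python sketch under my own assumptions; find_nfc_collisions is a
hypothetical name, and a real tool would have to walk the paths in every
revision of the repository rather than a plain list):

```python
import unicodedata
from collections import defaultdict

def find_nfc_collisions(names):
    """Group names by their NFC form and return only the groups that collide.

    The result maps each NFC name to the list of distinct original
    spellings that normalise to it (two or more entries per list).
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {nfc: spellings for nfc, spellings in groups.items()
            if len(spellings) > 1}

# One directory holding the same name twice, once NFC and once NFD.
names = ["\u00e4.txt", "a\u0308.txt", "b.txt"]
collisions = find_nfc_collisions(names)

# Only the "a umlaut" name collides; the upgrade would stop and ask here.
assert collisions == {"\u00e4.txt": ["\u00e4.txt", "a\u0308.txt"]}
```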

>
> These are the questions which we'll need to answer to solve this issue.
> I honestly do not have good answers. I hope that you will find ways of
> solving these problems.
>
> There may even be more problems hidden here which I haven't thought of yet.
> It will be quite hard to thoroughly make sure that no unforeseen problems
> will arise when this issue gets fixed one way or another. A good solution
> needs to be carefully planned, implemented, and thoroughly tested.
>
> I think the following caveats would be acceptable if they help
> with fixing the issue:
>
>  - An upgrade path which optionally requires people to check all
>   working copies out again, when either the server or the client is upgraded.
>   Note again, this must be _optional_. Only people affected by the issue
>   should have to make this choice, e.g. by changing configuration
>   parameters from the default settings. By default, existing working
>   copies should keep working after upgrading the client or server.
>   Because imagine what would happen if an upgrade of the server broke
>   many working copies checked out from a hosting service such as
>   sourceforge.net -- not good.

Existing working copies may have NFD filenames, so if the upgrade is optional,
we must take care of them. However, it is easy. We just always convert
filenames obtained from working copy meta-data to NFC before any
comparisons.

>
>  - An upgrade path which requires everyone to run 'svn upgrade' on their
>   working copies in order to use the new client version, but not the
>   new server version.
>
>  - An upgrade path which requires people to dump/load their existing
>   repositories in order to get rid of the problem. Existing
>   repositories which are left alone should keep working as they do
>   today, with problems on Mac OS X clients but no problems on other
>   clients (anything else would cause too much breakage and confusion).
>   E.g. this step could normalise all paths in all revisions. But keep in
>   mind the problem of name collisions which can happen when the same name
>   exists as both NFC and NFD. Something needs to happen in this case to
>   resolve the problem, ideally giving users a choice about how to proceed.

I agree.

>
> As you can see, there is a lot of complexity involved in fixing this
> issue. I hope you aren't discouraged by this. Someone will need to
> explore the details of these problems to fix this issue. I am not convinced
> that it is impossible to fix. We'll need to be very careful about backwards
> compatibility when making decisions. But there might be ways to achieve a
> satisfying solution nonetheless.

As other people have said, we should prohibit NFC/NFD collisions of the same
filename, not in the Subversion system itself but through operational rules:
just don't do that.

Then the remaining problem seems rather simple. Convert *all* input paths to
NFC first, then do the work. "All input" means paths passed to servers from
clients, paths obtained by servers from repositories, and paths obtained by
clients from working copies. Is that correct?

-- 
)Hiroaki Nakamura) hnakamur_at_gmail.com
Received on 2012-01-30 17:42:54 CET

This is an archived mail posted to the Subversion Dev mailing list.
