[RFC] Non-normalizing Unicode Composition Awareness (was: Let's discuss about unicode compositions for filenames!)
From: Thomas Åkesson <thomas_at_akesson.cc>
Date: Tue, 14 Feb 2012 01:34:51 +0100
Title: Non-normalizing Unicode Composition Awareness
In Unicode, some characters can be represented in two different ways (composed or decomposed) while rendering identically on screen or in print. A Unicode string (e.g. a file name) can be in one of two normalized forms (NFC/NFD) or mixed, i.e. containing several such characters where some are composed and others decomposed (rare).
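As a small illustration of the composed/decomposed distinction, here is a sketch using Python's standard unicodedata module. The example characters are my own choice and are not taken from any particular repository.

```python
import unicodedata

composed = "\u00e5"      # "å" as a single code point (NFC form)
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE (NFD form)

# Same rendering, different code point sequences.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalizing both to the same form makes them compare equal.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# A "mixed" string: the first character composed, the second decomposed.
mixed = "\u00e5" + "o\u0308"
is_nfc = unicodedata.normalize("NFC", mixed) == mixed
is_nfd = unicodedata.normalize("NFD", mixed) == mixed
```

For the mixed string neither check holds, which is the rare third case mentioned above.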
The majority of file systems (e.g. NTFS, Ext3) will accept a Unicode filename in any form, store it, and give it back in the form in which it was input. These file systems will typically even accept multiple files where the path looks identical on screen but the Unicode string is different due to character composition.
A minority of file systems (currently Mac OS X HFS+ only) will normalize the paths. In the case of HFS+, the path will be normalized into NFD and it will even be given back that way when listing the filesystem.
Most significant differences from the majority of filesystems:
The topic has been described here:
- This RFC is not as complete in all areas, and depends on this note for additional context and issue description.
- Subversion and most file systems currently allow creation of multiple paths which are identical in normalized form. These are hereafter referred to as "normalized-name collisions". This could cause significant upgrade issues for repositories containing such collisions, depending on which solution is implemented. See section "Legacy Data".
- Users have difficulty understanding and managing "normalized-name collisions". It is difficult to know which file is which and one of the paths is typically not possible to type on a keyboard.
- Mac OS X clients cannot interoperate with non-OSX clients when paths contain composed characters (added by a non-OSX client). The working copies are broken directly after checkout/update on OSX. Tracked by: http://subversion.tigris.org/issues/show_bug.cgi?id=2464
Differences to case-sensitivity
- NFC/NFD look the same when rendered on screen.
Similarities to case-sensitivity
- If two Unicode strings differ only by letter case, on some computer systems they refer to the same file, while on other systems they refer to different files. The same applies if two Unicode strings differ only by composition. The rules are set by each file system.
- Subversion interoperates with different systems. When two file names that differ only by letter case are transferred from a
To Normalize or Not to Normalize
Whether or not to normalize within a Subversion repository (server-side) has been debated. The note (unicode-composition-for-filenames) considers normalization to NFC to be the long term (2.x) solution. This feature is referred to below as "repository normalization".
There are implementation advantages with normalized paths which can simplify comparisons and storage.
There are also reasons not to normalize:
- A file system is generally expected to give back exactly what was stored, or refuse up-front. HFS+ has been criticized for not living up to this expectation, which is also the reason the Svn WC has issues on HFS+. Subversion can be considered a sort of file system, and could therefore be expected to live up to this expectation.
- Compatibility is a high priority for Subversion. Introducing normalization, translation etc. is fairly likely to introduce compatibility issues, now or later. There is a principle that Subversion should not be a limiting factor or impose undue limitations on allowed characters, file names etc.
- Introducing normalization tends to complicate the upgrade process, especially for repositories that contain "normalized-name collisions". This is one of the reasons this very issue has not been addressed.
However, there is very little reason to allow the creation of new "normalized-name collisions". There are no known use-cases for creating multiple files in the same directory that would have identical normalized paths. Subversion should preferably refuse such add operations as early as possible, at the latest during commit. This feature is referred to below as "uniqueness normalization".
There are 2 components of this solution, one server side and one client side. These can be addressed individually, which is an important requirement for Subversion 1.x interoperability between client and server versions.
This solution does not normalize paths in the repository. Paths are only normalized for the purpose of comparisons.
The Subversion server should no longer accept 'add':ing paths that cause "normalized-name collisions". The comparison with existing paths (and other paths in the same txn) should be performed in normalized form. However, the paths created in the repository will keep the form input by the client.
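The intended server behavior can be sketched roughly as follows. This is only an illustration in Python, not Subversion code: the function names are hypothetical, NFC is chosen arbitrarily as the comparison form, and a plain list stands in for the set of paths in a directory or txn.

```python
import unicodedata

def collision_key(path):
    """Comparison-only form of a path. It is never stored."""
    return unicodedata.normalize("NFC", path)

def find_collision(new_path, existing_paths):
    """Return the existing path that new_path collides with, or None.

    An exact duplicate also matches, which corresponds to the ordinary
    "path already exists" case.
    """
    key = collision_key(new_path)
    for existing in existing_paths:
        if collision_key(existing) == key:
            return existing
    return None

def add_path(new_path, existing_paths):
    """Refuse colliding adds; otherwise keep the path exactly as input."""
    clash = find_collision(new_path, existing_paths)
    if clash is not None:
        raise ValueError("normalized-name collision with %r" % clash)
    existing_paths.append(new_path)  # stored in the client's form, not normalized
```

The point of the sketch is that normalization only happens inside collision_key(); what ends up in existing_paths is byte-for-byte what the client sent.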
There could be a performance impact. [Need more data] However, the 'add' operation is not one of the most frequent ones, in a typical installation.
It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn and by older Subversion clients.
The desired server behavior can be accomplished with Subversion 1.7 or earlier using a pre-commit hook, but it is desirable to have "uniqueness normalization" as the future default behavior.
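A pre-commit hook along these lines might look roughly like the sketch below. It is untested against a real repository and rests on several assumptions: that "svnlook changed -t TXN REPOS" prints status flags followed by the path starting at column 4, that "svnlook tree --full-paths -t TXN REPOS" prints one path per line, and that the output decodes correctly under the hook's locale. All function names are hypothetical.

```python
import subprocess
import unicodedata

def added_paths(changed_output):
    """Extract added paths from the text output of `svnlook changed`."""
    paths = []
    for line in changed_output.splitlines():
        # Assumed layout: status flags first, path from column 4 onwards.
        if line.startswith("A"):
            paths.append(line[4:].rstrip("/"))
    return paths

def normalized_collisions(paths):
    """Group paths by NFC form and return the groups with several members."""
    groups = {}
    for path in paths:
        groups.setdefault(unicodedata.normalize("NFC", path), set()).add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]

def check_transaction(repos, txn):
    """Return collision groups that involve a path added in this txn."""
    def svnlook(*args):
        # Decoding relies on the locale of the hook environment (assumption).
        result = subprocess.run(["svnlook", *args, "-t", txn, repos],
                                capture_output=True, text=True, check=True)
        return result.stdout

    added = set(added_paths(svnlook("changed")))
    tree = [p.rstrip("/") for p in svnlook("tree", "--full-paths").splitlines()]
    return [g for g in normalized_collisions(tree) if added & set(g)]
```

A hook script would call check_transaction() with the REPOS and TXN arguments it receives, print any groups to stderr and exit non-zero to reject the commit. Only collisions involving newly added paths are reported, so existing collisions do not block unrelated commits. Listing the whole tree on every commit is likely too slow for large repositories; a real hook would probably only list the parent directories of the added paths.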
The Working Copy needs an abstraction between the repository path provided by the server and the actual file system path. This is required for normalizing file systems (HFS+) regardless of whether the Subversion server performs normalization to NFC (repository normalization) or just enforces "uniqueness normalization".
It might be more feasible to implement such an abstraction now in wc-ng than it was in svn 1.6 and earlier.
[This section needs input from someone more familiar with wc-ng]
Columns of interest in wc.db:
- The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")
An abstraction between the repository path and the file system path can be achieved by ensuring that there is a column in wc.db that contains the file system path in exactly the same form that the file system gives back. APIs in wc need to be extended to ensure that all interaction with the file system is performed with the file system path.
Redefine the existing column local_relpath to contain the path as stored in the file system. Code that currently relies on local_relpath being a substring of repos_path needs to be adjusted. E.g. a node might be considered switched when this condition is not met.
A new column, local_relpath_fs, is added that contains the path as stored in the file system. This column will be used on all systems to interact with the file system. Currently, the content of columns local_relpath and local_relpath_fs will be identical on all file systems except HFS+.
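To make the extra-column idea more concrete, here is a toy sketch using Python's sqlite3 module. The table nodes_sketch is hypothetical and far simpler than the real wc.db schema, and the behavior of HFS+ is only approximated with standard NFD (as far as I know HFS+ uses its own variant of the decomposition rules).

```python
import sqlite3
import unicodedata

def hfs_plus_form(path):
    # Approximation only: standard NFD stands in for what HFS+ gives back.
    return unicodedata.normalize("NFD", path)

def identity_form(path):
    # Non-normalizing file systems give back what was stored.
    return path

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE nodes_sketch (
                  repos_path       TEXT,
                  local_relpath    TEXT PRIMARY KEY,
                  local_relpath_fs TEXT UNIQUE)""")

def record_node(db, repos_path, local_relpath, fs_form=identity_form):
    """Store a node together with the path form the file system returns."""
    db.execute("INSERT INTO nodes_sketch VALUES (?, ?, ?)",
               (repos_path, local_relpath, fs_form(local_relpath)))

def fs_path_for(db, local_relpath):
    """Look up the path to use for all interaction with the file system."""
    row = db.execute("SELECT local_relpath_fs FROM nodes_sketch "
                     "WHERE local_relpath = ?", (local_relpath,)).fetchone()
    return row[0] if row else None
```

With identity_form the two columns hold the same value, as on non-normalizing file systems. With hfs_plus_form a composed local_relpath maps to a decomposed local_relpath_fs, and the UNIQUE constraint in this sketch makes a second node with the same file system form fail with sqlite3.IntegrityError.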
Path uniqueness should be checked in normalized form during add operations, in order to prevent "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since it is a quite rare condition.
When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath/local_relpath_fs and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
This change will only affect use cases which rely on creating paths that look like duplicates but use different Unicode composition. It is highly unlikely anyone is relying on this.
- This change will cause no problems when upgrading existing repositories even if they contain "normalized-name collisions".
- If "normalized-name collisions" exist in HEAD, a check out on Mac OS X will still fail after an upgrade but potentially with a better error message. This is an issue that is very similar to case-collisions on case-insensitive file systems. The detection code is similar and the same friendly error message can potentially be used.
- These "normalized-name collisions" can be resolved in HEAD via "svn mv SRC_URL DST_URL". Historical revisions will still be difficult to check out from Mac OS X.
- Working Copies will be upgraded in the same way as any other wc-ng upgrade with SQL schema changes. Working Copies on Mac OS X that are broken before upgrade might require a fresh check out.
This is an archived mail posted to the Subversion Dev mailing list.