Proposed resolution: Standardizing on UTF-8 isn't enough

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: 2007-07-18 16:15:15 CEST

(Management summary at the end)

State of the world as we know it
================================

Filesystem behaviours:
- MacOS X (userland) filesystem APIs are NFD (enforced)
- Windows and Linux filesystem APIs are locale dependent,
  but recoding routines prefer NFC
- Neither Linux nor Windows will enforce NFC path names
  when storing (any kind of) Unicode

Repository content of existing repositories
- We may expect both NFC and NFD paths in existing repositories.
  In Mac-only environments especially, NFD paths may work
  without problems.
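
To make the NFC/NFD difference concrete, here is a minimal sketch using Python's standard unicodedata module. The file name is an illustrative assumption; nothing here is Subversion code.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9): the NFC form.
name_nfc = "caf\u00e9.txt"
# "e" followed by COMBINING ACUTE ACCENT (U+0301): the NFD form.
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(name_nfc == name_nfd)        # False: different code point sequences
print(name_nfc.encode("utf-8"))    # b'caf\xc3\xa9.txt'
print(name_nfd.encode("utf-8"))    # b'cafe\xcc\x81.txt'
# Normalizing the NFD name back to NFC recovers the original string.
print(unicodedata.normalize("NFC", name_nfd) == name_nfc)  # True
```

Both names are valid UTF-8 and render identically, which is why standardizing on UTF-8 alone does not make them compare equal.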

Choosing a standard Unicode Normal Form
=======================================

There may be different ways to resolve the NFC/NFD problem within
Subversion. One of the big concerns is how we want to handle data in
existing repositories.

(1) Recode all paths on Mac to NFC and assume all other systems submit
NFC without checking.
(2) Not standardize on any normal form at all, but make Subversion aware
of the different Unicode forms by adding an additional dependency to
do normalization-agnostic comparisons throughout the code base.
(3) Recode all paths on all systems to NFC (even though this may be a
no-op most of the time on Linux/Windows)

Existing repository concerns
(1) and (3) are the least invasive in the code base, but require
existing repositories to be checked (and patched) for NFD paths,
because the code base will start to assume all internalized paths are
NFC.
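
The check that options (1) and (3) would require could look roughly like the sketch below. The function name find_non_nfc_paths and the sample paths are hypothetical; how the path list is obtained from a repository is left out on purpose.

```python
import unicodedata

def find_non_nfc_paths(paths):
    """Return (path, nfc_path) pairs for every path that is not already NFC."""
    offenders = []
    for path in paths:
        nfc = unicodedata.normalize("NFC", path)
        if nfc != path:
            offenders.append((path, nfc))
    return offenders

# Hypothetical listing; a real check would walk the repository's path names.
repo_paths = [
    "trunk/README",
    "trunk/docs/re\u0301sume\u0301.txt",   # NFD: needs patching
    "trunk/docs/caf\u00e9.txt",            # already NFC
]
for old, new in find_non_nfc_paths(repo_paths):
    print(ascii(old), "->", ascii(new))
```

Pure ASCII paths and paths already in NFC pass through untouched, so on a Linux/Windows-only repository this would usually report nothing.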

(2) is much more invasive, but with that solution all existing
repositories can stay the way they are and the fixed code
automatically does the right thing (i.e. with no need for verification
and patching by the admin).
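
As a rough illustration of what option (2) means, the sketch below compares paths by a normalized key while leaving the stored names alone. The helpers same_path and lookup are hypothetical names, and Python's unicodedata stands in for whatever normalization library would actually be chosen.

```python
import unicodedata

def _key(path):
    # Comparison key only; the stored path is never rewritten.
    return unicodedata.normalize("NFC", path)

def same_path(a, b):
    """True if a and b are the same path regardless of normal form."""
    return _key(a) == _key(b)

def lookup(entries, wanted):
    """Return the entry exactly as stored whose normalized form matches, or None."""
    for entry in entries:
        if same_path(entry, wanted):
            return entry
    return None

stored = ["cafe\u0301.txt"]                 # NFD name in an existing repository
found = lookup(stored, "caf\u00e9.txt")     # request arrives in NFC
print(found == stored[0])                   # True: matched without patching the repository
```

The cost is that every place that compares path names has to go through something like same_path, which is the invasiveness mentioned above.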

NFx <-> Local filesystem interaction
Choosing a standard (and choosing NFC at that) interacts well with the
preference of Linux/Windows to create NFC path names. Mac OS X
enforces NFD, so we can't create incorrectly encoded pathnames there.
Standardizing on NFD is not a good option, because Windows/Linux
prefer creation of NFC filenames and don't protect against having two
files with the same name and different encodings: we'd run a high
chance of ending up with two files with the same name.
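
The "two files with the same name" situation can be detected by grouping names on their normalized form. This is only a sketch under assumed inputs; find_collisions is a hypothetical helper.

```python
import unicodedata
from collections import defaultdict

def find_collisions(names):
    """Map each NFC name to its distinct spellings, keeping only real collisions."""
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    return {nfc: sorted(spellings)
            for nfc, spellings in groups.items() if len(spellings) > 1}

# Hypothetical directory listing from a Linux or Windows working copy.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "notes.txt"]
for nfc, spellings in find_collisions(listing).items():
    print(ascii(nfc), "has spellings", [ascii(s) for s in spellings])
```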

Additional dependency concerns
Options (2) and (3) require us to introduce a new dependency (a
library which handles Unicode normalization for us). Apart from the
additional size (anywhere from several hundred kB to 9 MB), it makes
compilation of Subversion (especially on Windows) harder again.
Option (1) doesn't have this effect: MacOS X has functions built in to
normalize to NFC. No additional dependencies would be required
anywhere.

Correctness concerns
Option (1) has the obvious correctness problem that people aren't
prohibited from creating NFD paths on other operating systems; it's
just that the recoding routines don't *prefer* that form. Most
people won't override the default behaviour, so NFD-encoded paths
should be a rare occurrence there.

Mixed version clients concerns
In an environment where we cannot depend on clients to provide the
internally standardized NFC paths (your typical open source project
comes to mind), options (1) and (3) won't work because paths cannot be
assumed to be NFC everywhere in the system.
In this case, only option (2) is a real solution.

Old servers concerns
Old servers may send both an NFC and an NFD entry to new clients. This
can lead to the inability to check out the content of a repository.
Even worse, a client that supports normalization can't delete the
offending NFD file (only the NFC version) because its input is
recoded to NFC!
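
The deletion problem can be modelled in a few lines. This is a toy model, not Subversion's client logic: delete_entry is a hypothetical function that recodes its input to NFC and then matches entries code point for code point.

```python
import unicodedata

def delete_entry(entries, user_input):
    """Toy client: recode the input to NFC, then remove the exactly matching entry."""
    target = unicodedata.normalize("NFC", user_input)
    return [e for e in entries if e != target]

# An old server hands out both spellings of the same name.
server_entries = ["caf\u00e9.txt", "cafe\u0301.txt"]

# The user names the NFD entry, but the client recodes the request to NFC.
remaining = delete_entry(server_entries, "cafe\u0301.txt")
print([ascii(e) for e in remaining])   # only the NFD entry is left

# Asking again changes nothing: the NFD entry cannot be addressed.
print(delete_entry(remaining, "cafe\u0301.txt") == remaining)   # True
```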

Proposed resolution
Considering the above, combined with the number of reports we have
received so far regarding creation of 2 files with the same name (on
Linux/Windows) - namely none - probably the best option is to use
option (1).
At least, that's what I was going to propose until I realized there
were mixed client version concerns. Now, I think the only option is to
go with (2).
However, we will need to think of something to be able to delete paths
from the repository from new clients (or we punt that and say it's an
admin task...)

Summary
=======

Unicode allows the same character to be represented in more than one
way (for example in NFC and in NFD), a 'defect' from which we suffer
when comparing path names. We need to decide what to do about this
issue in order to create a workable situation on the Mac and to
prevent people from committing two files with the same name to the
repository.

The only solution which seems to work in all cases is to make
Subversion agnostic to these differences in character representation.
This is option (2). This option will require the addition of a
dependency to handle Unicode normalization. This option also has an
impact on all of the code base where we do path name comparisons.

bye,

Erik.

Received on Wed Jul 18 16:14:27 2007

This is an archived mail posted to the Subversion Dev mailing list.
