
Re: Proposed resolution: Standardizing on UTF-8 isn't enough

From: Ivan Zhakov <chemodax_at_gmail.com>
Date: 2007-07-19 12:17:45 CEST

On 7/18/07, Erik Huelsmann <ehuels@gmail.com> wrote:
> (Management summary at the end)
>
> State of the world as we know it
> =======================
>
> Filesystem behaviours:
> - MacOS X (userland) filesystem APIs are NFD (enforced)
> - Windows and Linux filesystem APIs are locale dependent,
> but recoding routines prefer NFC
> - Neither Linux nor Windows will enforce NFC path names
> when storing (any kind of) Unicode
>
> Repository content of existing repositories
> - We may expect both NFC and NFD paths in existing repositories.
> In Mac-only environments especially, NFD paths may work
> without problems.
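To make the NFC/NFD difference concrete, here is a minimal sketch. It uses Python's standard unicodedata module purely for illustration, and the file name is made up:

```python
import unicodedata

# The same visible name, "café.txt", in the two normal forms.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")  # é as one code point, U+00E9
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")  # e followed by combining acute, U+0301

# Different code point sequences mean different UTF-8 byte strings,
# so a bytewise path comparison sees two distinct names.
nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")
names_differ = nfc_bytes != nfd_bytes
```

Both names render identically on screen, which is why the mismatch is easy to miss.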
>
>
>
> Choosing a standard Unicode Normal Form
> ===============================
>
> There may be different ways to resolve the NFC/NFD problem within
> Subversion. One of the big concerns is how we want to handle data in
> existing repositories.
>
> (1) Recode all paths on Mac to NFC and assume all other systems submit
> NFC without checking.
> (2) Not standardize on any encoding at all, but make Subversion aware
> of the different Unicode forms by adding an additional dependency to
> do form-agnostic comparisons throughout the code base.
> (3) Recode all paths on all systems to NFC (even though this may be a
> no-op most of the time on Linux/Windows)
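As a rough sketch of how (2) and (3) differ in code: (2) normalizes at every comparison and leaves stored paths alone, while (3) normalizes once when a path enters the system. The helper names are hypothetical, and Python's unicodedata stands in for whatever normalization library would actually be used:

```python
import unicodedata


def paths_equal(a: str, b: str) -> bool:
    """Option (2), hypothetical helper: form-agnostic comparison.

    Stored paths keep whatever form they had; only the comparison
    normalizes both sides.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def internalize_path(path: str) -> str:
    """Option (3), hypothetical helper: recode every incoming path to NFC.

    After this, plain == on internalized paths is enough, but only if
    every path in the system went through this function.
    """
    return unicodedata.normalize("NFC", path)
```

The trade-off is visible in the sketch: (2) has to touch every comparison site, while (3) depends on every path having been internalized.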
>
>
> Existing repository concerns
> (1) and (3) are the least invasive in the code base, but require
> existing repositories to be checked (and patched) for NFD paths,
> because the code base will start to assume all internalized paths are
> NFC.
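The check itself could be as simple as the sketch below. How the list of paths is obtained from a repository is left out, and find_non_nfc_paths is a hypothetical name:

```python
import unicodedata


def find_non_nfc_paths(paths):
    """Return the paths that change when normalized to NFC.

    These are the entries an admin would have to patch before the
    code base could assume NFC everywhere.
    """
    return [p for p in paths if unicodedata.normalize("NFC", p) != p]
```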
>
> (2) is much more invasive, but in that solution, all existing
> repositories can stay the way they are and the fixed code
> automatically does the right thing (i.e. with no need for verification
> and patching from the admin).
>
>
> NFx <-> Local filesystem interaction
> Choosing a standard (and choosing NFC at that) interacts well with the
> preference of Linux/Windows to create NFC path names. Mac OS X
> enforces NFD, so we can't create incorrectly encoded pathnames there.
> Standardizing on NFD is not a good option, because Windows/Linux
> prefer creation of NFC filenames and don't protect against having 2
> files with the same name and different encodings: we'd run a high
> chance of ending up with 2 files with the same name.
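To illustrate the "2 files with the same name" case, here is a sketch that groups a directory listing by normalized name and reports groups with more than one spelling. The listing is passed in as plain strings, so nothing here depends on a particular filesystem, and the function name is hypothetical:

```python
import unicodedata
from collections import defaultdict


def find_form_collisions(names):
    """Group names by their NFC form and return groups with more
    than one distinct spelling (i.e. names that look identical but
    differ in normal form)."""
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```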
>
>
> Additional dependency concerns
> Options (2) and (3) require us to introduce a new dependency (a
> library which handles Unicode normalization for us). Apart from the
> additional size (anywhere from several hundred kB to 9 MB), it makes
> compilation of Subversion (especially on Windows) harder again.
> Option (1) doesn't have this effect: MacOS X has functions built in to
> normalize to NFC. No additional dependencies would be required
> anywhere.
>
>
> Correctness concerns
> Option (1) has the obvious correctness problem that people aren't
> prohibited from creating NFD paths on other operating systems; the
> recoding routines just don't *prefer* that encoding. Most
> people won't override the behaviour, so NFD encoded paths should
> be rare.
>
> Mixed version clients concerns
> In an environment where we cannot depend on clients to provide the
> internally standardized NFC paths (your typical open source project
> comes to mind), options (1) and (3) won't work because paths cannot be
> assumed to be NFC everywhere in the system.
> In this case, only option (2) is a real solution.
>
> Old servers concerns
> Old servers may send both an NFC and an NFD entry to new clients. This
> can lead to the inability to check out the content of a repository.
> Even worse, a supporting client can't delete the offending NFD file
> (only the NFC version) because its input is recoded to NFC!
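The deletion problem can be simulated in a few lines. This is only a toy model with made-up entries and a hypothetical client_delete function, not how the client actually addresses paths:

```python
import unicodedata


def client_delete(entries, user_input):
    """Simulate a client that recodes all input to NFC before use.

    Returns True if an entry was removed.
    """
    target = unicodedata.normalize("NFC", user_input)
    return entries.pop(target, None) is not None


# Hypothetical listing from an old server: two entries that differ
# only in normal form.
server_entries = {"caf\u00e9.txt": "NFC entry", "cafe\u0301.txt": "NFD entry"}

# The user names the NFD file, but the recoding turns it into the NFC name.
first_attempt = client_delete(server_entries, "cafe\u0301.txt")
# A second attempt finds nothing, although the NFD entry is still there.
second_attempt = client_delete(server_entries, "cafe\u0301.txt")
```

After both calls the NFC entry is gone and the NFD entry is still present, with no input that can reach it.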
>
> Proposed resolution
> Considering the above, combined with the number of reports we have
> received so far regarding creation of 2 files with the same name (on
> Linux/Windows) - namely none - probably the best option is to use
> option (1).
> At least, that's what I was going to propose until I realized there
> were mixed client version concerns. Now, I think the only option is to
> go with (2).
> However, we will need to think of something to be able to delete paths
> from the repository from new clients (or we punt that and say it's an
> admin task...)
>
>
> Summary
> =======
>
> Unicode allows 2 different representations of the same character
> (composed and decomposed), a 'defect' from which we suffer when
> comparing pathnames. We need to decide what to do about this issue
> in order to create a workable situation on the Mac and to prevent
> people from committing a file with the same name twice to the
> repository.
>
> The only solution which seems to work in all cases is to make
> Subversion agnostic to these differences in character representation.
> This is option (2). This option will require the addition of a
> dependency to handle Unicode normalization. This option also has an
> impact on all of the code base where we do path name comparisons.
>
> bye,
>
Hi Erik,
Thanks for the great summary. To me, option (1) is the most reasonable.
I think this fix should be implemented at the APR level, because it's
an OS-level problem.

Also, we can standardize on the NFC form within Subversion, without
enforcing or checking it.
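In other words, something along the lines of the sketch below, done once at the OS boundary. This is Python for illustration only, not APR code. The from_os_path name and the explicit platform parameter are assumptions made so the behaviour is easy to see:

```python
import sys
import unicodedata


def from_os_path(name: str, platform: str = sys.platform) -> str:
    """Hypothetical boundary conversion for option (1).

    On Mac OS X the filesystem APIs hand back NFD, so recode to NFC.
    Elsewhere the name is passed through unchecked, on the assumption
    that it is already NFC.
    """
    if platform == "darwin":
        return unicodedata.normalize("NFC", name)
    return name
```

Note that the pass-through branch is exactly the "without enforcing and checking" part: an NFD name coming from Linux or Windows would go through unchanged.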

BTW, I really don't like the idea of adding another big dependency (ICU),
since we already have libraries for doing UTF-8 stuff (apr-iconv and
the Windows API).

-- 
Ivan Zhakov
Received on Thu Jul 19 12:16:54 2007

This is an archived mail posted to the Subversion Dev mailing list.