[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

UTF-8 NFC/NFD paths issue

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: Wed, 15 Sep 2010 23:20:06 +0200

Yesterday, I was talking to CMike about our long-standing issue with UTF-8
strings designating a certain path not neccessarily being equal to other
strings designating the same path. The issue has to do with NFC (composed)
and NFD (decomposed) representation of Unicode characters. CMike nicely
called the issue the "Erik Huelsmann issue" yesterday :-)

The issue consists of two parts:
 1. The repository which should determine that paths being added by a commit
are unique, regardless of their encoding (NFC/NFD)
 2. The client which should detect that the pathnames coming in from the
filesystem may differ in encoding from what's in the working copy
administrative files [this is mainly an issue on the Mac:

Mike, the thing I have been trying to find around our filesystem
implementation is where an editor drive adding a path [add_directory() or
add_file()] checks whether the file already exists. The check at that point
should be encoding independent, for example by making all paths NFC (or NFD)
before comparison. You could use utf8proc (
http://www.flexiguided.de/publications.utf8proc.en.html) to do the
normalization - it's very light-weight in contrast to ICU which provides the
same fuctionality, but has a much broader scope.

The problem I was telling you about is that I was looking in libsvn_fs_base
to find where the existence check is performed, but I couldn't find it.

Basically what I was trying to do is: do what we do now (ie fail if the path
exists and succeed if it doesn't), with the only difference that the paths
used for comparison are guarenteed to be the same normalization - meaning
they are the same byte sequence when they're equal unicode.


Received on 2010-09-15 23:20:47 CEST

This is an archived mail posted to the Subversion Dev mailing list.