[Jesper Steen Møller]
> This is your umlaut ö "decomposed". File systems on OSX are expected
> to do this (I know very little OSX stuff, but stumbled upon this:
> <http://developer.apple.com/qa/qa2001/qa1173.html>) This is NFD
> (normalization form "decomposed", as opposed to NFC, C for
> "composed"). There are also NFKD and NFKC, which add "kompatibility"
> into the mix, for things like ligatures (whether fi and ff are single
> glyphs or not).
Right, we can ignore NFKC and NFKD.
Anyway, the problem here is that Subversion normalises filenames (and
other input) to UTF-8 but not to a specific normalisation form.
Assuming that every user of a given repository will use the same
normalisation form is conceptually not much better than assuming
they'll all use the same character set.
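
To make the problem concrete, here is a small sketch in Python (chosen only for illustration; it uses the standard unicodedata module, and the variable names are made up). It shows that the same visible "ö" is two different UTF-8 byte strings depending on the normalisation form, so a byte-wise filename comparison treats them as different names:

```python
import unicodedata

# The same visible character, written in the two forms discussed above.
composed = "\u00f6"      # NFC: LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # NFD: "o" followed by COMBINING DIAERESIS

nfc_bytes = composed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")

# Both are valid UTF-8, but a byte-wise comparison sees two different names.
same_bytes = nfc_bytes == nfd_bytes

# After normalising both sides to one form, they compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(nfc_bytes, nfd_bytes, same_bytes, same_after_nfc)
```

Both strings are perfectly good UTF-8, which is why converting to UTF-8 alone does not settle the matter.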
You guys should pick one and start enforcing it. NFD is more elegant,
but NFC is more efficient and probably more widely used today, so
that's what I'd suggest using.
Would the following be too small for a Summer of Code project?
- Autoconfage to look for and use libicu if available
- When converting user input to UTF-8, also normalise it to NFC
- Arrange for compatibility with existing repositories full of
  non-normalised filenames. Probably by storing new data as NFC but
  normalising old filenames read in by libsvn_ra. As part of this,
  investigate whether any common tools will produce spurious noise
  when looking at a repo whose NF suddenly changed one day.
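
A rough sketch of the second and third items, again in Python purely for illustration. The real work would be C inside Subversion, presumably using libicu; the function names below are hypothetical and not part of any Subversion API. The first helper normalises to NFC as part of the conversion to UTF-8, and the second lists already-stored names that would change under NFC, which is one way to estimate how much noise an existing repository would see:

```python
import unicodedata


def to_repos_utf8(raw: bytes, charset: str) -> bytes:
    """Hypothetical helper: decode user input from its native charset,
    normalise it to NFC, and return the UTF-8 encoding."""
    text = raw.decode(charset)
    return unicodedata.normalize("NFC", text).encode("utf-8")


def names_needing_nfc(names):
    """Hypothetical helper: return the stored names that are not already
    in NFC, i.e. the ones that would change if normalised on read."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]


if __name__ == "__main__":
    # An NFD name, such as one coming from an OS X client, ends up as NFC.
    print(to_repos_utf8("o\u0308.txt".encode("utf-8"), "utf-8"))
    # Only the decomposed name is reported as needing normalisation.
    print(names_needing_nfc(["a.txt", "o\u0308.txt", "\u00f6.txt"]))
```

Plain-ASCII names are unaffected by either form, so only repositories that already contain decomposed names would see any change.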
Received on Sun Apr 23 21:23:44 2006