
Re: MacOSX filename encoding issue

From: Peter N. Lundblad <peter_at_famlundblad.se>
Date: 2006-04-24 13:56:04 CEST

Peter Samuelson writes:
>
> [Jesper Steen Møller]
> Anyway, the problem here is that subversion normalises filenames (and
> other input) to UTF-8 but not to a specific normalisation form.
> Assuming that every user of a given repository will use the same
> normalisation form is conceptually not much better than assuming
> they'll all use the same character set.

Agreed :-( Maybe it is "better" in the sense that it causes problems
less often than character encoding mismatches do, but it is still a
time bomb...
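To make the problem concrete, here is a small sketch (Python standard
library only; the filename is made up for illustration) showing that the
same visible name has two different UTF-8 byte sequences depending on
the normalisation form:

```python
import unicodedata

# Hypothetical filename, written here in decomposed form (e + combining acute).
decomposed = "re\u0301sume\u0301.txt"

nfc = unicodedata.normalize("NFC", decomposed)  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", decomposed)  # decomposed: e + U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Both are valid UTF-8 and render identically, but a byte-wise
# comparison treats them as two different paths.
same_bytes = nfc_bytes == nfd_bytes
```

Both strings are valid UTF-8, so a check for "is this UTF-8?" passes for
either one, yet a byte-wise path comparison sees two distinct names.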

> You guys should pick one and start enforcing it. NFD is more elegant,
> but NFC is more efficient and probably more widely used today, so
> that's what I'd suggest using.

Just FWIW, we already require NFD in the svn_fs.h API documentation,
but we don't check or enforce it anywhere, so the requirement is only
of theoretical interest. It might be a reason to choose that form, though.
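As a rough idea of what "enforcing" could mean, a check like the one
below would reject any path that is not already in the required form.
This is only a sketch in Python; check_path_form is a hypothetical
helper, not an existing Subversion function:

```python
import unicodedata

def check_path_form(path: str, form: str = "NFD") -> None:
    """Raise ValueError if `path` is not already in the given normalisation form.

    Hypothetical helper for illustration; not part of any Subversion API.
    """
    if unicodedata.normalize(form, path) != path:
        raise ValueError(f"path {path!r} is not in {form}")

# A decomposed name passes the NFD check; a precomposed one is rejected.
check_path_form("re\u0301sume\u0301.txt")
try:
    check_path_form("r\u00e9sum\u00e9.txt")
    rejected = False
except ValueError:
    rejected = True
```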

> Would the following be too small for a Summer of Code project?
>
> - Autoconfage to look for and use libicu if available

I don't know if libicu is the best, since I know nothing about it, so
we might want to leave this choice open for further suggestions.

> - When converting user input to utf-8, also normalise it to NFC
>
> - Arrange for compatibility with existing repositories full of
> non-normalised filenames. Probably by storing new data as NFC but
> normalising old filenames read in by libsvn_ra. As part of this,
> investigate whether any common tools will produce spurious noise
> when looking at a repo whose NF suddenly changed one day.

Given this last one with all the compatibility stuff to verify, I
think this is a reasonable SoC project. And I really want this to
happen, because it is the wrong decade to struggle with encoding
problems like we do today... I mean, in the less common cases.

Much of the groundwork for this has been laid earlier, because we
make sure to canonicalize paths everywhere on input, and we have
special routines to convert paths to/from the system's encoding.
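A minimal sketch of how the proposal could hook into that conversion
step, written in Python for brevity. The function names to_repo_path
and find_existing are made up for illustration and do not correspond to
existing Subversion routines; the choice of NFC follows Peter
Samuelson's suggestion above and is an assumption, not a decision:

```python
import unicodedata
from typing import List, Optional

def to_repo_path(raw: bytes, native_encoding: str) -> bytes:
    """Decode a path from the client's native encoding; return UTF-8 in NFC."""
    text = raw.decode(native_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")

def find_existing(requested: bytes, stored_names: List[bytes]) -> Optional[bytes]:
    """Return the stored name canonically equivalent to `requested`, if any.

    Old repositories may hold non-normalised names, so both sides are
    normalised before comparing instead of comparing raw bytes.
    """
    wanted = unicodedata.normalize("NFC", requested.decode("utf-8"))
    for stored in stored_names:
        if unicodedata.normalize("NFC", stored.decode("utf-8")) == wanted:
            return stored
    return None

# An entry stored long ago in decomposed form.
old_entry = "re\u0301sume\u0301.txt".encode("utf-8")

# A client with a Latin-1 locale supplies the same name, precomposed.
new_input = to_repo_path("r\u00e9sum\u00e9.txt".encode("latin-1"), "latin-1")

# Byte-wise the two differ, but the normalising lookup still finds the entry.
match = find_existing(new_input, [old_entry])
```

New data would go in as NFC, while lookups against old entries compare
normalised forms, which is roughly the compatibility arrangement
described in the quoted list.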

Regards,
//Peter

Received on Mon Apr 24 13:56:35 2006

This is an archived mail posted to the Subversion Dev mailing list.
