
RE: Let's discuss about unicode compositions for filenames!

From: Bert Huijben <bert_at_qqmail.nl>
Date: Mon, 30 Jan 2012 17:47:53 -0800

> -----Original Message-----
> From: Branko Čibej [mailto:brane_at_xbc.nu]
> Sent: maandag 30 januari 2012 16:11
> To: dev_at_subversion.apache.org
> Subject: Re: Let's discuss about unicode compositions for filenames!
> On 31.01.2012 00:14, Peter Samuelson wrote:
> > [Stefan Sperling]
> >> It is indeed harder because we are passing paths verbatim to sqlite.
> >> I doubt having more than one form of a given path in wc.db is fun...
> > That's the implementation I would like to see, to be honest. Start
> > with the observation that we can treat Mac OS X NFD paths as a client
> > character encoding. Now observe that it is lossy. But ... almost all
> > non-Unicode client charsets are equally lossy, for exactly the same
> > reason!
> >
> > This suggests maintaining a mapping table in wc.db between server paths
> > (UTF-8, unspecified NF) and wc paths (local charset, which is
> > occasionally UTF-8 with NFD).
> >
> > This mapping table would be maintained any time we write to the wc.
> > It would be consulted any time we search for files in the wc.
> >
> > It's not really extra work - we have to do those UTF-8 <-> local
> > charset conversions all the time anyway. This would in fact cache
> > those conversions.
> >
> > The implementation on OS X might be a bit hairy, if there isn't an easy
> > way to retrieve the real pathname of the file you just created.
> > Anywhere else, we just store the pathname we just calculated.
> >
> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.
> I don't see what you mean by "lossy" though. NFD and NFC can represent
> exactly the same set of characters, it's just that the representations
> of some of them are different. Thus, this does not preclude normalizing
> the paths in wc.db, and that's even easily automated. If such a
> conversion finds a name collision ... the user is in serious trouble
> already. :)
> It's more likely to find such a collision on Unix than either Mac OS or
> Windows (both of which normalize on the FS API level). But this case is
> probably so rare that I wouldn't worry about it.

Last time we discussed this in depth (a few years ago), Windows didn't perform the normalization you describe here.
Was this added later? (Any documentation pointers?)

I think the keyboard/editor support performs some normalization, so users are unlikely to create non-normalized sequences, but our old documents say that Windows just stores whatever it is passed.
(Probably for the same reason Subversion does: compatibility with the time when we didn't know about these problems.)
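
To make the NFC/NFD point above concrete, here is a minimal sketch in Python (not Subversion code) of what "normalize the stored paths and look for collisions" could mean. The helper name find_normalization_collisions and the sample filenames are made up for illustration; only unicodedata.normalize is a real standard-library call.

```python
import unicodedata


def find_normalization_collisions(paths, form="NFC"):
    """Group paths by normalized form and return only groups with more than one spelling.

    Hypothetical helper for illustration; it is not part of Subversion.
    """
    groups = {}
    for path in paths:
        key = unicodedata.normalize(form, path)
        groups.setdefault(key, []).append(path)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Two spellings of the same visible filename (illustrative values).
nfc_name = "caf\u00e9.txt"    # precomposed e-acute (NFC)
nfd_name = "cafe\u0301.txt"   # 'e' followed by a combining acute accent (NFD)

# The two differ as code point sequences but normalize to the same string,
# so a store that keeps paths verbatim can end up holding both.
collisions = find_normalization_collisions([nfc_name, nfd_name, "plain.txt"])
```

On a filesystem that stores names verbatim both spellings can exist side by side, which is the collision case Brane mentions; the sketch only detects it and does not decide which spelling should win.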

> -- Brane
Received on 2012-01-31 02:48:33 CET

This is an archived mail posted to the Subversion Dev mailing list.