Re: Let's discuss about unicode compositions for filenames!

From: Branko Čibej <brane_at_xbc.nu>
Date: Tue, 31 Jan 2012 01:10:31 +0100

On 31.01.2012 00:14, Peter Samuelson wrote:
> [Stefan Sperling]
>> It is indeed harder because we are passing paths verbatim to sqlite.
>> I doubt having more than one form of a given path in wc.db is fun...
> That's the implementation I would like to see, to be honest. Start
> with the observation that we can treat Mac OS X NFD paths as a client
> character encoding. Now observe that it is lossy. But ... almost all
> non-Unicode client charsets are equally lossy, for exactly the same
> reason!
>
> This suggests maintaining a mapping table in wc.db between server paths
> (UTF-8, unspecified NF) and wc paths (local charset, which is
> occasionally UTF-8 with NFD).
>
> This mapping table would be maintained any time we write to the wc.
> It would be consulted any time we search for files in the wc.
>
> It's not really extra work - we have to do those UTF-8 <-> local
> charset conversions all the time anyway. This would in fact cache
> those conversions.
>
> The implementation on OS X might be a bit hairy, if there isn't an easy
> way to retrieve the real pathname of the file you just created.
> Anywhere else, we just store the pathname we just calculated.
>

AFAIK the OS X API normalizes everything to NFD automagically. So at
least on that platform there's no chance of having more than one form
for the same filename at the same time. Likewise on Windows, which
normalizes to NFC.

I don't see what you mean by "lossy" though. NFD and NFC can represent
exactly the same set of characters; it's just that the representations
of some of them are different. Thus, this does not preclude normalizing
the paths in wc.db, and that's even easily automated. If such a
conversion finds a name collision ... the user is in serious trouble
already. :)
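
To illustrate the point, here is a minimal sketch in Python (not Subversion
code) using the standard unicodedata module. The helper name
find_normalization_collisions and the sample filenames are made up for this
example. It shows that NFC and NFD convert into each other without loss,
and how an automated conversion could report names that collapse to the
same normalized form.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(paths, form="NFC"):
    """Hypothetical helper: group distinct paths by their normalized form.

    Returns (mapping, collisions): mapping goes from each original path to
    its normalized form; collisions lists every normalized form that more
    than one distinct original path maps to.
    """
    groups = defaultdict(list)
    mapping = {}
    for path in sorted(set(paths)):  # ignore exact duplicates
        normalized = unicodedata.normalize(form, path)
        mapping[path] = normalized
        groups[normalized].append(path)
    collisions = {k: v for k, v in groups.items() if len(v) > 1}
    return mapping, collisions


# The same name in both forms: precomposed (NFC) and decomposed (NFD).
name_nfc = "\u010cibej.txt"    # U+010C LATIN CAPITAL LETTER C WITH CARON
name_nfd = "C\u030cibej.txt"   # 'C' followed by U+030C COMBINING CARON

# Converting between the two forms round-trips without losing anything.
assert unicodedata.normalize("NFD", name_nfc) == name_nfd
assert unicodedata.normalize("NFC", name_nfd) == name_nfc

# On a filesystem that does not normalize, both spellings can coexist;
# normalizing the stored paths would then expose the collision.
mapping, collisions = find_normalization_collisions(
    [name_nfc, name_nfd, "README"]
)
```

Here collisions has a single entry, keyed by the NFC spelling and listing
both original spellings. That is the "serious trouble" case above.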

Such a collision is more likely on Unix than on either Mac OS or
Windows (both of which normalize at the FS API level). But this case is
probably so rare that I wouldn't worry about it.

-- Brane
Received on 2012-01-31 01:11:10 CET
