
Re: Comments on 'notes/unicode-composition-for-filenames'

From: Branko Čibej <brane_at_e-reka.si>
Date: Tue, 22 Feb 2011 19:41:12 +0100

On 22.02.2011 18:17, Julian Foad wrote:
>> Proposed Support Library
>> ========================
>>
>> Assumptions
>> -----------
>>
>> The main assumption is that we'll keep using APR for character set
> s/character set/character encoding/.
>
>> conversion, meaning that the recoding solution to choose would not
>> need to provide any other functionality than recoding.
> s/recoding/converting between NFD and NFC UTF8 encodings/.

Actually -- you have to go all the way and support complete
normalization, even if your normalization targets are only NFC and NFD.
That's because there isn't a sane way to detect whether a string is
already normalized -- "sane" in the sense of being cheaper: discovering
that a string is normalized takes about as long as just normalizing it.
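To make that concrete, here's a minimal sketch of such a check, using
utf8proc's utf8proc_NFC() convenience function (is_nfc is a hypothetical
helper name, not anything in our tree). Note that the "check" has to do
the full normalization anyway:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <utf8proc.h>

/* Return true iff STR (NUL-terminated UTF-8) is already in NFC.
   Costs a full normalization pass plus a comparison. */
static bool
is_nfc(const char *str)
{
  utf8proc_uint8_t *nfc = utf8proc_NFC((const utf8proc_uint8_t *)str);
  bool same;

  if (nfc == NULL)
    return false;   /* invalid UTF-8: treat as not normalized */
  same = (strcmp((const char *)nfc, str) == 0);
  free(nfc);        /* utf8proc allocates with malloc */
  return same;
}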

>> Proposed Normal Form
>> ====================
>>
>> The proposed internal 'normal form' should be NFC, if only
>> because it's the more compact of the two forms [...]
>> would give the maximum performance from utf8proc [...]
> I'm not very familiar with all the issues here, but although choosing
> NFC may make individual conversions more efficient, I wonder if a
> solution that involves normalizing to NFD could have benefits that are
> more significant than this. (Reading through this doc sequentially, we
> get to this section on choosing NFC before we get to the list of
> possible solutions, and it looks like premature optimization.)

It's like this: Once we impose a normalization form for our internal
representation, we /always/ have to normalize, regardless of which
system we're on, because we can't (or rather, don't want to) trust the
host system to do it right.

For example, on Windows, file names are NFC/UTF-16; so if APR preserves
the normalization when converting to UTF-8, then our internal
normalization is essentially a no-op -- but we still have to do it, if
only to make sure that our internal representation is correct (see above
about detection not being significantly faster than normalization).
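As a sketch of what that unconditional step could look like -- the
function name is hypothetical, assuming utf8proc plus the usual APR pool
conventions, not actual Subversion code:

#include <apr_pools.h>
#include <apr_strings.h>
#include <stdlib.h>
#include <utf8proc.h>

/* Return PATH (NUL-terminated UTF-8) normalized to NFC, allocated in
   POOL, or NULL if PATH is not valid UTF-8.  Called unconditionally on
   every path entering the internal representation, even on systems
   whose names are usually NFC already. */
static const char *
internal_path(const char *path, apr_pool_t *pool)
{
  utf8proc_uint8_t *nfc = utf8proc_NFC((const utf8proc_uint8_t *)path);
  const char *result;

  if (nfc == NULL)
    return NULL;
  result = apr_pstrdup(pool, (const char *)nfc);  /* copy into the pool */
  free(nfc);                                      /* utf8proc used malloc */
  return result;
}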

> For example, a solution that involves normalizing all input to NFD would
> have the advantages that on MacOSX it would need to do *no* conversions
> and would continue to work with old repositories in Mac-only workshops.

You'd make this configurable? But how? How do you prove that paths in
old repositories are normalized in a certain way? You can only assume
that for paths that you know were normalized before being written to the
repository. And even then, you can't assume too much -- an older tool,
without normalization, can still write denormalized strings to the
repository via file://. Short of keeping an explicit flag for every
path that records whether it's normalized -- which implies changing the
repository format -- you can only really make assumptions about the
normalization of paths written to the repository post-2.0.

-- Brane
Received on 2011-02-22 19:41:50 CET
