On 20.02.2016 22:16, Evgeny Kotkov wrote:
> Branko Čibej <brane_at_apache.org> writes:
>
>> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
>> are not — whereas the latter should be equivalent in German, but I doubt
>> very much that utf8proc does that right. Case-insensitive comparison
>> must *always* be done in the context of a well-defined locale. Anything
>> that calls itself "locale-independent" is likely to be wrong in a really
>> huge number of cases.
> The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
> on how case-insensitive matching should be done [1]:
>
> Default caseless matching is the process of comparing two strings for
> case-insensitive equality. The definitions of Unicode Default Caseless
> Matching build on the definitions of Unicode Default Case Folding.
>
> Default Caseless Matching uses full case folding:
>
> A string X is a caseless match for a string Y if and only if:
> toCasefold(X) = toCasefold(Y)
>
> toCasefold(X): Map each character C in X to Case_Folding(C).
>
> Case_Folding(C) uses the mappings with the status field value “C” or
> “F” in the data file CaseFolding.txt in the Unicode Character Database.
>
> When comparing strings for case-insensitive equality, the strings should
> also be normalized for most correct results.
>
> The behavior we get with this patch is well-defined and follows the spec,
> since we normalize and fold the case of the strings with utf8proc. (The
> UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
> omitting special case T.)
It may be well-defined and follow the spec, but the important question
IMO is whether the behaviour makes sense for our users. Bert mentioned
Turkish (i ∼ İ and I ∼ ı), I mentioned German, where ß ∼ SS, and French,
where ò ∼ O but that doesn't hold in Italian, where ò ∼ Ò ... these are
just off the top of my head, there must be tens or even hundreds of
examples where locale-independent case folding can give unexpected results.
Instead of relying on the Unicode spec, I propose a different approach:
to treat accented letters as if they don't have diacriticals at all.
This should be fairly easy to do with utf8proc: in the intermediate,
32-bit NFD string, remove any character that's in the
combining-diacritical group, and then convert the result to NFC UTF-8.
I've done this before with fairly good results; it's also much easier to
explain this behaviour to users than to tell them, "read the Unicode spec".
>>> But I'm wondering why you added this feature to an existing function?
>>>
>>> I don't think it is recommended practice to perform the normalization this
>>> way and adding a boolean to an existing function makes it easier to do
>>> perform things in a not recommended way.
>> Adding flags that drastically change the semantics of a function is just
>> broken API design, period.
> I don't think that we expose this functionality in a broken way. There aren't
> that many options to choose from, since we need to perform the normalization
> and the case folding in a single call to utf8proc, with appropriate flags set.
> We could add an svn_utf__casefold() function that does both, but I'd rather
> prefer what we have now.
>
> After all, the maintainers of utf8proc expose its features in a quite similar
> fashion [3] — with a normalize_string(..., casefold=true/false) function.
Heh, quoting bad examples doesn't make the API change any better. :)
Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
normalize plus casefold) as a separate API. Internally, it can be
implemented with that extra flag, but even for a private API, I think
it's better to make each function do one thing.
-- Brane
Received on 2016-02-22 09:09:28 CET