Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

From: Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
Date: Sun, 21 Feb 2016 00:16:29 +0300

Branko Čibej <brane_at_apache.org> writes:

> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
> are not - whereas the latter should be equivalent in German, but I doubt
> very much that utf8proc does that right. Case-insensitive comparison
> must *always* be done in the context of a well-defined locale. Anything
> that calls itself "locale-independent" is likely to be wrong in a really
> huge number of cases.

The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
on how case-insensitive matching should be done [1]:

    Default caseless matching is the process of comparing two strings for
    case-insensitive equality. The definitions of Unicode Default Caseless
    Matching build on the definitions of Unicode Default Case Folding.

    Default Caseless Matching uses full case folding:

    A string X is a caseless match for a string Y if and only if:
    toCasefold(X) = toCasefold(Y)

    toCasefold(X): Map each character C in X to Case_Folding(C).

    Case_Folding(C) uses the mappings with the status field value “C” or
    “F” in the data file CaseFolding.txt in the Unicode Character Database.

    When comparing strings for case-insensitive equality, the strings should
    also be normalized for most correct results.
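
As a rough illustration of the definition quoted above (not the code from
the patch, which uses utf8proc), here is a minimal Python sketch. It relies
on Python's built-in str.casefold(), which applies Unicode full case
folding; the function name caseless_match is made up for this example.

```python
def caseless_match(x: str, y: str) -> bool:
    """Default caseless matching: toCasefold(X) == toCasefold(Y).

    str.casefold() applies Unicode full case folding (the C + F
    mappings); it does not apply the locale-specific T mappings.
    caseless_match is a hypothetical name used only in this sketch.
    """
    return x.casefold() == y.casefold()


# Full case folding maps U+00DF (sharp s) to "ss", so these match.
print(caseless_match("ß", "SS"))            # True
print(caseless_match("Straße", "STRASSE"))  # True

# Without the Turkic T mappings, "I" folds to "i" and never to the
# dotless U+0131, so these do not match.
print(caseless_match("I", "\u0131"))        # False
```

Under this default, locale-independent definition, 'ß' and 'SS' do compare
as equal.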

The behavior we get with this patch is well-defined and follows the spec,
since we normalize and fold the case of the strings with utf8proc. (The
UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
omitting special case T.)
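
To show why normalization matters alongside case folding, here is a small
Python sketch. It uses the standard unicodedata module instead of utf8proc,
and it follows one common formulation (decompose, fold, decompose again);
it is not a description of what the Subversion code does internally. The
function names are made up for the example.

```python
import unicodedata


def fold_key(s: str) -> str:
    """Hypothetical helper: build a comparison key from a string.

    Decompose to NFD, apply full case folding, then decompose again,
    because folding can produce text that is no longer normalized.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def normalized_caseless_match(x: str, y: str) -> bool:
    """Compare two strings after normalization and case folding."""
    return fold_key(x) == fold_key(y)


composed = "\u00e9"     # precomposed e with acute accent
decomposed = "E\u0301"  # capital E followed by combining acute accent

# Case folding alone leaves the two encodings different.
print(composed.casefold() == decomposed.casefold())     # False

# Normalizing as well makes them compare as equal.
print(normalized_caseless_match(composed, decomposed))  # True
```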

>> But I'm wondering why you added this feature to an existing function?
>>
>> I don't think it is recommended practice to perform the normalization this
>> way and adding a boolean to an existing function makes it easier to
>> perform things in a not recommended way.
>
> Adding flags that drastically change the semantics of a function is just
> broken API design, period.

I don't think that we expose this functionality in a broken way. There aren't
that many options to choose from, since we need to perform the normalization
and the case folding in a single call to utf8proc, with appropriate flags set.
We could add an svn_utf__casefold() function that does both, but I would
rather keep what we have now.

After all, the maintainers of utf8proc expose its features in a quite similar
fashion [3], with a normalize_string(..., casefold=true/false) function.

[1] http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
[2] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
[3] https://julia.readthedocs.org/en/latest/stdlib/strings/#Base.normalize_string

Regards,
Evgeny Kotkov
Received on 2016-02-20 22:16:55 CET
