On 24.02.2016 14:30, Evgeny Kotkov wrote:
> Branko Čibej <brane_at_apache.org> writes:
>> Instead of relying on the Unicode spec, I propose a different approach:
>> to treat accented letters as if they don't have diacriticals at all.
>> This should be fairly easy to do with utf8proc: in the intermediate,
>> 32-bit NFD string, remove any character that's in the
>> combining-diacritical group, and then convert the result to NFC UTF-8.
>> I've done this before with fairly good results; it's also much easier to
>> explain this behaviour to users than to tell them, "read the Unicode spec".
> I see that utf8proc has a UTF8PROC_STRIPMARK flag that does something
> similar to what you describe. The difference is that this option strips the
> codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
> Me (Enclosing_Mark) categories [1].
>
> Although that's more than just removing the characters that are marked as
> Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
> this flag. How does this fit with what you propose?
This is probably even better than just removing combining diacriticals,
because it should work well with non-Latin/Cyrillic scripts, too.
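
For comparison, the narrower approach (dropping only combining diacriticals from the intermediate 32-bit NFD string) could be sketched roughly as below. This is only an illustration under stated assumptions: the helper names are hypothetical and belong to neither Subversion nor utf8proc, the input is assumed to be already decomposed (NFD) code points, and only the basic Combining Diacritical Marks block (U+0300..U+036F) is checked. UTF8PROC_STRIPMARK covers the whole Mn/Mc/Me categories, which is a larger set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Subversion or utf8proc.
   True if CP lies in the Unicode block "Combining Diacritical Marks"
   (U+0300..U+036F).  Other combining blocks are deliberately ignored. */
static int
is_combining_diacritical(uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Hypothetical helper.  Remove combining diacriticals in place from BUF,
   which is assumed to hold LEN code points that are already in NFD.
   Returns the new length; the caller would then recompose to NFC UTF-8. */
static size_t
strip_combining_diacriticals(uint32_t *buf, size_t len)
{
  size_t in, out = 0;

  assert(buf != NULL || len == 0);
  for (in = 0; in < len; ++in)
    if (!is_combining_diacritical(buf[in]))
      buf[out++] = buf[in];

  return out;
}
```

With the decomposed form of "été" (e, U+0301, t, e, U+0301) as input, the buffer is reduced to the three code points of "ete".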
> Another question is about exposing this ability in the API. I'd say that we
> could do something like this:
>
> svn_utf__transform(svn_boolean_t normalize,
> svn_boolean_t casefold,
> svn_boolean_t remove_diacritics)
>
> (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)
>
> Do you have an opinion or suggestions about that?
The big question here is what we'll use the API for. Currently we have a
'normalize' function that's used by svn_fs_verify (IIRC). Since we're
talking about a function that transforms a UTF-8 string to a shape
suitable for stuff-insensitive comparison, we could follow the example
of the standard strxfrm() -> svn_utf__xfrm(); but if that's too ugly, my
preference is for svn_utf__fold().
However, I'd not add arguments for normalization/case folding/etc; I'd
just make this function DTRT without any additional flags, because
otherwise we'll always be second-guessing the correct invocation.
If there's a use case for case-folding vs. non-case folding, then make
two functions: svn_utf__xfrm and svn_utf__xfrm_casefold.
(Again, obviously, all of these -- including svn_utf__normalize -- need
only one private implementation in the source.)
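
A rough sketch of that shape, with two flag-free entry points sharing one private implementation, might look like the following. Everything here is an assumption made for illustration: the names are hypothetical, the functions work on an already decomposed 32-bit code point buffer rather than on UTF-8, only U+0300..U+036F is stripped, and the ASCII-only lowercasing is a stand-in for real Unicode case folding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical private implementation shared by both entry points.
   BUF is assumed to hold LEN code points in NFD.  Returns the new length. */
static size_t
xfrm_impl(uint32_t *buf, size_t len, int casefold)
{
  size_t in, out = 0;

  assert(buf != NULL || len == 0);
  for (in = 0; in < len; ++in)
    {
      uint32_t cp = buf[in];

      /* Drop combining diacriticals (basic block only, as a stand-in). */
      if (cp >= 0x0300 && cp <= 0x036F)
        continue;

      /* Stand-in for Unicode case folding: ASCII letters only. */
      if (casefold && cp >= 'A' && cp <= 'Z')
        cp += 'a' - 'A';

      buf[out++] = cp;
    }

  return out;
}

/* Hypothetical public entry point: strip marks, keep case. */
static size_t
example_xfrm(uint32_t *buf, size_t len)
{
  return xfrm_impl(buf, len, 0);
}

/* Hypothetical public entry point: strip marks and fold case. */
static size_t
example_xfrm_casefold(uint32_t *buf, size_t len)
{
  return xfrm_impl(buf, len, 1);
}
```

Callers pick the function by name instead of passing booleans, so a call site cannot end up with a mismatched combination of flags.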
-- Brane
Received on 2016-02-24 15:07:12 CET