
Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

From: Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
Date: Wed, 24 Feb 2016 16:30:25 +0300

Branko Čibej <brane_at_apache.org> writes:

> Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
> normalize plus casefold) as a separate API. Internally, it can be
> implemented with that extra flag, but even for a private API, I think
> it's better to make each function do one thing.

After giving it more thought, I agree that a separate API is a better choice
here. For now, I added svn_utf__casefold() in r1732152.

> Instead of relying on the Unicode spec, I propose a different approach:
> to treat accented letters as if they don't have diacriticals at all.
> This should be fairly easy to do with utf8proc: in the intermediate,
> 32-bit NFD string, remove any character that's in the
> combining-diacritical group, and then convert the result to NFC UTF-8.
> I've done this before with fairly good results; it's also much easier to
> explain this behaviour to users than to tell them, "read the Unicode spec".

I see that utf8proc has a UTF8PROC_STRIPMARK flag that does something
similar to what you describe. The difference is that this option strips the
codepoints that fall into either Mn (Nonspacing_Mark), Mc (Spacing_Mark) or
Me (Enclosing_Mark) categories [1].

Although that's more than just removing the characters that are marked as
Combining Diacritical Marks [2,3,4,5], I am thinking that we could just use
this flag. How does this fit with what you propose?
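
To make the difference concrete, here is a minimal sketch in Python rather
than C, so that it runs with only the standard library. It does not call
utf8proc; it only approximates the two behaviours under discussion. The names
strip_marks and strip_diacritic_blocks are hypothetical, and the block ranges
are the ones covered by the charts in [2,3,4,5].

```python
import unicodedata

# General categories named in the message: Mn, Mc and Me.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

# Code point ranges of the Combining Diacritical Marks blocks [2,3,4,5].
DIACRITIC_BLOCKS = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1AB0, 0x1AFF),  # Combining Diacritical Marks Extended
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
]


def strip_marks(text):
    """Decompose, drop every Mn/Mc/Me code point, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed
                   if unicodedata.category(c) not in MARK_CATEGORIES)
    return unicodedata.normalize("NFC", kept)


def in_diacritic_block(char):
    """Return True if char lies in one of the four blocks listed above."""
    code = ord(char)
    return any(low <= code <= high for low, high in DIACRITIC_BLOCKS)


def strip_diacritic_blocks(text):
    """Decompose, drop only code points from the four blocks, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if not in_diacritic_block(c))
    return unicodedata.normalize("NFC", kept)
```

For Latin text the two agree: both turn "\u010cibej" (Čibej) into "Cibej".
They differ for scripts whose marks live outside those blocks. For example,
U+093E (a Devanagari vowel sign, category Mc) is removed by strip_marks but
kept by strip_diacritic_blocks.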

Another question is about exposing this ability in the API. I'd say that we
could do something like this:

  svn_utf__transform(svn_boolean_t normalize,
                     svn_boolean_t casefold,
                     svn_boolean_t remove_diacritics)

  (or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)

Do you have an opinion or suggestions about that?
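
As a rough illustration of how the three flags could compose, here is a
sketch in Python using only the standard library. The function name
transform and the order of the steps are assumptions made for this example.
It is not the Subversion API and makes no claim about how utf8proc orders
these operations internally.

```python
import unicodedata


def transform(text, normalize=False, casefold=False, remove_diacritics=False):
    """Hypothetical stand-in for the proposed three-flag function."""
    if remove_diacritics:
        # Work on the decomposed form so that marks are separate code points.
        text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text
                       if unicodedata.category(c) not in ("Mn", "Mc", "Me"))
    if casefold:
        text = text.casefold()
    if normalize or remove_diacritics:
        # Recompose last, since the earlier steps can leave the string
        # in a non-normalized form.
        text = unicodedata.normalize("NFC", text)
    return text
```

With this shape, transform("\u010cibej", casefold=True,
remove_diacritics=True) yields "cibej", while passing only normalize=True
composes "e\u0301" into the single code point U+00E9.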

[1] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[2] http://www.unicode.org/charts/PDF/U0300.pdf
[3] http://www.unicode.org/charts/PDF/U1AB0.pdf
[4] http://www.unicode.org/charts/PDF/U1DC0.pdf
[5] http://www.unicode.org/charts/PDF/U20D0.pdf

Evgeny Kotkov
Received on 2016-02-24 14:30:50 CET

This is an archived mail posted to the Subversion Dev mailing list.