Note that Unicode 3.2 (where these links point to) is over 10 years old. The current version is 6.2... and things haven’t got easier with newer versions.
A different route would be to convert the characters from our UTF-8 to the native platform encoding (using our existing APIs for that) and then let the platform do the case folding for us before APR does the comparison/search.
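To make the "fold first, then match" idea concrete, here is a rough sketch in Python rather than C. It is only an illustration: Python's `str.casefold()` stands in for whatever folding the platform would provide, `fnmatch.fnmatchcase()` stands in for apr_fnmatch, and the function names `ascii_only_match` and `folded_match` are made up for this example.

```python
import fnmatch


def ascii_only_match(text, pattern):
    """Fold only A-Z before matching (hypothetical helper).

    This approximates the behaviour discussed in the thread: ASCII is
    matched case-insensitively, everything else case-sensitively.
    """
    def fold(s):
        return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

    return fnmatch.fnmatchcase(fold(text), fold(pattern))


def folded_match(text, pattern):
    """Case-fold both sides first, then do a case-sensitive glob match.

    str.casefold() is an assumption standing in for the platform's
    folding; it is locale-independent.
    """
    return fnmatch.fnmatchcase(text.casefold(), pattern.casefold())


message = "Fixed \u00c4NDERUNG in parser"   # contains U+00C4 (A with diaeresis)
pattern = "*\u00e4nderung*"                 # contains U+00E4 (a with diaeresis)

print(ascii_only_match(message, pattern))
print(folded_match(message, pattern))
```

Note that folding the pattern as a plain string is a simplification; a real implementation would have to take care with bracket expressions in the glob.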
From: Ivan Zhakov
Sent: Sunday, April 21, 2013 8:44 PM
To: Branko Čibej
On Sun, Apr 21, 2013 at 10:07 PM, Branko Čibej <brane_at_wandisco.com> wrote:
> On 21.04.2013 17:11, Ivan Zhakov wrote:
>> On Sun, Apr 21, 2013 at 4:48 PM, Branko Čibej <brane_at_wandisco.com> wrote:
>>> On 21.04.2013 14:05, Stefan Sperling wrote:
>>>> On Sun, Apr 21, 2013 at 01:53:43PM +0200, Bert Huijben wrote:
>>>>> I'd rather pull the case insensitive search part of this new in 1.8 search feature and do it right in 1.9.
>>>> What's the issue with the current implementation apart from the
>>>> test failures on Windows?
>>>> The behaviour of 'svn log --search' regarding case-sensitivity
>>>> isn't even documented, so we're not really promising anything.
>>>> It is possible that some users who are using languages other than
>>>> English will complain, since ASCII is being matched case-insensitively,
>>>> and all other characters are being matched case-sensitively.
>>>> But this is due to a missing feature in APR's implementation of fnmatch().
>>>> Provided we can fix the 1.8.x tests on Windows I see no reason to
>>>> change our implementation of log --search. We can simply wait for
>>>> APR to grow the necessary support for multibyte strings.
>>> The wc-collate-path branch has an svn_utf__glob function that's mainly
>>> intended for use by SQLite, however, it can be a replacement for
>>> apr_fnmatch. It uses apr_fnmatch internally, but decomposes the inputs
>>> to Unicode normalization form D, which keeps diacriticals separate from
>>> the base letters. In other words, we could easily extend that to do
>>> completely diacritical-agnostic case-folding matching for Latin
>>> alphabets (and probably also for Cyrillic scripts).
>>> The idea to manually hack things to work with western Latin alphabets
>>> seems completely wrong-headed to me.
>>> But yes; in general, case folding is locale-specific. If we wanted to
>>> support that, we'd need ICU instead of utf8proc. I can imagine that
>>> eventually being an option, but not a mandatory dependency.
>> According to the Unicode case folding data, only two characters
>> use locale-specific case folding.
> How on earth did you come to that conclusion?
> Yes, the obvious ones are German (ß == SS) equivalence and Turkic (i ==
> İ) and (ı == I) equivalences (and that's already three characters); but
> then in French, lowercase accented letters are equivalent to uppercase
> unaccented letters, whereas for example in Spanish that's not the case.
> And that's just looking at European and West Asian Latin scripts. There
> are at least 7 distinct Cyrillic scripts in roughly the same area that
> I'm aware of, and I certainly don't know the case-folding rules for all
> of them.
I've just read the Unicode specs, but I didn't read all of them :)
According to the link I provided, there are 4 types of characters
in terms of case folding:
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length.
#    Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead
#      of the normal mapping for these characters.
In CaseFolding-3.2.0.txt, only the 'T' characters need locale-dependent handling.
BUT there is another document that describes special case-folding
rules, which lists cases like ß == SS. I missed it. That's my fault.
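For illustration, the difference between these mapping types can be seen with a small Python sketch. It assumes Python's `str.casefold()` applies the default full folding (the C and F mappings) and never the T mappings; the `fold` helper and the `TURKIC` table are made up for this example and only cover the two T entries (U+0049 to U+0131, U+0130 to U+0069).

```python
# Hypothetical table holding the two 'T' mappings:
#   U+0049 (I)            -> U+0131 (dotless i)
#   U+0130 (dotted cap I) -> U+0069 (i)
TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})


def fold(s, turkic=False):
    """Case-fold s; optionally apply the Turkic 'T' mappings first."""
    if turkic:
        s = s.translate(TURKIC)
    # str.casefold() is assumed to use the default (C + F) mappings.
    return s.casefold()


# Full folding lets the string grow: sharp s folds to "ss".
print(fold("Stra\u00dfe") == fold("STRASSE"))

# Plain lowercasing does not treat these two as equal.
print("Stra\u00dfe".lower() == "STRASSE".lower())

# Default folding versus Turkic folding of uppercase I.
print(fold("I") == "i")
print(fold("I", turkic=True) == "\u0131")
```

The point is just that the Turkic behaviour has to be requested explicitly by the caller; nothing in the default folding can guess it from the text.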
CTO | VisualSVN | http://www.visualsvn.com
Received on 2013-04-21 20:58:17 CEST