On 21.04.2013 17:11, Ivan Zhakov wrote:
> On Sun, Apr 21, 2013 at 4:48 PM, Branko Čibej <brane_at_wandisco.com> wrote:
>> On 21.04.2013 14:05, Stefan Sperling wrote:
>>> On Sun, Apr 21, 2013 at 01:53:43PM +0200, Bert Huijben wrote:
>>>> I'd rather pull the case insensitive search part of this new in 1.8 search feature and do it right in 1.9.
>>> What's the issue with the current implementation apart from the
>>> test failures on Windows?
>>>
>>> The behaviour of 'svn log --search' regarding case-sensitivity
>>> isn't even documented, so we're not really promising anything.
>>>
>>> It is possible that some users who are using languages other than
>>> English will complain, since ASCII is being matched case-insensitively,
>>> and all other characters are being matched case-sensitively.
>>> But this is due to a missing feature in APR's implementation of fnmatch().
>>>
>>> Provided we can fix the 1.8.x tests on Windows I see no reason to
>>> change our implementation of log --search. We can simply wait for
>>> APR to grow the necessary support for multibyte strings.
>> The wc-collate-path branch has an svn_utf__glob function that's mainly
>> intended for use by SQLite, however, it can be a replacement for
>> apr_fnmatch. It uses apr_fnmatch internally, but decomposes the inputs
>> to Unicode normalization form D, which keeps diacriticals separate from
>> the base letters. In other words, we could easily extend that to do
>> completely diacritical-agnostic case-folding matching for Latin
>> alphabets (and probably also for Cyrillic scripts).
>>
>> The idea to manually hack things to work with western Latin alphabets
>> seems completely wrong-headed to me.
>>
>> But yes; in general, case folding is locale-specific. If we wanted to
>> support that, we'd need ICU instead of utf8proc. I can imagine that
>> eventually being an option, but not a mandatory dependency.
>>
> According to the Unicode case folding data [1], only two characters
> use locale-specific case folding.
How on earth did you come to that conclusion?
Yes, the obvious ones are the German (ß == SS) equivalence and the Turkic
(i == İ) and (ı == I) equivalences (and that's already three characters); but
then in French, lowercase accented letters are equivalent to uppercase
unaccented letters, whereas for example in Spanish that's not the case.
And that's just looking at European and West Asian Latin scripts. There
are at least 7 distinct Cyrillic scripts in roughly the same area that
I'm aware of, and I certainly don't know the case-folding rules for all
of them.
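
To make the normalization idea from my earlier mail concrete, here is a
rough sketch in Python rather than C. It is not the svn_utf__glob code
from the wc-collate-path branch; fold() and glob_match() are
hypothetical names, and Python's fnmatch stands in for apr_fnmatch. It
decomposes both inputs to NFD, drops the combining marks, applies the
default (locale-independent) Unicode case folding, and only then globs:

```python
import fnmatch
import unicodedata


def fold(text: str) -> str:
    """Decompose to NFD, drop combining marks, then case-fold.

    Hypothetical helper: uses the default Unicode case folding,
    which knows nothing about the user's locale.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def glob_match(pattern: str, text: str) -> bool:
    """Hypothetical stand-in for a diacritical-agnostic, case-folding glob.

    Both sides are folded first; the glob itself is then an ordinary
    case-sensitive match on the folded strings.
    """
    return fnmatch.fnmatchcase(fold(text), fold(pattern))


if __name__ == "__main__":
    # Diacriticals and case are both ignored.
    print(glob_match("*ČIBEJ*", "Branko čibej"))
    # Default folding treats sharp s and SS as equivalent.
    print(fold("Straße") == fold("STRASSE"))
    # The Turkic dotless i is NOT folded to I without locale tailoring.
    print(fold("ı") == fold("I"))
```

The last check is the point: a locale-independent fold cannot apply the
Turkic rules, so anything along these lines only approximates what
users of a particular language expect.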
-- Brane
--
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com
Received on 2013-04-21 20:08:19 CEST