Re: log --search test failures on trunk and 1.8.x

From: Bert Huijben <bert_at_qqmail.nl>
Date: Sun, 21 Apr 2013 18:50:16 +0000

Note that Unicode 3.2 (where these links point to) is over 10 years old. The current version is 6.2... and things haven’t got easier with newer versions.

A different route would be to convert the characters from our UTF-8 to the native platform encoding (using our existing APIs for that) and then let the platform do the case folding for us before APR does the comparison/search.

Bert


From: Ivan Zhakov
Sent: Sunday, April 21, 2013 8:44 PM
To: Branko Čibej
Cc: dev_at_subversion.apache.org

On Sun, Apr 21, 2013 at 10:07 PM, Branko Čibej <brane_at_wandisco.com> wrote:
> On 21.04.2013 17:11, Ivan Zhakov wrote:
>> On Sun, Apr 21, 2013 at 4:48 PM, Branko Čibej <brane_at_wandisco.com> wrote:
>>> On 21.04.2013 14:05, Stefan Sperling wrote:
>>>> On Sun, Apr 21, 2013 at 01:53:43PM +0200, Bert Huijben wrote:
>>>>> I'd rather pull the case insensitive search part of this new in 1.8 search feature and do it right in 1.9.
>>>> What's the issue with the current implementation apart from the
>>>> test failures on Windows?
>>>>
>>>> The behaviour of 'svn log --search' regarding case-sensitivity
>>>> isn't even documented, so we're not really promising anything.
>>>>
>>>> It is possible that some users who are using languages other than
>>>> English will complain, since ASCII is being matched case-insensitively,
>>>> and all other characters are being matched case-sensitively.
>>>> But this is due to a missing feature in APR's implementation of fnmatch().
>>>>
>>>> Provided we can fix the 1.8.x tests on Windows I see no reason to
>>>> change our implementation of log --search. We can simply wait for
>>>> APR to grow the necessary support for multibyte strings.
>>> The wc-collate-path branch has an svn_utf__glob function that's mainly
>>> intended for use by SQLite, however, it can be a replacement for
>>> apr_fnmatch. It uses apr_fnmatch internally, but decomposes the inputs
>>> to Unicode normalization form D, which keeps diacriticals separate from
>>> the base letters. In other words, we could easily extend that to do
>>> completely diacritical-agnostic case-folding matching for Latin
>>> alphabets (and probably also for Cyrillic scripts).
>>>
>>> The idea to manually hack things to work with western Latin alphabets
>>> seems completely wrong-headed to me.
>>>
>>> But yes; in general, case folding is locale-specific. If we wanted to
>>> support that, we'd need ICU instead of utf8proc. I can imagine that
>>> eventually being an option, but not a mandatory dependency.
>>>
>> According to the Unicode case folding data [1], only two characters
>> use locale-specific case folding.
>
> How on earth did you come to that conclusion?
>
> Yes, the obvious ones are German (ß == SS) equivalence and Turkic (i ==
> İ) and (ı == I) equivalences (and that's already three characters); but
> then in French, lowercase accented letters are equivalent to uppercase
> unaccented letters, whereas for example in Spanish that's not the case.
> And that's just looking at European and West Asian Latin scripts. There
> are at least 7 distinct Cyrillic scripts in roughly the same area that
> I'm aware of, and I certainly don't know the case-folding rules for all
> of them.
>
I've just read Unicode specs, but I didn't read all of them :)

According to the link I provided [1] there are 4 types of characters
in terms of case folding:
[[[
# The status field is:
# C: common case folding, common mappings shared by both simple and
#    full mappings.
# F: full case folding, mappings that cause strings to grow in length.
#    Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead
#      of the normal mapping for these characters.
]]]
In CaseFolding-3.2.0.txt, only the 'T' characters need locale-dependent handling.
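
To make the status field concrete, here is a minimal Python sketch that groups mappings by status. The sample lines follow the CaseFolding.txt layout (code; status; mapping; # name) but are only illustrative and should be checked against the file itself. The names SAMPLE and parse_case_folding are made up for this example.

```python
from collections import defaultdict

# Illustrative lines in the CaseFolding.txt layout (assumption: verify
# each entry against the real data file before relying on it).
SAMPLE = """\
# <code>; <status>; <mapping>; # <name>
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def parse_case_folding(text):
    """Return {status: {source_char: folded_string}} for the given text."""
    by_status = defaultdict(dict)
    for line in text.splitlines():
        # Drop the trailing comment (and whole-line comments).
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        source = chr(int(code, 16))
        # A mapping may hold several space-separated code points (status F).
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        by_status[status][source] = target
    return by_status


folds = parse_case_folding(SAMPLE)
for status in sorted(folds):
    for source, target in folds[status].items():
        print(status, f"U+{ord(source):04X}", "->",
              " ".join(f"U+{ord(c):04X}" for c in target))
```

A locale-unaware matcher would apply the C and F (or C and S) entries and skip T; only a caller that knows the text is Turkic would apply T in place of the normal mapping for those two characters.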

BUT there is another document that describes special case-folding
rules [2], which lists cases like ß == SS. I missed it. That's my fault.
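
As a quick illustration of why one-to-one lowercasing is not enough for cases like ß == SS, here is a small Python sketch. It relies on Python's built-in str.casefold(), which applies the default (locale-independent) full case folding. The helper name caseless_equal is hypothetical, and none of this is what svn or APR do today.

```python
def caseless_equal(a, b):
    """Compare two strings using default full case folding (no locale)."""
    return a.casefold() == b.casefold()


# Plain lowercasing keeps ß as a single character, so the strings differ.
assert "STRASSE".lower() != "straße".lower()

# Full case folding expands ß to "ss", so the strings match.
assert caseless_equal("STRASSE", "straße")

# The default folding is not Turkic-aware: I folds to i, never to
# dotless ı, and dotted İ folds to i followed by a combining dot above.
assert "I".casefold() == "i"
assert not caseless_equal("I", "\u0131")
assert "\u0130".casefold() == "i\u0307"
```

The Turkic lines show the remaining locale problem: a Turkish user would expect I and ı to match, which the default folding cannot provide without knowing the language of the text.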

[1] http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt
[2] http://www.unicode.org/Public/3.2-Update/SpecialCasing-3.2.0.txt

--
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com
Received on 2013-04-21 20:58:17 CEST
