On 21 Apr 2013, at 20:07, Branko Čibej wrote:
> Yes, the obvious ones are German (ß == SS) equivalence and Turkic (i ==
> İ) and (ı == I) equivalences (and that's already three characters);
> then in French, lowercase accented letters are equivalent to uppercase
> unaccented letters, whereas for example in Spanish that's not the case.
> And that's just looking at European and West Asian Latin scripts. There
> are at least 7 distinct Cyrillic scripts in roughly the same area that
> I'm aware of, and I certainly don't know the case-folding rules for all
> of them.
Not only is the above true, one should also be careful to distinguish
case conversion from case-insensitive matching; these follow different rules.
For instance, converting lower-case letters to upper case in French
will retain the accents (most of the time - this is locale-dependent),
but they are generally expected to be ignored when searching. By
contrast, it would be an error to match "a" with "ä" in Swedish when
searching, or to drop the dots in a case conversion.
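
To make the distinction concrete, here is a minimal Python sketch, not a
proposal for any actual implementation. The helper names strip_accents and
search_key are hypothetical, and the ignore_accents flag stands in for a
real locale lookup. Dropping combining marks after decomposition is only a
crude approximation of accent-insensitive matching; a real implementation
would more likely use locale-aware collation.

```python
import unicodedata


def strip_accents(text):
    """Decompose the text and drop combining marks (crude approximation)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(text, ignore_accents):
    """Build a key for case-insensitive matching.

    ignore_accents is a stand-in for a locale decision: roughly True for
    a French-style search, False for a Swedish-style one.
    """
    key = unicodedata.normalize("NFC", text).casefold()
    return strip_accents(key) if ignore_accents else key


# Case conversion: the accents are retained.
french_upper = "été".upper()

# Case-insensitive matching, French style: accents are ignored.
french_match = search_key("ÉTÉ", True) == search_key("ete", True)

# Case-insensitive matching, Swedish style: "a" and "ä" stay distinct.
swedish_match = search_key("a", False) == search_key("ä", False)
```

Note that the conversion and the matching key are computed by different
functions; using upper() or lower() as a matching key would conflate the two.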
Clearly a case- and accent-sensitive search is much easier to
implement, but would benefit from normalisation. Bytewise matching is
on the lowest rung.
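
As an illustration of why normalisation helps even a case- and
accent-sensitive search, the following Python sketch compares two encodings
of the same letter. The function names bytewise_equal and normalised_equal
are made up for this example, and the choice of NFC over NFD is an
assumption; either works as long as both sides use the same form.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single precomposed code point
decomposed = "a\u0308"   # "a" followed by a combining diaeresis


def bytewise_equal(a, b):
    """Compare the raw UTF-8 encodings, with no normalisation."""
    return a.encode("utf-8") == b.encode("utf-8")


def normalised_equal(a, b):
    """Compare after bringing both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here bytewise_equal(composed, decomposed) is false although both strings
render as the same letter, while normalised_equal treats them as equal and
still keeps "a" and "ä" apart.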
Received on 2013-04-21 22:19:15 CEST