On 23.04.2013 14:51, Stefan Sperling wrote:
> On Tue, Apr 23, 2013 at 02:27:08PM +0200, Branko Čibej wrote:
>> You're missing the point. tolower() works on individual characters, not
>> whole strings; so it in general /cannot/ do correct locale-specific
> Do you really mean characters, or bytes?
> It sounds like you mean bytes. tolower() works on individual bytes.
It *does not matter* whether it's bytes or characters, it still cannot
do correct locale-specific lowercasing.
>> Trying to retrofit anything less
>> smart onto apr_fnmatch will not work correctly.
> That depends on whether an fnmatch implementation is willing to live
> with the limitations of the locale mechanism (one opaque charset
> supported, any charset not in the current locale can give errors).
For the case we're considering, we don't care about conversion from
UTF-8, since we require log messages to be in UTF-8 anyway.
> It seems that some people do think fnmatch() should do it this way:
> (Caution: This implementation has the out-of-bounds recursion bug
> which made Bill rewrite fnmatch for APR...)
> Subversion already assumes it can convert strings from UTF-8 to the
> locale's character set for output. We could also assume that we can
> convert log messages from UTF-8 to the current locale charset, and
> write something that performs case-insensitive matching with wchar_t.
> However, that's clearly out of scope for 1.8 as well :)
Let me say again: comparing single characters is not correct case
folding. German is a good example of why that doesn't work: it does not
just have the ß == SS equivalence; for case-insensitive search, I'd also
expect ö == OE/oe == Ö etc. to be equivalent.
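
To make the ß case concrete, here is a minimal Python sketch. The helper
names per_char_equal and folded_equal are made up for illustration; this
is not Subversion or APR code. It contrasts a character-by-character
comparison, which is roughly what a tolower() loop does, with Unicode
full case folding applied to the whole string. Note that default case
folding only covers ß == SS; an equivalence like ö == OE is
language-specific and would need collation or tailoring on top of it.

```python
def per_char_equal(a: str, b: str) -> bool:
    """Compare one character at a time, as a tolower()-style loop would."""
    if len(a) != len(b):
        # A 1:1 character mapping can never match strings of different length.
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))


def folded_equal(a: str, b: str) -> bool:
    """Compare after Unicode full case folding of the whole string."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(per_char_equal("Straße", "STRASSE"))  # single-character comparison
    print(folded_equal("Straße", "STRASSE"))    # whole-string case folding
    print(folded_equal("ö", "OE"))              # not covered by default folding
```

The first comparison fails because ß folds to two characters, which a
per-character mapping cannot express; the second succeeds; the third
fails because the umlaut equivalence is not part of default folding.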
If you consider all this, the easiest approach by far might be to simply
add a Lucene index of all log messages to the server, then you can add
any number of bells and whistles including language-specific stemming.
I'd consider that a better solution than any homegrown full-text search
facility; these are never easy.
Director of Subversion | WANdisco | www.wandisco.com
Received on 2013-04-24 07:35:30 CEST