Re: log --search test failures on trunk and 1.8.x

From: Branko Čibej <brane_at_wandisco.com>
Date: Wed, 24 Apr 2013 07:34:51 +0200

On 23.04.2013 14:51, Stefan Sperling wrote:
> On Tue, Apr 23, 2013 at 02:27:08PM +0200, Branko Čibej wrote:
>> You're missing the point. tolower() works on individual characters, not
>> whole strings; so it in general /cannot/ do correct locale-specific
> Do you really mean characters, or bytes?
> It sounds like you mean bytes. tolower() works on individual bytes.

It *does not matter* whether it's bytes or characters, it still cannot
do correct locale-specific lowercasing.


>> Trying to retrofit anything less
>> smart onto apr_fnmatch will not work correctly.
> That depends on whether an fnmatch implementation is willing to live
> with the limitations of the locale mechanism (one opaque charset
> supported, any charset not in the current locale can give errors).

For the case we're considering, we don't care about conversion from
UTF-8, since we require log messages to be in UTF-8 anyway.

> It seems that some people do think fnmatch() should do it this way:
> http://opensource.apple.com/source/Libc/Libc-583/gen/FreeBSD/fnmatch.c
> (Caution: This implementation has the out-of-bounds recursion bug
> which made Bill rewrite fnmatch for APR...)
> Subversion already assumes it can convert strings from UTF-8 to the
> locale's character set for output. We could also assume that we can
> convert log messages from UTF-8 to the current locale charset, and
> write something that performs case-insensitive matching with wchar_t.
> However, that's clearly out of scope for 1.8 as well :)

Let me say again: comparing single characters is not correct case
folding. German is a good example of why that doesn't work: it does not
just have the ß == SS equivalence; for case-insensitive search, I'd also
expect ö == OE/oe == Ö etc. to be equivalent.
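
To make that concrete, here is a minimal Python sketch (not Subversion or
APR code; the helper names per_char_lower, naive_equal and folded_equal
are made up for this illustration). It contrasts a tolower()-style
comparison, which maps one character at a time, with Unicode full case
folding as provided by Python's str.casefold(). It assumes both strings
are already decoded Unicode text.

```python
def per_char_lower(s: str) -> str:
    """Lowercase one character at a time, in the spirit of tolower().

    A mapping is used only if it yields exactly one character, because
    a per-character API has no way to return a longer result.
    """
    out = []
    for ch in s:
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


def naive_equal(a: str, b: str) -> bool:
    # Case-insensitive comparison built on per-character lowercasing.
    return per_char_lower(a) == per_char_lower(b)


def folded_equal(a: str, b: str) -> bool:
    # Case-insensitive comparison using Unicode full case folding,
    # which may map one character to several (for example ß to ss).
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(naive_equal("STRASSE", "straße"))   # per-character: no match
    print(folded_equal("STRASSE", "straße"))  # full folding: match
    print(folded_equal("Öl", "OEL"))          # ö == OE needs more than folding
```

Even full case folding is locale-independent, so it does not cover the
ö == OE/oe case; that needs language-specific rules on top.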

If you consider all this, the easiest approach by far might be to simply
add a Lucene index of all log messages to the server; then you can add
any number of bells and whistles, including language-specific stemming.
I'd consider that a better solution than any homegrown full-text search
facility; these are never easy.

-- Brane

Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com
Received on 2013-04-24 07:35:30 CEST

This is an archived mail posted to the Subversion Dev mailing list.