On Mon, Apr 22, 2013 at 01:13:43PM +0200, Branko Čibej wrote:
> On 22.04.2013 12:59, Bert Huijben wrote:
> > The assertion shows a design problem which we should handle for future compatibility and you suggest just adding some bandages to patch/hide the test failure?
> >
> > The current code is broken and the suggestion you do is like the solution mostly vetoed by most of the responders in this thread: assuming there is only us-english, by using a function that has platform specific behavior.
> >
> > (tolower() is locale and platform character encoding dependent. You should never just pass individual UTF-8 bytes to it)
OK Bert, I can see how, for example, a tolower() implementation which runs
in a latin1 locale could convert parts of a UTF-8 string which contains
bytes that are part of a multibyte character, if such bytes happen to have
the same value as some upper case letter from the latin1 symbol range [128-255].
Users could be running --search for one or more such upper-case characters
and get results that match at the byte-level (after tolower() conversion)
but don't make any sense because the characters matched are entirely different.
E.g. they get could get some log message which contains only Chinese
characters when they run 'svn log --search Ö', in case the byte used
for Ö matches part of the UTF-8 representation of a Chinese character.
I can see that this kind of behaviour could be confusing, and that it
is "undefined" in the sense that it is locale-dependent. But I find
your notion that it is entirely buggy somewhat exaggerating, given that
apr_fnmatch() makes no claims to support multibyte characters in the
first place. So please provide an example for the kinds of bugs you
are thinking of -- are they any worse than my example?
In case you're making an argument based on the assumption that tolower()
could try to do something intelligent with UTF-8 characters, let's agree
that tolower() cannot use multibyte characters. It only supports single byte
character sets where symbols are taken from the range [0-255], and it also
supports the special symbol EOF (represented by a -1 in some implementations)
which is why it accepts an 'int'. There is a towlower() function which accepts
'wchar_t' instead of 'int' and could thus be used to support multibyte
characters. But apr_fnmatch() doesn't use towlower().
Perhaps fnmatch() simply isn't a good interface to use for the purposes
of log --search. But it is all that APR provides for us.
> > But then please just hardcode converting 'a-z' to 'A-Z' for the log messages instead of pretending tolower() handles everything else for you. This has strictly defined behavior and shows that we need a better implementation later. While using tolower() with a hack around it just ships undefined behavior.
>
Hard-coding a conversion from a-z to A-Z before passing strings
to apr_fnmatch() would indeed provide all functionality we can expect
from apr_fnmatch() with the CASE_BLIND flag set.
But this would really be a workaround to avoid the CASE_BLIND flag.
If we're going to say that CASE_BLIND should not be used, I'd like
to point out that the autoprops code is also using it and could thus
suffer similar problems (e.g. wrongly apply properties to files with
certain multibyte-character names in certain locales).
Do we need to do anything there?
> I tend to agree. As it is, it would be better to make --search
> case-sensitive, or remove it from 1.8, than to tell people it's
> case-insensitive but just happens not to work as advertised.
Branko,
we haven't documented how 'svn log --search' behaves in this regard.
As far as I'm concerned its behaviour can change between releases.
It's a feature of the command line client, not the API, and it's
meant to be a convenient function for humans, not scripts.
If 1.8 matches English characters in a case-insensitive way, and characters
from other languages in a case-sensitive way, and we fix that in 1.9 or
later e.g. by switching away from fnmatch() to something better, then
I don't see any problem.
If this were implemented as an API, I'd have more concerns, of course.
Received on 2013-04-23 08:54:14 CEST