
Re: UTF-8

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-05-23 18:28:55 CEST

Greg Hudson <ghudson@MIT.EDU> writes:

> By itself, no. But I think it's reasonable for application developers
> not to incur the hair of charset conversions when there is a superior
> approach available to the world.

To you, it may be superior. To us who _actually use_ non-ASCII
characters, the prospect of replacing all our tools does not have a
particularly "superior" ring. If anyone wants to take the step, the
support will be there. Those who don't, will be supported as well.

> > If you want to use UTF-8 under Unix, all you have to do is select a
> > UTF-8 locale and the conversions will be identity conversions.
>
> I note that wasn't the default you chose.

The default locale is chosen by the OS. You can change it by
modifying your login files, or by selecting a locale on the login
screen in systems like CDE. This is nothing Subversion-specific.
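
To illustrate, here is a minimal sketch in C of how a program could find
out whether the locale the user selected makes the conversion an identity
conversion. The helper names codeset_is_utf8 and native_charset_is_utf8
are made up for this example and are not Subversion functions; setlocale()
and nl_langinfo(CODESET) are the standard POSIX calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Hypothetical helper: does this codeset name denote UTF-8?
   Platforms spell the name differently ("UTF-8", "utf8", ...). */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: adopt the locale from the user's environment and
   report whether UTF-8 <-> native charset conversion would be a no-op. */
int native_charset_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With a UTF-8 locale selected, native_charset_is_utf8() returns 1 and
nothing needs converting; with an ISO-8859-1 locale it returns 0.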

> The "advantage" I was talking about was not having to do character set
> conversion in applications, not any particular advantage to the user.
> (Although the user also benefits indirectly from having a single
> character set for all languages.)

The advantage of allowing the programmers to be lazy is rather
insignificant compared to the disadvantage of imposing undesired
charsets on the users.

> It's true, anything which shows up in XML and isn't encoded will be
> checked for UTF-8 validity by expat. As far as I know, the user-visible
> objects which meet this criterion are filenames and property names.
> Property values and file data only show up in XML after being
> base64-encoded, as far as I am aware.

That's good. But unless filenames and property names are known to be
UTF-8, they too would have to be base64-encoded.
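
For reference, a well-formedness check of the kind a parser applies can be
sketched as below. This is only an illustration: is_valid_utf8 is a
hypothetical helper, not expat's or Subversion's code, and the exact rules
expat enforces may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the len bytes at s are well-formed
   UTF-8, 0 otherwise. Rejects truncated sequences, overlong encodings,
   surrogates, and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */
        size_t k;

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation or invalid lead byte */

        if (len - i <= need) return 0;          /* sequence is truncated */
        for (k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        if (cp > 0x10FFFF) return 0;                  /* out of range */
        i += need + 1;
    }
    return 1;
}
```

A filename in ISO-8859-1 such as the single byte 0xE5 fails this check,
which is why such names would have to be converted or encoded before they
go into the XML.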

> I'm guessing property names aren't a big issue. If the issue is mainly
> filenames, then it might be okay to handle conversion, if you do so by
> wrapping an svn_file_open() around apr_file_open(). Don't do it by
> adding a conversion step before every file open call.

Yes, adding more wrapper functions to apr calls would probably be a
good idea. There are already some in io.c.
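
The shape of such a wrapper might look like the sketch below. Everything
in it is an assumption made for illustration: file_open_utf8 and
utf8_to_latin1 are invented names, plain fopen() stands in for
apr_file_open() so that the example compiles without APR, and ISO-8859-1
stands in for whatever the native charset happens to be.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical converter: UTF-8 -> ISO-8859-1, standing in for
   "UTF-8 -> native charset". Returns a malloc'd string, or NULL if the
   input is malformed, has a character outside Latin-1, or malloc fails. */
char *utf8_to_latin1(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    char *out, *p;

    assert(utf8 != NULL);
    out = malloc(strlen(utf8) + 1);  /* output is never longer than input */
    if (out == NULL) return NULL;
    p = out;
    while (*s) {
        if (*s < 0x80) {
            *p++ = (char)*s++;
        } else if ((*s == 0xC2 || *s == 0xC3) && (s[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF is always a two-byte sequence */
            *p++ = (char)(((s[0] & 0x03) << 6) | (s[1] & 0x3F));
            s += 2;
        } else {
            free(out);
            return NULL;
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical wrapper: callers always pass a UTF-8 path, and the
   conversion to the native charset happens in exactly one place. */
FILE *file_open_utf8(const char *path_utf8, const char *mode)
{
    FILE *f;
    char *native = utf8_to_latin1(path_utf8);

    if (native == NULL) return NULL;
    f = fopen(native, mode);
    free(native);
    return f;
}
```

The point of the design is that no caller needs to remember a conversion
step: code above the wrapper only ever sees UTF-8.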

> Does this apply to the libraries? A library function which prints to
> stdout/stderr is useless for GUI programs, so we generally try to avoid
> that. And I don't think we do much in the way of logging.

Well, you have svn_handle_error, for example. And svn_client_diff.
Functions that the client calls explicitly to have stuff printed on a
stream.

> Anyway, messages presumably need to be localized, not just
> charset-converted. (If the message contains a filename, the filename
> might need to be charset-converted.) It doesn't add much value to
> charset-convert them, other than perhaps filenames, without doing
> localization as well.

Localized error messages would be nice, but are not as important. The
main thing here is not having the error message say

  svn_wc_merge: `Ã¥iÃ¥aäeö' not under revision control

when the file is actually called `åiåaäeö'. Since the string has to
be converted anyway, it makes a cleaner design to keep it as UTF-8 in
the error struct, and convert it on output instead.
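
The garbling shown above is what UTF-8 bytes look like when they are
displayed as ISO-8859-1: the two bytes that encode `å' (0xC3 0xA5) each
become a character of their own. A small sketch that reproduces the
effect, assuming an ISO-8859-1 display; latin1_to_utf8 is a hypothetical
helper written for this example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: treat every input byte as an ISO-8859-1 character
   and encode it as UTF-8. Returns a malloc'd string, or NULL if malloc
   fails. */
char *latin1_to_utf8(const char *latin1)
{
    const unsigned char *s = (const unsigned char *)latin1;
    char *out, *p;

    assert(latin1 != NULL);
    out = malloc(2 * strlen(latin1) + 1);  /* each byte grows to at most 2 */
    if (out == NULL) return NULL;
    p = out;
    for (; *s; s++) {
        if (*s < 0x80) {
            *p++ = (char)*s;
        } else {
            *p++ = (char)(0xC0 | (*s >> 6));
            *p++ = (char)(0x80 | (*s & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

Feeding it the UTF-8 bytes of `å' yields the UTF-8 encoding of the two
characters `Ã¥', which is exactly what the user ends up reading when the
message is written out without conversion.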

> > · Name service calls such as getpwnam need conversion
> > · Command line arguments passed to exec need conversion
>
> I don't see a reason why we should be mucking around in any way with
> usernames and command-line arguments provided by the operating system or
> user.

Not all arguments are provided by the OS or user. When running diff,
for example, the labels are provided by Subversion. The arguments
which _are_ provided by the user are converted to UTF-8 and then back
again simply for symmetry. It's _much_ easier to specify that
all strings entering the libsvn layer have to be UTF-8 encoded than
to try to keep track of which ones are and which ones are not.

  // Marcus

Received on Thu May 23 18:33:45 2002

This is an archived mail posted to the Subversion Dev mailing list.
