
Re: [RFC/PATCH] commit messages not 8-bit compatible

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-05-30 15:15:35 CEST

Greg Hudson <ghudson@MIT.EDU> writes:

> Every time we open a file, we have to convert the pathname from UTF-8 to
> the local character encoding. In an ideal world APR might take care of
> this for us, but it doesn't. (Fortunately, we can just wrap
> apr_file_open() with our own function.)

Yeah, almost all of the conversions inside the libraries are shuffled
into io.c, which is basically an extension to APR anyway. In fact,
changing the code to use the new wrappers instead of calling APR
directly actually makes it neater in many cases, since the callers no
longer have to fiddle around with converting apr_status_t to
svn_error_t*.
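
To make the shape of such a wrapper concrete, here is a minimal sketch.
It is not the code from io.c: every name in it (sketch_error_t,
sketch_utf8_to_latin1, sketch_file_open) is hypothetical, the local
encoding is simply assumed to be ISO-8859-1, and plain fopen() stands in
for apr_file_open() so that it builds with nothing but the C library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for svn_error_t: a NULL pointer means success. */
typedef struct sketch_error_t {
    const char *message;
} sketch_error_t;

static sketch_error_t err_convert = { "path is not convertible to the local encoding" };
static sketch_error_t err_nomem   = { "out of memory" };
static sketch_error_t err_open    = { "cannot open file" };

/* Convert UTF-8 to ISO-8859-1 (the assumed local encoding).
   On success *out is a malloc'd, NUL-terminated string the caller frees. */
static sketch_error_t *
sketch_utf8_to_latin1(char **out, const char *utf8)
{
    const unsigned char *src = (const unsigned char *)utf8;
    char *buf, *dst;

    assert(out != NULL && utf8 != NULL);
    *out = NULL;
    buf = malloc(strlen(utf8) + 1);  /* output is never longer than input */
    if (buf == NULL)
        return &err_nomem;
    dst = buf;

    while (*src != '\0') {
        if (*src < 0x80) {
            *dst++ = (char)*src++;               /* plain ASCII */
        } else if ((*src == 0xC2 || *src == 0xC3)
                   && (src[1] & 0xC0) == 0x80) {
            /* two-byte sequence for U+0080..U+00FF */
            *dst++ = (char)(((*src & 0x03) << 6) | (src[1] & 0x3F));
            src += 2;
        } else {
            free(buf);                           /* malformed or > U+00FF */
            return &err_convert;
        }
    }
    *dst = '\0';
    *out = buf;
    return NULL;
}

/* The wrapper: callers pass a UTF-8 path and get an error object back,
   so they never see the conversion or a raw status code. */
static sketch_error_t *
sketch_file_open(FILE **file, const char *utf8_path, const char *mode)
{
    char *local_path;
    sketch_error_t *err = sketch_utf8_to_latin1(&local_path, utf8_path);

    *file = NULL;
    if (err != NULL)
        return err;
    *file = fopen(local_path, mode);
    free(local_path);
    return *file != NULL ? NULL : &err_open;
}
```

A real version would pick the target encoding from the locale at run
time rather than hard-coding one; the point here is only that the
conversion and the error translation both live in one place.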

> Every time we display a message, we have to convert it. Again, APR
> might conceivably take care of this for us, but it doesn't.

I made a wrapper for apr_file_printf (it doesn't actually call
apr_file_printf, but it calls apr_pvsprintf just like apr_file_printf
would have) that handles this, so again, this is fairly transparent.
There is still some fudge in svn_handle_{error,warning} though.
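
Roughly, the printf wrapper amounts to "format first, convert the
result, then write". The following is a sketch under assumptions, not
the actual patch: the names are hypothetical, vsnprintf() plays the part
that apr_pvsprintf plays in the real code, the local encoding is assumed
to be ISO-8859-1, and replacing unconvertible characters with '?' is my
own choice for the example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert a UTF-8 string to ISO-8859-1 in place. The result is never
   longer than the input. Anything outside U+0000..U+00FF, and any
   malformed sequence, becomes a single '?'. */
static void
sketch_utf8_to_latin1_inplace(char *s)
{
    unsigned char *src = (unsigned char *)s;
    unsigned char *dst = src;

    while (*src != '\0') {
        if (*src < 0x80) {
            *dst++ = *src++;
        } else if ((*src == 0xC2 || *src == 0xC3)
                   && (src[1] & 0xC0) == 0x80) {
            *dst++ = (unsigned char)(((*src & 0x03) << 6) | (src[1] & 0x3F));
            src += 2;
        } else {
            src++;                            /* skip the offending byte */
            while ((*src & 0xC0) == 0x80)     /* and its continuation bytes */
                src++;
            *dst++ = '?';
        }
    }
    *dst = '\0';
}

/* printf-style output where the format and all %s arguments are UTF-8.
   Returns 0 on success, -1 on failure. */
static int
sketch_file_printf(FILE *stream, const char *utf8_format, ...)
{
    va_list ap;
    int len, result;
    char *buf;

    assert(stream != NULL && utf8_format != NULL);

    va_start(ap, utf8_format);
    len = vsnprintf(NULL, 0, utf8_format, ap);   /* measure only */
    va_end(ap);
    if (len < 0)
        return -1;

    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return -1;

    va_start(ap, utf8_format);
    vsnprintf(buf, (size_t)len + 1, utf8_format, ap);
    va_end(ap);

    sketch_utf8_to_latin1_inplace(buf);          /* convert the whole message */
    result = (fputs(buf, stream) == EOF) ? -1 : 0;
    free(buf);
    return result;
}
```

Converting after formatting means the arguments get converted along
with the format string, which is why callers do not have to care.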

> When we prompt the user for a log message via $EDITOR, what we get back
> is in the local character encoding. Hard to imagine APR taking care of
> this.

This is done by the client, not by the libraries.

> There are more interactions as well. The libraries interact not just
> with the client, but with the operating system.

And with BerkeleyDB. Apart from the file/dir access stuff, the impact
is pretty limited though. The situation could have been much worse if
an attempt had been made to have some strings in UTF-8 and some not.

> There are certainly advantages to the UTF-8 approach, but "avoiding
> character set nonsense" in the libraries is not one of them.

To some extent. If we agree that things like path names and log
messages are sequences of characters rather than sequences of bits
(done any commits on the file 0110110001101110011001110010111001100011
lately? :) to the user, and that Subversion should support this view,
then it does avoid a lot of nonsense to use a uniform character
encoding internally. The alternative would be to keep track of the
actual encoding of strings either by affixing some meta information to
the string itself, or by policy (_this_ particular string will always
be US-ASCII, and _that_ string will be ISO-2022, kind of thing).
Otherwise the interpretation is lost.
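
To illustrate what the "meta information" alternative costs, here is a
small sketch. The types and functions are made up for this example and
exist nowhere in Subversion or APR. Even something as trivial as
counting characters has to dispatch on the tag, and a wrong tag silently
gives a wrong answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tag carried along with every string. */
typedef enum {
    SKETCH_ENC_US_ASCII,
    SKETCH_ENC_LATIN1,
    SKETCH_ENC_UTF8
} sketch_encoding_t;

typedef struct {
    sketch_encoding_t encoding;
    const char *bytes;
} sketch_tagged_string_t;

/* Count code points in a UTF-8 string: every byte that is not a
   continuation byte (10xxxxxx) starts a new one. */
static size_t
sketch_utf8_char_count(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;

    assert(utf8 != NULL);
    for (; *p != '\0'; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}

/* The tagged variant: every consumer needs a switch like this one. */
static size_t
sketch_tagged_char_count(const sketch_tagged_string_t *s)
{
    switch (s->encoding) {
    case SKETCH_ENC_US_ASCII:
    case SKETCH_ENC_LATIN1:
        return strlen(s->bytes);      /* one byte per character */
    case SKETCH_ENC_UTF8:
        return sketch_utf8_char_count(s->bytes);
    }
    return 0;                         /* unknown tag */
}
```

With one uniform internal encoding, only sketch_utf8_char_count() is
needed and there is no tag to get wrong.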

(And that's _if_ we agree of course. We could just say "screw the
 user" in this regard like CVS does (no sarcasm intended), and there
 will be even less nonsense in the libs (the nonsense will instead
 have to be dealt with by the user). But this is an important
 opportunity to do better than CVS, and that's what Subversion is all
 about, isn't it?)

  // Marcus

Received on Sat Jun 1 14:36:11 2002

This is an archived mail posted to the Subversion Dev mailing list.