
Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-05-30 21:49:39 CEST

On Thu, May 30, 2002 at 10:30:44AM -0500, Karl Fogel wrote:
> Greg Stein <gstein@lyra.org> writes:
> > > If a log message is in some unknown and unknowable charset, I can't
> > > stick the text into a text widget and have any confidence that something
> > > legible will be displayed.
> >
> > Yup.
> >
> > Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> > particular comment from svn_fs.h:
>...
> Hmm, but that's just talking about paths.

Of course. I was showing one data point, and my email was moving on to the rest.

>...
> The issue here is log
> messages (the fact that log messages are stored as property values is
> an implementation detail -- I don't think the ideal that property
> values support binary data has any influence one way or the other on
> whether binary log messages should be allowed).

Yes.

> > We've always considered all properties to be binary. Thus, ra_dav will need
> > to encode it in some fashion to keep it safe within an XML body. While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8. And we informally decided (meaning: it isn't written down
> > like what is in svn_fs.h) on using UTF-8 as our library's character set a
> > long time ago also. Maybe I could find a reference, but I'm not going to
> > bother. We *did* choose it, so people can attempt to prove otherwise or
> > provide some technical reason why choosing one charset is Badness(tm).
>
> I don't understand the connection here.
>
> We didn't decide that all data coming into fs is UTF-8. We decided

I was talking about interfaces -- parameters. Not file contents.

> that pathnames were UTF-8, and that file contents and property values
> would be binary data (as far as the fs is concerned).

Of course.

>...
> > While the
> > log message *happens* to be a property, the interface calls it a char*,
> > which means UTF-8.
>
> The interface calls log messages `char *' as of one day ago :-), and

And if this conversation was two days ago, I would have said stringbuf.

The point is: where we have char* in our interfaces, they almost always
represent character data. I'm saying that we decided to treat them as
UTF-8 and to avoid carrying charset metadata around with them.

To be concrete: either those char* params are UTF-8, or you add a second
parameter to state their charset. (or you just go charset neutral which
isn't really a good option)
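
A rough sketch of those two options in C. The function names
(commit_log_utf8, commit_log_with_charset) and return codes are
hypothetical, made up for illustration; they are not Subversion APIs, and
the validator is just one plausible way to check the UTF-8 contract:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if s is well-formed UTF-8, 0 otherwise. Rejects overlong
   encodings, UTF-16 surrogates, and code points above U+10FFFF. */
static int is_valid_utf8(const char *s)
{
    /* Smallest code point allowed for a sequence with n continuation bytes. */
    static const unsigned int min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned int cp;
        int n, i;   /* n = number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        n = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; n = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; n = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; n = 3; }
        else return 0;                /* stray continuation or invalid lead */
        p++;

        for (i = 0; i < n; i++, p++) {
            /* A NUL terminator fails this test too, so we never overrun. */
            if ((*p & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min_cp[n]) return 0;                 /* overlong */
        if (cp > 0x10FFFF) return 0;                  /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
    }
    return 1;
}

/* Option 1 (hypothetical): char* means UTF-8 by contract.
   Returns 0 on success, -1 if the contract is violated. */
static int commit_log_utf8(const char *log_msg)
{
    assert(log_msg != NULL);
    return is_valid_utf8(log_msg) ? 0 : -1;
}

/* Option 2 (hypothetical): every string drags its charset along.
   Returns 0 on success, -1 for bad UTF-8, -2 for a charset this sketch
   does not handle (real code would have to transcode here). */
static int commit_log_with_charset(const char *log_msg, const char *charset)
{
    assert(log_msg != NULL && charset != NULL);
    if (strcmp(charset, "UTF-8") == 0)
        return is_valid_utf8(log_msg) ? 0 : -1;
    return -2;
}
```

With option 1 the contract lives in one place and callers convert at the
edges. With option 2 every function that touches the string needs the
extra parameter and some way to deal with each charset it might be handed.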

Think back. Like two years ago. We said UTF-8 was the SVN charset. Not just
paths. But all the content [outside of file content and prop values].

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
Received on Sat Jun 1 14:16:59 2002

This is an archived mail posted to the Subversion Dev mailing list.
