Re: use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2002-05-30 17:30:44 CEST

Greg Stein <gstein@lyra.org> writes:
> > If a log message is in some unknown and unknowable charset, I can't
> > stick the text into a text widget and have any confidence that something
> > legible will be displayed.
> Yup.
> Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
> particular comment from svn_fs.h:
> /* Here are the rules for directory entry names, and directory paths:
> A directory entry name is a Unicode string encoded in UTF-8, and
> may not contain the null character (U+0000). The name should be in
> Unicode canonical decomposition and ordering. No directory entry
> ...

Hmm, but that's just talking about paths. No one disagrees that paths
should be enforced to one canonical format. The issue here is log
messages (the fact that log messages are stored as property values is
an implementation detail -- I don't think the ideal that property
values support binary data has any influence one way or the other on
whether binary log messages should be allowed).
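For concreteness, here is a minimal sketch of what checking a directory entry name against the quoted svn_fs.h rule could look like: the bytes must be well-formed UTF-8 and must not contain U+0000. The function name entry_name_is_valid_utf8 is hypothetical and is not part of the Subversion API. The sketch does not check canonical decomposition or ordering, which would need Unicode normalization tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Subversion API): return 1 if the LEN bytes at S
   are well-formed UTF-8 containing no U+0000, else 0.  An empty name is
   accepted here; a real check might reject it. */
int entry_name_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t need;        /* number of continuation bytes expected */
        size_t k;

        if (c == 0x00)
            return 0;                       /* U+0000 is forbidden */
        if (c < 0x80) {                     /* plain ASCII */
            i++;
            continue;
        }
        if ((c & 0xE0) == 0xC0)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else
            return 0;                       /* stray continuation or bad lead byte */

        if (len - i - 1 < need)
            return 0;                       /* sequence truncated */

        for (k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(cc & 0x3F);
        }

        if (cp < min)
            return 0;                       /* overlong encoding */
        if (cp > 0x10FFFFUL)
            return 0;                       /* beyond the Unicode range */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                       /* UTF-16 surrogates are not valid */

        i += need + 1;
    }
    return 1;
}
```

A name typed in Latin-1, such as "caf" followed by the single byte 0xE9, fails this check, while the UTF-8 form (bytes 0xC3 0xA9) passes. That is the kind of enforcement being discussed for paths; whether log messages should get the same treatment is the open question.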

> We've always considered all properties to be binary. Thus, ra_dav will need
> to encode it in some fashion to keep it safe within an XML body. While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8. And we informally decided (meaning: it isn't written down
> like what is in svn_fs.h) on using UTF-8 as our library's character set a
> long time ago also. Maybe I could find a reference, but I'm not going to
> bother. We *did* choose it, so people can attempt to prove otherwise or
> provide some technical reason why choosing one charset is Badness(tm).

I don't understand the connection here.

We didn't decide that all data coming into fs is UTF-8. We decided
that pathnames were UTF-8, and that file contents and property values
would be binary data (as far as the fs is concerned).

This doesn't mean we can't enforce some convention for log messages in
particular, but such a decision is certainly not *implied* by anything
in the design of the fs right now.
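As an illustration of the point quoted above, that a binary property value has to be encoded "in some fashion" to stay safe inside an XML body, base64 is one common choice. The sketch below is an assumption for illustration only: the thread does not say which encoding ra_dav uses, and base64_encode is a hypothetical helper, not a Subversion or APR function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return a newly malloc'd, NUL-terminated base64
   encoding of the LEN bytes at DATA, or NULL if allocation fails.
   The caller frees the result. */
char *base64_encode(const unsigned char *data, size_t len)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t out_len = 4 * ((len + 2) / 3);   /* 4 output chars per 3 input bytes */
    size_t i, o = 0;
    char *out;

    assert(data != NULL || len == 0);

    out = malloc(out_len + 1);
    if (out == NULL)
        return NULL;

    /* Encode each full 3-byte group as 4 characters. */
    for (i = 0; i + 2 < len; i += 3) {
        unsigned long v = ((unsigned long)data[i] << 16)
                        | ((unsigned long)data[i + 1] << 8)
                        | (unsigned long)data[i + 2];
        out[o++] = tbl[(v >> 18) & 0x3F];
        out[o++] = tbl[(v >> 12) & 0x3F];
        out[o++] = tbl[(v >> 6) & 0x3F];
        out[o++] = tbl[v & 0x3F];
    }

    /* Encode the 1 or 2 leftover bytes, padding with '='. */
    if (len - i == 1) {
        unsigned long v = (unsigned long)data[i] << 16;
        out[o++] = tbl[(v >> 18) & 0x3F];
        out[o++] = tbl[(v >> 12) & 0x3F];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {
        unsigned long v = ((unsigned long)data[i] << 16)
                        | ((unsigned long)data[i + 1] << 8);
        out[o++] = tbl[(v >> 18) & 0x3F];
        out[o++] = tbl[(v >> 12) & 0x3F];
        out[o++] = tbl[(v >> 6) & 0x3F];
        out[o++] = '=';
    }

    out[o] = '\0';
    return out;
}
```

The output uses only ASCII letters, digits, '+', '/' and '=', so it can sit inside an XML element regardless of what bytes the property value holds. This transport-level encoding is independent of whether the fs itself treats the value as binary or as UTF-8 text.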

> While the
> log message *happens* to be a property, the interface calls it a char*,
> which means UTF-8.

The interface calls log messages `char *' as of one day ago :-), and
that's just fallout from 2024. There are comments in the code,
indicating that maybe it should go back to supporting binary data, as
it did up until 2024.

Received on Sat Jun 1 14:35:51 2002

This is an archived mail posted to the Subversion Dev mailing list.