
use of UTF-8 (was: [RFC/PATCH] commit messages not 8-bit compatible)

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-05-30 01:00:12 CEST

On Wed, May 29, 2002 at 03:30:35PM -0500, Jon Trowbridge wrote:
> On Wed, 2002-05-29 at 14:30, cmpilato@collab.net wrote:
> > Marcus Comstedt <marcus@mc.pp.se> writes:
> > > Since there are no properties on log messages, how do you propose that
> > > the actual character encoding for a log message be recorded?
> >
> > That information, as you may have inferred from my previous paragraph,
> > is stored "out of band", in a HACKING file or something, and is
> > regulated by the repos admins.

Untenable.

> ...but if you do this, anyone who wants to write a GUI client that
> allows for log message browsing is out of luck.

Exactly.

> If a log message is in some unknown and unknowable charset, I can't
> stick the text into a text widget and have any confidence that something
> legible will be displayed.

Yup.

Our decision to use UTF-8 for stuff was made a *long* time ago. Here is a
particular comment from svn_fs.h:

/* Here are the rules for directory entry names, and directory paths:

   A directory entry name is a Unicode string encoded in UTF-8, and
   may not contain the null character (U+0000). The name should be in
   Unicode canonical decomposition and ordering. No directory entry
...

We've always considered all properties to be binary. Thus, ra_dav will need
to encode it in some fashion to keep it safe within an XML body. While the
log message *happens* to be a property, the interface calls it a char*,
which means UTF-8. And we informally decided (meaning: it isn't written down
like what is in svn_fs.h) on using UTF-8 as our library's character set a
long time ago also. Maybe I could find a reference, but I'm not going to
bother. We *did* choose it, so people can attempt to prove otherwise or
provide some technical reason why choosing one charset is Badness(tm).

> Requiring utf-8 here might seem onerous, but it is pretty much the only
> way to avoid a whole class of annoying charset problems down the road.

Right. If the API has a text string, then SVN says that text string is in
UTF-8. If we have standard properties that are to be interpreted as text,
then those will be stored as UTF-8 strings (within the binary property).

While APR doesn't talk about character sets for its API (wrongly so, IMO),
the Subversion libraries *do*. Anything that is text will be UTF-8. Since
paths and URLs hold "characters" (but are hard to call "text"), they also
use UTF-8 for their character set.
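
To make the rule concrete, here is a minimal sketch (not code from the Subversion tree) of the checks a client could run before handing text to the libraries. It follows the svn_fs.h comment quoted above: text must be well-formed UTF-8, must not contain U+0000, and directory entry names should additionally be in canonical decomposition. The function names are hypothetical, and using Python's NFD normalization as a stand-in for "canonical decomposition and ordering" is an assumption.

```python
import unicodedata


def is_valid_svn_text(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 and contains no U+0000.

    Hypothetical helper: mirrors the rule that any text string
    (log message, path, URL) is UTF-8.
    """
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return "\x00" not in text


def is_canonical_entry_name(data: bytes) -> bool:
    """Return True if data is valid text AND already canonically decomposed.

    Assumption: NFD is what the svn_fs.h comment means by
    "canonical decomposition and ordering".
    """
    if not is_valid_svn_text(data):
        return False
    text = data.decode("utf-8")
    return unicodedata.normalize("NFD", text) == text
```

A log message typed in a Latin-1 locale would fail the first check, so the client would need to recode it to UTF-8 before committing rather than passing the raw bytes through.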

[ and extend as applicable to other concepts in the API... ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
the following algorithm will get us a
successor for NR, which is likely (though not guaranteed) to be
relatively "near" NR.
   Set up a cursor on NR's node revision id in the `nodes' table;
   Advance cursor to next row;
   THIS_NR = Current cursor location;
   If THIS_NR.NodeId != NR.NodeId:
      /* unrelated node, no more node revisions of NR */
      return FAILURE;
   If THIS_NR.CopyId == NR.CopyId
      && THIS_NR.TxnId is not a pending transaction:
      /* same node_id, same copy_id, must be different (older!) txn_id */
      return SUCCESS, THIS_NR;
   Else:
      Do:
         If THIS_NR.TxnId > NR.TxnId
            && THIS_NR.TxnId is not a pending transaction:
            /* same node_id, older copy_id, older txn_id */
            return SUCCESS, THIS_NR;
         Advance cursor to next row;
         THIS_NR = Current cursor location;
      While (THIS_NR.NodeId == NR.NodeId)
   return FAILURE;
However, I realize that adding ordering to those IDs is probably not a
popular thought.  :-)
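
For readers who want to step through the cursor walk, here is a literal transcription of the pseudocode above as a rough sketch. Everything about the data model is an assumption made for illustration: the `nodes' table is stood in for by a sorted Python list, node revision ids are modelled as integer triples that sort in (node id, copy id, txn id) order, pending transactions are a plain set, and the names NodeRevId and find_successor are hypothetical. Real Subversion ids are not guaranteed to order this way, which is exactly the caveat raised above.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class NodeRevId(NamedTuple):
    """Hypothetical stand-in for a node revision id; sorts as a tuple."""
    node_id: int
    copy_id: int
    txn_id: int


def find_successor(rows: list, nr: NodeRevId, pending: set) -> Optional[NodeRevId]:
    """Walk forward from nr in the sorted rows, as in the pseudocode.

    Returns the first qualifying row, or None for FAILURE.
    """
    i = bisect_right(rows, nr)  # cursor on nr, advanced to the next row
    if i >= len(rows):
        return None  # ran off the end of the table (not covered by the pseudocode)
    this_nr = rows[i]
    if this_nr.node_id != nr.node_id:
        return None  # unrelated node, no more node revisions of nr
    if this_nr.copy_id == nr.copy_id and this_nr.txn_id not in pending:
        return this_nr  # same node id, same copy id, different txn id
    # Otherwise keep scanning rows that still belong to the same node.
    while i < len(rows) and rows[i].node_id == nr.node_id:
        this_nr = rows[i]
        if this_nr.txn_id > nr.txn_id and this_nr.txn_id not in pending:
            return this_nr
        i += 1
    return None
```

With rows for node 1 under copy ids 0 and 1, asking for the successor of (1, 0, 1) returns the next committed row under copy id 0, and falls through to the copy id 1 row only when that next row belongs to a pending transaction.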
Received on Sat Jun 1 14:22:50 2002

This is an archived mail posted to the Subversion Dev mailing list.
