Re: UTF-8

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-05-23 16:03:41 CEST

Greg Hudson <ghudson@MIT.EDU> writes:

> (Major irritation: I can't get at Marcus's patch via the list archives
> to review it again. When I click on the attachment, I get a server
> error. This is in addition to having to frob the URL to get to the
> messages at the end of a month to find it in the first place.)

Hm, does this have anything to do with the text/x-patch MIME type
perhaps?

> Or they have more incentive to want tools which use UTF-8. We lose most
> of the advantage of Unicode if we have to put in the conversion hooks
> everywhere anyway to support legacy character sets.

Incentive is one thing. But Subversion is not in a position to
_demand_ such a thing. It is a support tool that has to fit in with
the tools and operating systems already in use. If you want to use
UTF-8 under Unix, all you have to do is select a UTF-8 locale and the
conversions will be identity conversions. Those of us who want to use
other locales can do so, and have the strings converted accordingly.
To me it's much more important to have characters like åäö show up
properly in other tools like the shell than to be able to use
characters like "happy little snowman with combining broom above". It
should be up to each individual user to choose what kind of "advantage
of Unicode" he would prefer.
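
As a rough illustration of the identity-conversion point, here is a
minimal Python sketch. It is not Subversion code; the helper name
utf8_to_native and the two example encodings are assumptions made up
for this example.

```python
def utf8_to_native(data: bytes, native_encoding: str) -> bytes:
    """Recode UTF-8 bytes into the user's native (locale) encoding."""
    return data.decode("utf-8").encode(native_encoding)


# The string as it would be held internally, in UTF-8.
internal = "\u00e5\u00e4\u00f6".encode("utf-8")

# In a UTF-8 locale the conversion leaves the bytes unchanged.
assert utf8_to_native(internal, "utf-8") == internal

# In an ISO-8859-1 locale each character becomes a single byte.
assert utf8_to_native(internal, "iso-8859-1") == b"\xe5\xe4\xf6"
```

A user in a UTF-8 locale pays nothing for the hook, while a user in
another locale gets strings the shell and other tools can display.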

> Also, even if we don't do conversion, I don't think we really hurt sites
> which want to use some alternate character set. We can version
> arbitrary binary data, not just UTF-8 data. We don't enforce UTF-8
> validity of strings anywhere in the code. So as long as everyone is
> using the same character set, they'll be okay even if it isn't UTF-8.

That's the approach taken by CVS, and it works fairly well in
practice. I can't say why the UTF-8 approach was chosen for
Subversion, since I was not part of making that decision. (Although
it's not 100% accurate to say that UTF-8 validity of strings need not
be enforced, since strings are being put in XML files without charset
declarations, and such XML files must conform to UTF-8 validity
rules. I think the actual validity test has been removed, but that
doesn't make it technically correct to dump binary strings straight
into XML data. Things like base64 could be used to fix this though.)
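
To make the base64 idea concrete, here is a minimal Python sketch of
wrapping an arbitrary byte string so that the XML file stays valid
UTF-8. The element name "value" and the attribute encoding="base64"
are hypothetical, chosen only for this example; they are not taken
from any Subversion file format.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_value(raw: bytes) -> ET.Element:
    """Store arbitrary bytes as base64 text inside an XML element."""
    elem = ET.Element("value", {"encoding": "base64"})
    elem.text = base64.b64encode(raw).decode("ascii")
    return elem


def unwrap_value(elem: ET.Element) -> bytes:
    """Recover the original bytes from a base64-wrapped element."""
    return base64.b64decode(elem.text or "")


raw = b"\xe5\xe4\xf6"  # ISO-8859-1 bytes, not a valid UTF-8 sequence
xml_bytes = ET.tostring(wrap_value(raw), encoding="utf-8")

# The serialized document contains only ASCII, so it is valid UTF-8,
# and the original bytes survive the round trip unchanged.
xml_bytes.decode("ascii")
assert unwrap_value(ET.fromstring(xml_bytes)) == raw
```

The cost is that the stored value is no longer readable by eye in the
XML file, which may or may not matter for a given file.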

For _file contents_, binary data should be assumed of course.
No conversions are done for those. (Although it might be neat to be
able to manually enable conversions for file contents through
properties. That's probably post 1.0 though.)

> "All system calls" is kind of vague, but the implication is that the
> Subversion libraries will write out and read in all data in the native
> character set--to files, and over the wire on the network.

I was too vague, yes. It does _not_ apply to data written to sockets
and files that are internal to Subversion. For example, the contents
of the property files are still UTF-8. Neither does it apply to
external files/streams which are _supposed_ to be in UTF-8, such as
those specified with --xml-file.

> and that a
> working directory couldn't be moved between machines which used
> different character sets.

A working directory containing _filenames_ which are non-ASCII can
only be moved if the tool used to do the move takes care of
translating the filenames. But I think that is acceptable; that is
the normal situation when moving stuff between systems.

> If you're taking a less expansive approach than that, please describe
> what the libraries do which needs to do character set conversion, and
> what doesn't.

Conversion needed:

* Messages printed to stdout/stderr or non-XML logfiles need
  conversion
* Pathnames of local files and directories need conversion
* Name service calls such as getpwnam need conversion
* Command line arguments passed to exec need conversion

Conversion not needed:

* Data sent between client and server do not need conversion
* Data stored in internal files do not need conversion
* Data written to XML files/streams do not need conversion
* Contents of files under version control do not need conversion (but
  must not be assumed to be UTF-8 by the code; something which the
  keyword expansion code has to be aware of.)
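
The split above can be sketched as a rule about where the conversion
hook sits: only at the boundary to the local system. The following is
a minimal Python sketch under that assumption. All function names are
hypothetical, the native encoding is passed in explicitly instead of
being read from the locale, and error handling for unconvertible
characters is left out.

```python
import io


def to_native(utf8_data: bytes, native_encoding: str) -> bytes:
    """Recode UTF-8 bytes for the local system."""
    return utf8_data.decode("utf-8").encode(native_encoding)


def print_message(stream, utf8_msg: bytes, native_encoding: str) -> None:
    # Boundary to the user's terminal: conversion needed.
    stream.write(to_native(utf8_msg, native_encoding))


def local_path(utf8_path: bytes, native_encoding: str) -> bytes:
    # Boundary to the local filesystem: conversion needed.
    return to_native(utf8_path, native_encoding)


def write_internal(stream, utf8_data: bytes) -> None:
    # Internal files, XML streams, client/server traffic: stays UTF-8.
    stream.write(utf8_data)


msg = "Skipped 'r\u00e4ksm\u00f6rg\u00e5s.txt'\n".encode("utf-8")
terminal, internal_file = io.BytesIO(), io.BytesIO()

print_message(terminal, msg, "iso-8859-1")
write_internal(internal_file, msg)

# The terminal sees native bytes; the internal file keeps the UTF-8 bytes.
assert terminal.getvalue() == msg.decode("utf-8").encode("iso-8859-1")
assert internal_file.getvalue() == msg
```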

I hope this makes things a little clearer. Apologies for the
previous vagueness.

// Marcus

Received on Thu May 23 16:08:40 2002
