Where to convert?

From: Ulrich Drepper <drepper_at_redhat.com>
Date: 2002-07-22 08:41:08 CEST

I talked to somebody (don't know who anymore) about this some time ago.
Sander Striker said I should bring it up on the mailing list again so
here we go. Sander didn't say whether this was already widely discussed,
so please excuse me if it was. I would have raised the issue anyway.

What I'm concerned about is the place where all the codeset conversions
happen and how they happen.

My current assumption is that all the conversions happen in the client
and that transmitted text is unconditionally transferred and stored in
UTF-8. That's a nice and easy model and it distributes the load of the
whole process more evenly.

But I think this must not be the only mode of operation, and it should
not even be the default. It has some severe problems:

A very often the developers of a project consistently use one codeset,
making all the back and forth conversions unnecessary overhead

B by relying on different converters on all the different clients the
process opens itself up to bugs and/or inconsistencies in the
converters

C not all codesets can be converted losslessly to Unicode.

So we have a performance, correctness, and existential problem, each of
which I'd consider reason enough to rethink the process.

Performance is still and will always be an issue. If all text must be
converted, lots of copy operations have to be performed, which has a
measurable impact on all systems. And bugs in converters do exist.
It's not a nice thing to be unable to edit a file because the author
used a system where the conversion is handled differently. And
declaring codesets which are not encoded in Unicode as not worthy of
being supported is just ignorant.
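
To illustrate point B, here is a minimal sketch in Python (not part of
the proposal itself). It assumes that Python's bundled "shift_jis" and
"cp932" codecs stand in for the converters on two different clients;
the helper name round_trips is hypothetical. Both codecs claim to
handle Shift_JIS text, yet they map the wave dash bytes to different
Unicode code points, so text written through one converter does not
come back unchanged through the other.

```python
def round_trips(data: bytes, decoder: str, encoder: str) -> bool:
    """Return True if decoding with one converter and encoding with
    another reproduces the original bytes exactly."""
    try:
        return data.decode(decoder).encode(encoder) == data
    except UnicodeError:
        # The second converter has no mapping for what the first produced.
        return False


# The Shift_JIS byte sequence for the wave dash character.
WAVE_DASH = b"\x81\x60"

# One converter on both sides: the bytes survive.
same_converter = round_trips(WAVE_DASH, "shift_jis", "shift_jis")

# Two converters that disagree: the bytes do not survive.
mixed_converters = round_trips(WAVE_DASH, "cp932", "shift_jis")

# The disagreement is already visible in the intermediate Unicode text.
code_points_differ = WAVE_DASH.decode("shift_jis") != WAVE_DASH.decode("cp932")
```

With a single converter on the server this class of mismatch cannot
occur, whatever mapping that converter happens to choose.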

Note that the last point is much more important for svn than for other
projects using exclusively Unicode internally (such as some GUI
toolkits). The latter normally don't have to deal with arbitrary texts,
just texts which appear in menus and dialogs. There might be an editor
widget but it's not the only play in town. Svn on the other hand is the
one and only tool to use. Imagine somebody at a research library who's
translating documents for which a new encoding had to be devised. It
should not be impossible to use svn. If it were, the people who set
up the system would have to rule out using svn from the beginning, even
without acute problems, since problems might appear at some time.

The currently implemented mode of operation might very well be usable in
many places but there should be another mode:

- conversions are performed on the server side. The HTTP protocol
  allows the encoding of the transmission to be specified. If the
  server cannot handle the encoding, an error can be returned

- conversion of file data should be optional. The file/directory
  attributes which are already used can be extended to a flag specifying
  encoding of a file. The default might be UTF-8. If a directory has
  an attribute, all contained files are encoded this way (unless
  overridden). The encoding for files can be explicitly specified.
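
The attribute lookup described in the second item could be sketched as
follows. This is only an illustration in Python: the function name
effective_encoding and the plain dictionary standing in for the
file/directory attributes are hypothetical and not part of svn.

```python
import posixpath
from typing import Dict

DEFAULT_ENCODING = "UTF-8"  # the default suggested in the text


def effective_encoding(path: str, attrs: Dict[str, str]) -> str:
    """Resolve the encoding for a path: its own attribute wins, then
    the nearest enclosing directory's, then the default."""
    current = posixpath.normpath(path)
    while True:
        if current in attrs:
            return attrs[current]
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top without a match
            return DEFAULT_ENCODING
        current = parent


# Hypothetical attribute table: path -> declared encoding.
attrs = {
    "/docs": "ISO-8859-1",
    "/docs/ja": "EUC-JP",
    "/docs/ja/notes.txt": "ISO-2022-JP",
}

inherited = effective_encoding("/docs/index.txt", attrs)
nested = effective_encoding("/docs/ja/readme.txt", attrs)
explicit = effective_encoding("/docs/ja/notes.txt", attrs)
fallback = effective_encoding("/src/main.c", attrs)
```

A project that sets the attribute once at its top-level directory
would get one encoding everywhere without tagging each file.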

I don't know how well (or badly) this would integrate with svn's framework
but think about it. If the preferred encoding for a project can be
specified this might avoid a large part of the conversions which are
done today. This can avoid the performance problem and the problem of
Unicode coverage. By moving the conversion process into the server, the
number of participating converters is reduced to one, which means no
diverging results.
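
A rough sketch of what the server-side step might look like, in
Python. Everything here is an assumption made for illustration: the
function name store_text, the use of Python's codec registry as the
server's single converter, and the HTTP-style status codes (415 for an
encoding the server does not know, 422 for text that cannot be
represented in the stored encoding). The text above only says that an
error can be returned.

```python
import codecs
from typing import Tuple


def store_text(body: bytes, declared: str, stored: str) -> Tuple[int, bytes]:
    """Return (status, bytes to store) for text sent in the `declared`
    encoding and destined for a file kept in the `stored` encoding."""
    try:
        src = codecs.lookup(declared)
        dst = codecs.lookup(stored)
    except LookupError:
        return 415, b""  # the server cannot handle this encoding

    if src.name == dst.name:
        # Client and project agree: no conversion, no copy, no loss.
        return 200, body

    try:
        # Strict conversion in the one converter the server provides.
        return 200, body.decode(src.name).encode(dst.name)
    except UnicodeError:
        return 422, b""  # refuse rather than store damaged text


same = store_text(b"caf\xe9", "ISO-8859-1", "latin-1")
converted = store_text(b"caf\xe9", "ISO-8859-1", "UTF-8")
unknown = store_text(b"x", "no-such-codeset", "UTF-8")
lossy = store_text("\u20ac".encode("utf-8"), "UTF-8", "ISO-8859-1")
```

Note that in the matching case the bytes pass through without being
validated, which is what removes the overhead described in point A.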

There is one place where I don't see a problem with using UTF-8
throughout, if this is desirable, and that is the handling of file
names.

I do know about the negative effects: the server will have a lot more to
do if the conversions actually have to be performed. But I do think
that the benefits outweigh this. The main benefit is correctness,
which in my opinion should be the highest priority, since there will
hopefully be millions of people trusting their texts to the system.

Please think about this and don't dismiss it just because this is not
how it's done today or because Apache doesn't need it.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

Received on Mon Jul 22 08:41:41 2002