Re: Encoding in our APIs

From: Branko Čibej <brane_at_xbc.nu>
Date: 2005-05-02 00:08:16 CEST

Peter N. Lundblad wrote:

>On Sun, 1 May 2005, Branko Čibej wrote:
>
>
>
>>Peter N. Lundblad wrote:
>>
>>where the encoding information comes from. If you change diff to write
>>the headers in the "native" encoding, there's still a good chance that
>>it'll be a different encoding than the contents of the files are in.
>>
>>
>>
>Yes. Did you read notes/diff-encodings.txt? This is not "the complete
>solution". We need a reasonable fallback when the file has no encoding
>information (like all or most don't have today). I am implementing that in
>the first step.
>
>
O.K.

>But, when I have your attention;), could you clarify the difference
>between the output encoding and the locale encoding on Windows? For our
>normal output, such as messages, we use the output encoding (i.e. console
>code page). Is that what you want to do for diffs as well?
>
>
This is a tough question, and I'm not sure I know the right answer.

First, the background about encodings on Windows. Windows uses two (or
three...read on) different multibyte encodings at the same time. The
first is the so-called ANSI encoding, which is the locale-dependent
single- or multibyte encoding used by "native" Windows applications. The
second is the "OEM" encoding, also locale-dependent, used for
compatibility with DOS applications. "But isn't DOS dead?" I'd have
thought so, too...

The point about DOS compatibility is that the OEM encoding is used *by
default* for console I/O. This is complicated by the fact that the
console input and console output encodings can be different and changed
independently at runtime. Internally, Windows programs use the ANSI
encoding, and most Win32 API functions assume that encoding.
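
To make the ANSI/OEM split concrete, here is a small Python sketch (not
Subversion code). It assumes a Central European Windows locale, where the
ANSI code page is typically 1250 and the OEM code page 852; both numbers
are assumptions for the example. A real program would ask the system
(GetACP, GetOEMCP, GetConsoleOutputCP) instead of hard-coding them.

```python
# Assumed code pages for a Central European Windows locale (illustrative only).
ANSI_CODEPAGE = "cp1250"
OEM_CODEPAGE = "cp852"

name = "Čibej"

# The same text has a different byte representation in each encoding.
ansi_bytes = name.encode(ANSI_CODEPAGE)
oem_bytes = name.encode(OEM_CODEPAGE)
print(ansi_bytes != oem_bytes)

# Bytes produced in the ANSI encoding but interpreted by a console that
# uses the OEM encoding do not come back as the original text.
garbled = ansi_bytes.decode(OEM_CODEPAGE, errors="replace")
print(garbled != name)
```

Both comparisons hold for this input, which is the mismatch you see when
a "native" Windows program writes ANSI bytes to a console that is still
using the OEM code page.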

All this ignores Unicode, which, since the advent of Windows NT, is the
preferred "native" coding system for Windows apps. Earlier versions of
Windows NT used the UCS-2 encoding, but that silently changed to UTF-16
(both little-endian) in Windows 2000 (not that there's much difference
in practice, except for the surrogate pair range).
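
The surrogate pair difference can be shown with a short Python sketch.
The two sample characters are my own choice for illustration: one inside
the Basic Multilingual Plane, where UCS-2 and UTF-16 agree, and one
outside it, which UTF-16 encodes as two 16-bit code units and UCS-2
cannot represent at all.

```python
bmp_char = "\u010c"          # LATIN CAPITAL LETTER C WITH CARON, inside the BMP
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# Count 16-bit code units in the little-endian UTF-16 form.
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_units = len(astral_char.encode("utf-16-le")) // 2
print(bmp_units)     # 1
print(astral_units)  # 2

# The two code units of the second character form a surrogate pair.
encoded = astral_char.encode("utf-16-le")
high = int.from_bytes(encoded[0:2], "little")
low = int.from_bytes(encoded[2:4], "little")
print(hex(high), hex(low))  # 0xd834 0xdd1e
```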

There are some interesting twists here as far as Subversion is concerned:

First, APR on Windows NT does not use any of the above-mentioned
encodings -- it uses UTF-8. This is because APR uses narrow character
types internally (char, not wchar_t), so UTF-8 was a natural choice as
it's easily convertible to UTF-16. On Windows 9x/Me, APR uses the ANSI
encoding. APR makes the choice at runtime, so the same binaries will use
different internal string representations on different Windows systems.
Subversion follows APR's lead here.
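
As a rough illustration of why UTF-8 is convenient here, the Python
sketch below converts a narrow UTF-8 string to little-endian UTF-16 and
back without loss. This is not APR's code; the helper names and the
sample file name are hypothetical.

```python
def utf8_to_utf16le(utf8_bytes: bytes) -> bytes:
    """Convert a narrow UTF-8 string to the wide form a Unicode API expects."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(utf16_bytes: bytes) -> bytes:
    """Convert a wide UTF-16LE string back to the narrow internal form."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


internal = "Čibej.txt".encode("utf-8")   # narrow internal representation
wide = utf8_to_utf16le(internal)         # what a wide-character call would take

# The round trip is lossless, which is not guaranteed for a
# locale-dependent ANSI code page.
print(utf16le_to_utf8(wide) == internal)
```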

Second, even though the encoding used for console I/O can be changed at
runtime, Subversion does not do that. In an earlier version, we used to
change the console encoding to match the ANSI encoding, but that turned
out to be a bad idea: the svn command-line client doesn't "own" the
console, and the encoding used by a particular console window isn't
specific to the running process. If you change it in an svn command, it
doesn't revert to the previous value after the command has completed.
That was especially embarrassing when Subversion was used from a cygwin
shell, which has its own ideas about what the console encoding is
supposed to be...

The really big problem for "svn diff" is that, unlike most other
commands, it produces output before the command-line client has a chance
to convert it (but you already know that :). In most cases, the
svn_cmdline_printf functions convert the internal (usually UTF-8) strings
to whatever the console encoding is. We can't do this with a diff
stream (and I suspect blame has similar problems).
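
A rough Python sketch of the problem, under stated assumptions: the
function write_diff, the file name, and the two code pages (852 for the
console, 1250 for the file contents) are all made up for the example and
are not how Subversion's diff code is written. The header is text we can
re-encode, while the content lines are raw bytes in whatever encoding the
file happens to use.

```python
import io
from typing import BinaryIO, Iterable


def write_diff(out: BinaryIO, header: str, content_lines: Iterable[bytes],
               header_encoding: str) -> None:
    """Write a re-encoded header followed by file content passed through as-is."""
    out.write(header.encode(header_encoding, errors="replace"))
    for line in content_lines:
        out.write(line)  # raw bytes; their encoding is unknown to the writer


stream = io.BytesIO()
header = "Index: Čibej.txt\n"
content = ["+Čibej\n".encode("cp1250")]  # file stored in an assumed ANSI code page

write_diff(stream, header, content, header_encoding="cp852")

# Reading the whole stream in the console encoding gets the header right
# but garbles the content line.
as_console = stream.getvalue().decode("cp852", errors="replace")
print(as_console.startswith(header))
print("+Čibej" in as_console)
```

With these inputs the header survives and the content line does not,
because no single encoding fits both parts of the stream.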

-- Brane

Received on Mon May 2 00:08:53 2005
