Proposal for supporting WC file content encoding

From: Jesper Steen Møller <jesper_at_selskabet.org>
Date: 2006-03-26 21:52:17 CEST

Dear Subversion-dev,

I'm proposing to add functionality for "handling" encoding in the text
content which Subversion handles.
I've read the discussion on UTF-16 support (as referenced in
http://subversion.tigris.org/issues/show_bug.cgi?id=2194), and I've been
lacking locale-aware content encoding myself (in fact, I assumed it was
present already...), and I think the feature can be implemented without
changing too much of the client code. I hope I'm not just stating the
obvious, but I think it could be done in these few steps:

1) Property support for specifying text encoding
2) ASCII-based encoding conversion support in the WC library (along with
   the current EOL and keyword handling in 'subst')
3) Extend support to "non-ASCII-based" encodings (like UTF-16, EBCDIC)
4) Adding auto-property support for text encoding

Each step is independent of the next, so we could stop at any
time; no "big bang" is necessary. The proposal is based on four main
principles:
* It's free if you don't need it
* Don't surprise the user
* UTF-8 is the new ASCII ("1112064 code points should be enough for
  everybody")
* Must be backwards compatible

The idea is to normalize all text which uses the feature to UTF-8 and
then only convert to/from specified encodings at the outermost level of
the WC handling, like this:

   WC file contents (user specified encoding or locale's)
         ||
  Enriched contents (UTF-8, w/keywords and/or EOL trans)
         ||
   Pristine contents (UTF-8, as stored in FS)

'svn diff' between WC and pristine would convert the WC file up to the
"enriched" level before feeding it to the diff libraries. (I'm not sure
how this would be handled for external diff packages; it might have to
save to a temporary file.)
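
To make the layering concrete, here is a minimal Python sketch of the
outermost conversion step only. The function names are hypothetical and
not part of any Subversion API, and keyword and EOL translation are
left out entirely:

```python
def wc_to_enriched(wc_bytes: bytes, wc_encoding: str) -> bytes:
    """Convert working-copy bytes to the UTF-8 form used internally."""
    return wc_bytes.decode(wc_encoding).encode("utf-8")


def enriched_to_wc(utf8_bytes: bytes, wc_encoding: str) -> bytes:
    """Convert internal UTF-8 content back to the working-copy encoding."""
    return utf8_bytes.decode("utf-8").encode(wc_encoding)


# A file stored as ISO-8859-1 in the WC; "\u00f8" is the letter o-slash.
wc_bytes = "M\u00f8ller\n".encode("iso-8859-1")
enriched = wc_to_enriched(wc_bytes, "iso-8859-1")
round_trip = enriched_to_wc(enriched, "iso-8859-1")
```

Under this sketch a diff against pristine would compare 'enriched'
with the stored UTF-8 text, never the raw WC bytes.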

The server (RA level) would only see the UTF-8 versions and would not
need any changes. The client would detect encoding by looking at the
property, and act accordingly. Old clients would not know this and only
see a UTF-8 file.

Further details about the four steps:

Ad 1) Property support for specifying text encoding:

I propose that we introduce a new property for text files called
svn:text-encoding (or maybe svn:text-encoding-style, or perhaps just
svn:encoding?).
This can take three kinds of values:
- The name of a specific encoding, like ISO-8859-1 or UTF-8
- The special value 'native'
- Empty or missing (the default)

The idea is that IF svn:text-encoding is specified, then the WC library
and the clients in general are responsible for converting to and from
the specified encoding (with 'native' being the system's default
encoding), and that the RA level only ever sees UTF-8 for these text
resources. The encoding is said to be "managed".
This follows the style of svn:eol-style and needs support in roughly the
same places.
The 'native' mode is interesting for the case where text files (like
Java source files) do not carry the encoding with them (as XML, for
example, does).
If the text-encoding is not set, then the encoding is "unmanaged" in
that it works like it does today.
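
As a rough illustration of the three kinds of values, the following
Python sketch maps a property value to a codec name. The function name
is hypothetical, and treating 'native' as the locale's preferred
encoding is my assumption about what "system default" would mean:

```python
import codecs
import locale
from typing import Optional


def resolve_wc_encoding(prop_value: Optional[str]) -> Optional[str]:
    """Return a codec name for a managed file, or None if unmanaged."""
    if prop_value is None or prop_value.strip() == "":
        return None  # empty or missing: encoding is unmanaged
    value = prop_value.strip()
    if value.lower() == "native":
        # Assumption: 'native' means the locale's preferred encoding.
        value = locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown encoding names.
    return codecs.lookup(value).name
```

An unknown encoding name raises LookupError here; a real client would
have to decide whether that is an error or falls back to unmanaged.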

Ad 2) ASCII-based encoding conversion support in the WC library:

The first step in supporting this would be to add the support into the
WC and client libraries. For 8-bit (ASCII-based) encodings, the basic
support of this doesn't touch the diff support, which at this point
already makes some assumptions about the encoding, as far as I can tell.
I think the "streamy" API in svn_subst.c can be layered with the
encoding support. Also, diff output should reflect these encoding
changes, to show "encoding-only" changes:

Index: cool-stuff/todo.txt
===================================================================
--- cool-stuff/todo.txt (revision 42, ISO-8859-1)
+++ cool-stuff/todo.txt (working copy, UTF-8)
svn:text-encoding = UTF-8

Property changes on: cool-stuff/todo.txt
___________________________________________________________________
Name: svn:text-encoding
- ISO-8859-1
+ UTF-8

There are some edge cases to be considered, when the text-encoding
changes from "unmanaged" to "managed" (or back), where the diff engine
would pick up all kinds of "bogus" text changes. This may need special
attention.
Another edge case: Some commit logic should be present to check that a
"managed" file being checked in is in fact valid in the said encoding
(so that careless handling of file encoding won't inadvertently break
the repository data).
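
The commit-time check could be as simple as a strict decode. This is
only a sketch in Python with a hypothetical function name, not a
proposal for where the check would live in the client:

```python
from typing import Optional


def check_managed_content(wc_bytes: bytes, encoding: str) -> Optional[str]:
    """Return None if wc_bytes is valid in encoding, else an error message."""
    try:
        wc_bytes.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        return f"invalid {encoding} data at byte offset {err.start}: {err.reason}"
    return None


ok = check_managed_content(b"caf\xc3\xa9", "utf-8")   # valid UTF-8
bad = check_managed_content(b"caf\xe9", "utf-8")      # ISO-8859-1 bytes
```

Note that such a check is weak for encodings like ISO-8859-1, where
every byte sequence decodes successfully, so it mainly protects files
declared as UTF-8 or UTF-16.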

Ad 3) Extend this support to "non-ASCII-based" encodings (like UTF-16,
EBCDIC):

Actually this may not be a big issue at all, if the conversions are
added at the right level, since the main diffing engine would always
work on UTF-8 (in fact, it would always work on 8-bit oriented streams
separated by LFs, just like it does now).
The only change I can think of right now is the fixed-width keyword
substitution, which today works on bytes, but could work fine on
characters if the encoding were known.
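
To illustrate the byte-versus-character point, here is a small Python
sketch. The helper names are hypothetical, and it only shows padding a
value to a fixed width, not Subversion's actual keyword substitution:

```python
def pad_bytes(value: str, width: int) -> bytes:
    """Byte-oriented: pad the UTF-8 form of value to width bytes."""
    return value.encode("utf-8").ljust(width, b" ")


def pad_chars(value: str, width: int) -> str:
    """Character-oriented: pad value to width characters."""
    return value.ljust(width)


name = "M\u00f8ller"  # six characters, seven bytes in UTF-8
by_bytes = pad_bytes(name, 10).decode("utf-8")
by_chars = pad_chars(name, 10)
```

Padding by bytes leaves the field one character short once the text is
decoded, so the field width the user sees would depend on the content.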

Ad 4) Adding auto-property support for text encoding:

Plenty of options exist: BOM detection, detection of UTF-8
leading/trailing bytes, checking for XML declarations, etc. There should
also be a configuration setting for preferring the native encoding
over the detected one (if the detector sees a file encoded with the
encoding which is also the current native encoding).
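
A detector along these lines could be sketched in Python as follows.
The function name and the order of the checks are my assumptions, and
the preference for the native encoding is not covered:

```python
import codecs
import re
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]
_XML_DECL = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_encoding(data: bytes) -> Optional[str]:
    """Guess an encoding name from file content, or None if unsure."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    try:
        data.decode("utf-8")  # validates lead and continuation bytes
    except UnicodeDecodeError:
        return None
    return "UTF-8"  # pure ASCII content also ends up here
```

The XML check only handles declarations written in an ASCII-compatible
encoding; anything the sketch cannot identify is reported as None and
would stay unmanaged.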

How does this sound?

-Jesper

Received on Tue Mar 28 13:41:36 2006
