Re: Proposal for supporting WC file content encoding

From: Jesper Moller <jesper_at_selskabet.org>
Date: 2006-03-30 09:24:42 CEST

Philip Martin wrote:

>Julian Foad <julianfoad@btopenworld.com> writes:
>
>
>>Jesper Steen Møller wrote:
>>
>>
>>>I'm proposing to add functionality for "handling" encoding in the
>>>text content which Subversion handles.
>>>
>>>
>>This proposal looks generally quite promising, with the potential to
>>introduce some useful and practical behaviours, but I'm not exactly
>>sure what you are aiming to achieve. You wrote about the
>>implementation method that you have chosen, but did not say what you
>>want users to be able to do, or why. What are the user-oriented
>>goals? To help describe the goals, it might be helpful to include
>>some "use cases", i.e. realistic concrete examples (like transcripts)
>>that demonstrate the various ways in which the user can interact with
>>this feature.
>>
>>
>For example if someone were to use svn:encoding="iso-8859-1" to
>produce a working file in iso-8859-1 the file in the working copy
>would be exactly the same as it is today without your new feature.
>
>
True. With knowledge of the right encoding, the file could even be served
with the correct charset by the HTTP repository access (and by other
clients such as ViewVC). However, that would not enable diff and merge
for non-ASCII-based encodings. The proposed feature would do just that.
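
To illustrate why a non-ASCII-based encoding defeats line-oriented diff, here is a minimal Python sketch. The `line_diff` helper is hypothetical and is not Subversion code; it only assumes that a byte-oriented tool splits lines on the byte 0x0A, while the proposed feature would transcode before comparing.

```python
import difflib

old = "héllo\nworld\n"
new = "héllo\nwørld\n"

# Splitting raw UTF-16 on the newline byte cuts a 2-byte code unit in half:
# the "line" after the first split starts with a stray NUL byte.
raw_lines = old.encode("utf-16-le").split(b"\n")

def line_diff(a_bytes, b_bytes, encoding):
    """Hypothetical helper: decode with the declared encoding, then diff by line."""
    a = a_bytes.decode(encoding).splitlines(keepends=True)
    b = b_bytes.decode(encoding).splitlines(keepends=True)
    return list(difflib.unified_diff(a, b, "old", "new"))

diff = line_diff(old.encode("utf-16-le"), new.encode("utf-16-le"), "utf-16-le")
```

Once the content is decoded, the diff reports a single changed line instead of misaligned byte fragments.
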
Maybe we should distinguish between "managed" encoding
(svn:force-encoding=xxx, where Subversion does the conversion) and
"declared" encoding (the example above, where Subversion just uses it,
e.g. in "Content-Type: " + prop("svn:content-type") + ";charset="
+ prop("svn:encoding")).
See also <http://lists.debian.org/debian-devel/2005/08/msg01458.html>
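
The two modes could be sketched roughly as follows. This is only an illustration in Python: the `props` dictionary and both function names are hypothetical, the property names are the ones suggested above, and it assumes the repository would hold the text as UTF-8.

```python
def content_type_header(props):
    """'Declared' encoding: no conversion, the property is only reported."""
    mime = props.get("svn:content-type", "text/plain")
    charset = props.get("svn:encoding")
    if charset:
        return "Content-Type: " + mime + ";charset=" + charset
    return "Content-Type: " + mime

def managed_working_copy_bytes(repo_bytes, props):
    """'Managed' encoding: convert from the (assumed) UTF-8 repository form."""
    target = props.get("svn:force-encoding")
    if target is None:
        return repo_bytes  # nothing to manage, pass the bytes through
    return repo_bytes.decode("utf-8").encode(target)
```

For example, a file with svn:content-type=text/html and svn:encoding=iso-8859-1 would be served with "Content-Type: text/html;charset=iso-8859-1", and its bytes would be left untouched.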

>The svn:encoding="native" would have some effect, but it's not clear
>to me how useful it would be. You mentioned Java source; I don't know
>a great deal about Java but ISO C source code can also, in theory, be
>written in any encoding. While such source can be converted from one
>encoding to another automatically it usually requires human review to
>ensure that the meaning of the code is preserved.
>
>
A common area where this is important is in the comments (although Java
allows identifiers in full Unicode, this is rarely used) - a compiler
expecting UTF-8 is right to bark at ISO-8859-1. This is a major use
case, and I actually thought Subversion did this already (after all, it
works for file names and log messages). The issue that got me started on
this is this bug in Eclipse WTP:
<https://bugs.eclipse.org/bugs/show_bug.cgi?id=132898>
Open source projects are generally vulnerable to this, with all the
international and cross-platform cooperation going on.

The key point about Unicode is that you can round-trip ANY character set
to/from UTF-8, without any need for human review. See
<http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a really good (if
longish) explanation of Unicode and UCS.
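
A small Python sketch of the round trip, for illustration only. The function names are made up, and it assumes the working file really is valid in the declared encoding; a wrongly declared encoding makes the decode step raise an error rather than pass silently.

```python
def normalize_to_utf8(working_bytes, declared_encoding):
    """Working copy -> repository form (assumed here to be UTF-8)."""
    return working_bytes.decode(declared_encoding).encode("utf-8")

def restore_from_utf8(utf8_bytes, declared_encoding):
    """Repository form -> working copy in the declared encoding."""
    return utf8_bytes.decode("utf-8").encode(declared_encoding)

original = "Jesper Steen Møller".encode("iso-8859-1")
stored = normalize_to_utf8(original, "iso-8859-1")
restored = restore_from_utf8(stored, "iso-8859-1")

# The round trip is byte-for-byte identical, with no human review involved.
assert restored == original
```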

If the full proposal is too much, we could initially consider only
svn:encoding="native". It's just that the generalization is (almost) free.

-Jesper Steen Møller

Received on Thu Mar 30 08:54:17 2006
