Re: Proposal for supporting WC file content encoding

From: Jesper Steen Møller <jesper_at_selskabet.org>
Date: 2006-03-29 07:52:28 CEST

Julian Foad wrote:

> Jesper Steen Møller wrote:
>
>> I'm proposing to add functionality for "handling" encoding in the
>> text content which Subversion handles.
>
> This proposal looks generally quite promising, with the potential to
> introduce some useful and practical behaviours, but I'm not exactly
> sure what you are aiming to achieve. You wrote about the
> implementation method that you have chosen, but did not say what you
> want users to be able to do, or why. What are the user-oriented
> goals? To help describe the goals, it might be helpful to include
> some "use cases", i.e. realistic concrete examples (like transcripts)
> that demonstrate the various ways in which the user can interact with
> this feature.

Sure enough. The case that made me think about this in the first place
is in fact a problem seen with CVS in the Eclipse WTP project, where
most developers were working with their Java source files on Windows,
while some developers (and, in the concrete example, a build environment)
were using Unix/Linux with UTF-8 as the native charset. The Java
compiler expects source files to be in the native encoding, so the build
failed. Many other applications (GCC, for example) expect the same.

This is a situation that is not likely to go away just yet.

While I was drafting a proposal for adding a property just for
svn:text-encoding-style = native (mimicking the EOL stuff), it occurred
to me that I was just dealing with a specialized case, and I saw that
people were also requesting text support for UTF-16 and UTF-32, but that
it was argued that Subversion basically dealt with text as byte-oriented
character data.

By allowing svn:text-encoding-style = native | <encoding-name>, support
for those encodings would come almost for free, since the current
diff/merge functionality would be pretty much retained (we'd normalize
to UTF-8 before operating on the files).
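
To make the normalization step concrete, here is a minimal sketch in
Python (not Subversion code). The function names are made up for
illustration, and I'm assuming that "native" simply means the locale's
preferred encoding on the client:

```python
import locale


def resolve_encoding(style: str) -> str:
    """Map a property value to a codec name (hypothetical helper).

    Assumption: "native" means the client's locale encoding.
    """
    if style == "native":
        return locale.getpreferredencoding(False)
    return style


def to_normal_form(wc_bytes: bytes, style: str) -> bytes:
    """Working copy -> repository: transcode to the UTF-8 normal form."""
    return wc_bytes.decode(resolve_encoding(style)).encode("utf-8")


def to_working_copy(repo_bytes: bytes, style: str) -> bytes:
    """Repository -> working copy: transcode from UTF-8 to the WC encoding."""
    return repo_bytes.decode("utf-8").encode(resolve_encoding(style))
```

A file that cannot be decoded in the declared encoding raises
UnicodeDecodeError here; what the client should actually do in that case
is one of the corner cases still to be worked out.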

> [...]
>
>> 'svn diff' between WC and pristine would convert the WC file up to the
>> "enriched" level before feeding to the diff libraries (Not sure how
>> this would be handled for external diff packages, it might have to
>> save to a temp. file)
>
> So 'svn diff' would display its output in UTF-8 regardless of the
> encoding of the files. I can see how this could be useful for people
> wanting a visual display of changes, especially when the diff includes
> files with different encodings. Was that one of your goals? However,
> people often want to use the output of "svn diff" as the input to a
> standard "patch" program, and this would prevent that from working.

'svn diff' could encode back into the desired text format (on output),
so you'd have the same result as when diffing two WC versions of the
file. You'd get an ambiguity when diffing between a "managed" and an
"unmanaged" encoding, though.
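
Roughly, the idea is to diff in the normalized form and re-encode the
result. A sketch in Python, using difflib as a stand-in for the real
diff library; the function name and the file labels are invented, and
both sides are assumed to share one managed encoding:

```python
import difflib


def diff_reencoded(pristine_utf8: bytes, wc_bytes: bytes, encoding: str) -> bytes:
    """Diff the UTF-8 pristine text against a WC file, emitting the
    result in the WC file's own encoding (hypothetical helper)."""
    # Bring both sides to the same level (decoded text) before diffing.
    old = pristine_utf8.decode("utf-8").splitlines(keepends=True)
    new = wc_bytes.decode(encoding).splitlines(keepends=True)
    diff = difflib.unified_diff(old, new, "pristine", "working copy")
    # Encode back on output so 'patch' sees the bytes it would find on disk.
    return "".join(diff).encode(encoding)
```

The ambiguous case is exactly where there is no single value to pass as
`encoding`.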

> There are already other ways in which diff output best suited for
> viewing is not the best output for using with "patch", such as whether
> to display a file-rename as an all-lines-deleted diff and an
> all-lines-added diff, or just as a statement saying that the file was
> renamed. Maybe we need to introduce a mode switch for "svn diff":
> human-readable mode versus "patch" mode, or preferably "svn patch"
> mode versus "conventional patch" mode.

Yes, that's one useful approach. I will have a look at the most
important corner cases.

>> The server (RA level) would only see the UTF-8 versions and would not
>> need any changes.
>
> When the RA method uses HTTP, I imagine some people will want the
> server to be able to serve the file to generic HTTP clients (web
> browsers) in its native (non-UTF8) encoding.

Yes, even that could be improved.

Today: everything is just marked with some default (is it Apache's
default? I seem to get ISO-8859-1 for everything).

The proposal:

Simple solution:
If svn:text-encoding is not set, send some default like today.
If svn:text-encoding is set, send charset=UTF-8 and it will work (since
that's how it's stored).

Advanced solution:
If svn:text-encoding is not set, send some default like today.
If svn:text-encoding is set:
1. Obey the client's setting of Accept-Charset
2. If svn:text-encoding is set to an encoding that the server supports,
convert (if required) and send that encoding.
3. If the native encoding is requested, allow the server to decide which
encoding that would be (I don't really see the sense in this).
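
The advanced selection could look something like the Python sketch
below. This is only one reading of the steps above: the function names,
the precedence (client preference first, then the property value, then
UTF-8 as stored) and the handling of "*" are my assumptions, not
anything the server does today:

```python
from typing import List, Optional, Set, Tuple


def parse_accept_charset(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Charset header into (charset, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        name, _, params = item.strip().partition(";")
        if not name.strip():
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs.append((name.strip().lower(), q))
    # sorted() is stable, so equal q values keep the header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_charset(accept_header: Optional[str], property_value: str,
                   supported: Set[str]) -> str:
    """Pick the charset to send for a file whose property is set."""
    if accept_header:
        for name, q in parse_accept_charset(accept_header):
            if q <= 0:
                continue
            if name == "*":
                break  # client accepts anything: fall through to the property
            if name in supported:
                return name
    wanted = property_value.lower()
    if wanted in supported:
        return wanted
    return "utf-8"  # stored form, always available without conversion
```

With "native" as the property value this falls through to UTF-8, which
is one way for the server to "decide".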

-Jesper

Received on Wed Mar 29 07:50:04 2006
