Re: [PATCH] #6 OS400/EBCDIC Port: Prevent OS conversion of file contents

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: 2006-02-22 23:59:05 CET

Paul Burba wrote:
> Julian Foad <julianfoad@btopenworld.com> wrote on 02/22/2006 01:50:32 PM:
>>
>>>[[[
>>>Every file in OS400 is tagged with a CCSID which represents its
>>>character encoding. On OS400 V5R4 when apr_file_open() creates a
>>>file, its CCSID varies depending on whether the APR_BINARY flag is passed or
>>>not. If APR_BINARY is passed, the file is created with CCSID 37
>>>(EBCDIC), if not, it has CCSID 1209 (UTF-8).

Thanks very much for the detailed explanation, Paul. I now understand and
accept the design of this patch.

>>So this "UTF-8 on EBCDIC system" version of APR has decided that "text" files
>>have UTF-8 content within the application and thus should be stored as UTF-8,
>>tagged as UTF-8.
>
> If I understand you correctly that's right, I'd explain it like this:
>
> - APR thinks text files have content encoded per the file's CCSID.
>
> - APR on OS400 V5R4 is built with what IBM calls "UTF support"
> so "within the application" text file contents are in CCSID
> 1208 (UTF-8).
>
> - If APR creates a NEW file WITHOUT the APR_BINARY flag the file
> has a CCSID of 1208 (UTF-8).
>
> - If APR creates a NEW file WITH the APR_BINARY flag the file
> has a CCSID of 37 (EBCDIC).

I suppose that's OK, if you think of it as just a default CCSID that the many
OS400 applications might want to use, but it's certainly not the value that
every caller will want, as you have discovered.

> - If APR opens an EXISTING file WITH the APR_BINARY flag the
> file's CCSID is ignored, reads/writes to the file are just bytes.
>
> - If APR opens an EXISTING file WITHOUT the APR_BINARY flag and
> reads from the file, the OS attempts to convert the file's contents
> from the file's CCSID to 1208.
>
> - If APR opens an EXISTING file WITHOUT the APR_BINARY flag and
> writes to the file, the OS attempts to convert the bytes written
> from 1208 to the file's CCSID.

Right - all these other points make sense now.
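
To make the rules in Paul's list concrete, here is a minimal C model of them.
It is only a sketch and does not call APR: every rule_* name is hypothetical,
the UTF-8 value 1208 is taken from the list above, and 37 stands in for
whatever EBCDIC CCSID the machine uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the discussion above; 37 is only one EBCDIC variant. */
enum { RULE_CCSID_EBCDIC = 37, RULE_CCSID_UTF8 = 1208 };

/* Hypothetical description of what happens to bytes on a read or write. */
typedef struct {
    bool converts;    /* true if the OS attempts a CCSID conversion */
    int  from_ccsid;  /* encoding of the bytes before the OS touches them */
    int  to_ccsid;    /* encoding of the bytes afterwards */
} rule_io_t;

/* CCSID tag given to a NEW file, depending on the APR_BINARY flag. */
int rule_new_file_ccsid(bool binary)
{
    return binary ? RULE_CCSID_EBCDIC : RULE_CCSID_UTF8;
}

/* Reading an EXISTING file: text mode converts file CCSID -> 1208. */
rule_io_t rule_read(bool binary, int file_ccsid)
{
    rule_io_t io;
    assert(file_ccsid > 0);
    io.converts   = !binary;
    io.from_ccsid = file_ccsid;
    io.to_ccsid   = binary ? file_ccsid : RULE_CCSID_UTF8;
    return io;
}

/* Writing an EXISTING file: text mode converts 1208 -> file CCSID. */
rule_io_t rule_write(bool binary, int file_ccsid)
{
    rule_io_t io;
    assert(file_ccsid > 0);
    io.converts   = !binary;
    io.from_ccsid = binary ? file_ccsid : RULE_CCSID_UTF8;
    io.to_ccsid   = file_ccsid;
    return io;
}
```

In this model, rule_read(true, 37) reports no conversion, while
rule_read(false, 37) reports a conversion from 37 to 1208.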

>>Applications on OS400 generally need a way to read and write EBCDIC files
>>(correctly tagged). APR decides that the "APR_BINARY" flag is going to select
>>EBCDIC mode rather than the more logical "binary" or "unknown" content
>>encoding. Oops. Now what is an application supposed to do that
>>wants to write files that are neither UTF-8 nor EBCDIC?
>
> IBM docs talk about a CCSID, 65535, used to indicate "untagged or
> hexadecimal data" but we've never seen it used. I wish I had an answer
> for you as to why it isn't used. The logic behind CCSID 37 for a binary
> file is somewhat understandable though, since the OS thinks the file is in
> EBCDIC and won't do any conversions on it when reading or writing. So you
> read a binary file, you just get the actual bytes, no text-aware
> conversions.

Er... unless you read the file on a machine running the German variant of
EBCDIC, in which case a few of the bytes do get converted :-) This doesn't
seem like something that can be relied upon. Presumably all applications that
don't need to interpret a file as text will open it in "binary" mode, like they
do on MS Windows, to avoid this problem.

If CCSID 37 is the de-facto default for "binary" files, then it's fair enough
for APR to apply that default, but I'd have thought APR would need to provide a
way to choose a different one.

> Yeah, it's just the creation of new files that is the problem, hopefully
> my comments above make this somewhat clearer.

OK, I understand now. This is how I would describe the situation:

* Some of the files Subversion handles have UTF-8 content, and others are
"binary" meaning the content and encoding are unknown. We want to read and
write all files in "binary" mode, because we haven't structured the code to
distinguish between the two cases.

* We want the CCSID tag to be "UTF-8" because that will be correct for the
files that are, and for the other files it is irrelevant or at least no worse
than any other fixed value.

* APR marks newly created "binary-mode" files as "EBCDIC". We want to override
this default and set the CCSID tag to "UTF-8".

* APR marks newly created "text-mode" files as "UTF-8".

* One way to get our new binary files marked as "UTF-8" is to create them in
APR's text mode and then re-open them in binary mode. This is a method that
happens to work for the CCSID that we want. If we had wanted any other CCSID
we'd have had to find a different way of setting it.

If you could add something condensed from the above to the new function's doc
string or implementation comments, that would dispel the reader's puzzlement
over why the file is created, closed and re-opened rather than, say, created
with the desired CCSID in the first place or changed to the desired CCSID after
creation.
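
As an illustration of the create, close and re-open sequence, here is a small
self-contained C model. It is a sketch under stated assumptions, not the
patch's code: the model_* names are hypothetical stand-ins for apr_file_open()
and the file system, and only the CCSID tag is modelled (37 for a
binary-mode create, 1208 for a text-mode create).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_CCSID_EBCDIC = 37, MODEL_CCSID_UTF8 = 1208 };

/* Hypothetical stand-in for a file on disk: only its CCSID tag matters. */
typedef struct {
    bool exists;
    int  ccsid;   /* set once at creation; re-opening never changes it */
} model_file_t;

/* Hypothetical stand-in for an open file handle. */
typedef struct {
    model_file_t *file;
    bool binary;  /* true: reads/writes are raw bytes, no conversion */
} model_handle_t;

/* Models an open call: a missing file is created and tagged by mode. */
model_handle_t model_open(model_file_t *file, bool create, bool binary)
{
    model_handle_t h;
    assert(file != NULL);
    assert(file->exists || create);  /* cannot open a missing file */
    if (!file->exists) {
        file->exists = true;
        file->ccsid = binary ? MODEL_CCSID_EBCDIC : MODEL_CCSID_UTF8;
    }
    h.file = file;
    h.binary = binary;
    return h;
}

/* The workaround: create in text mode to get the UTF-8 tag, close,
 * then re-open in binary mode so the OS converts nothing. */
model_handle_t model_create_utf8_tagged_binary(model_file_t *file)
{
    model_handle_t tmp = model_open(file, true, false);  /* create, text */
    (void)tmp;                    /* "close": nothing to release here */
    return model_open(file, false, true);                /* re-open, binary */
}
```

After model_create_utf8_tagged_binary() the file's tag is 1208 and the handle
is in binary mode, whereas a direct binary-mode create would leave the tag
at 37.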

I think APR is going to need a more general way of requesting a particular
CCSID on a newly created "binary" file. However, if IBM don't have any actual
customers who want to create non-EBCDIC files in binary mode, then they'll have
no reason to add such a facility, so we may always have to use this method.

>>>+ * When calling apr_file_open() with APR_BINARY and APR_CREATE on OS400
>>>+ * the new file has an ebcdic CCSID (e.g. 37). But the files created by
>
>>Did you mean "37" or "i.e. 37" instead of "e.g. 37"?
>
> No, I meant e.g.. 37 is the CCSID for COM EUROPE EBCDIC which is what the
> machine I work on uses. There are other EBCDIC variants like 273
> AUSTRIAN/GERMAN EBCDIC. On a machine using the latter, I'm assuming that
> apr_file_open() with APR_BINARY and APR_CREATE would result in a file with
> the CCSID of 273. Hence the e.g..

Right, OK. If you could phrase it the same way in the log message (if it needs
to be mentioned there at all) that would reduce confusion.

Apart from Philip's comments, I'm now happy with this approach or his suggested
variants of it, and the only other little thing I'd say is also about the log
message: begin it with an introduction that describes and justifies the change
briefly, like your other ones have done.

Thanks again.

- Julian

Received on Wed Feb 22 23:59:34 2006
