Re: [PATCH] #6 OS400/EBCDIC Port: Prevent OS conversion of file contents

From: Paul Burba <paulb_at_softlanding.com>
Date: 2006-02-22 22:40:59 CET

Julian Foad <julianfoad@btopenworld.com> wrote on 02/22/2006 01:50:32 PM:

Hi Julian,

Ok, here goes...

> Having read through this patch a few times, I feel I'm not getting the
> whole story about why this patch seems to be such an ugly work-around,
> creating the file in one mode, closing it and then re-opening it in
> another mode. I'll describe how I see it. Correct me when I go wrong.
>
> > [[[
> > Every file in OS400 is tagged with a CCSID which represents its
> > character encoding. On OS400 V5R4 when apr_file_open() creates a
> > file, its CCSID varies depending if the APR_BINARY flag is passed or
> > not. If APR_BINARY is passed, the file is created with CCSID 37
> > (EBCDIC), if not, it has CCSID 1209 (UTF-8).
>
> 1208 not 1209?

That's just a typo, 1208 is what I intended.

> So this "UTF-8 on EBCDIC system" version of APR has decided that "text"
> files have UTF-8 content within the application and thus should be stored
> as UTF-8, tagged as UTF-8.

If I understand you correctly, that's right. I'd explain it like this:

- APR thinks text files have content encoded per the file's CCSID.

- APR on OS400 V5R4 is built with what IBM calls "UTF support"
so "within the application" text file contents are in CCSID
1208 (UTF-8).

- If APR creates a NEW file WITHOUT the APR_BINARY flag the file
has a CCSID of 1208 (UTF-8).

- If APR creates a NEW file WITH the APR_BINARY flag the file
has a CCSID of 37 (EBCDIC).

- If APR opens an EXISTING file WITH the APR_BINARY flag the
file's CCSID is ignored, reads/writes to the file are just bytes.

- If APR opens an EXISTING file WITHOUT the APR_BINARY flag and
reads from the file, the OS attempts to convert the file's contents
from the file's CCSID to 1208.

- If APR opens an EXISTING file WITHOUT the APR_BINARY flag and
writes to the file, the OS attempts to convert the bytes written
from 1208 to the file's CCSID.
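
As a rough sketch, the rules above boil down to two small lookups. This is
only a model of the behavior described in this mail, not APR code: the
names ccsid_for_new_file(), conversion_for_existing_file() and conversion_t
are made up for illustration, and 37 stands in for whatever EBCDIC CCSID
the machine actually uses.

```c
#include <stdbool.h>

/* CCSID values named in this thread.  37 is only an example of an
 * EBCDIC CCSID; other machines may use a different EBCDIC variant. */
enum { CCSID_EBCDIC_EXAMPLE = 37, CCSID_UTF8 = 1208 };

/* Hypothetical type for this sketch: what the OS does to the data. */
typedef enum {
  CONV_NONE,          /* bytes pass through untouched */
  CONV_FILE_TO_UTF8,  /* read: file's CCSID -> 1208 */
  CONV_UTF8_TO_FILE   /* write: 1208 -> file's CCSID */
} conversion_t;

/* CCSID tag a NEW file receives, depending on the binary flag. */
int ccsid_for_new_file(bool binary_flag)
{
  return binary_flag ? CCSID_EBCDIC_EXAMPLE : CCSID_UTF8;
}

/* Conversion the OS attempts on an EXISTING file. */
conversion_t conversion_for_existing_file(bool binary_flag, bool is_write)
{
  if (binary_flag)
    return CONV_NONE;  /* the file's CCSID is ignored */
  return is_write ? CONV_UTF8_TO_FILE : CONV_FILE_TO_UTF8;
}
```

The mismatch is visible in the first function: the binary flag is the only
way to avoid conversion, but it is also what gives a new file the EBCDIC
tag.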

> The tag for a text file must be correct otherwise other applications -
> text editors, etc. - would read the file as garbage, even though the
> originating application might be able to read it by ignoring the tag.

Exactly. That's the reason for this patch. We've used this approach
since the original EBCDIC port. Essentially, since subversion is creating
only UTF-8 and/or binary files we want the files tagged correctly so tools
outside of subversion that read/write from/to them have a fighting chance
to operate correctly.

Consider a file like svnserve.conf or a hook script template which a user
opens in a text editor. Without this patch the file is tagged as CCSID
37. When the user opens the file with a text editor they will likely get
an error. Or worse, their text editor might convert whatever they enter
to EBCDIC when they save it.

Because the APR_BINARY flag is now used on all apr_file_open() calls,
Subversion would in all likelihood still work correctly on OS400 even with
the wrong tag, since no OS translation will occur based on file CCSID. We
would have to test this to be sure. That being said, it still makes a lot
of sense for the files to be correctly tagged so that other applications
can use Subversion-created files correctly.

> Applications on OS400 generally need a way to read and write EBCDIC
> files (correctly tagged). APR decides that the "APR_BINARY" flag is going
> to select EBCDIC mode rather than the more logical "binary" or "unknown"
> content encoding. Oops. Now what is an application supposed to do that
> wants to write files that are neither UTF-8 nor EBCDIC?

IBM docs talk about a CCSID, 65535, used to indicate "untagged or
hexadecimal data" but we've never seen it used. I wish I had an answer
for you as to why it isn't used. The logic behind CCSID 37 for a binary
file is somewhat understandable though, since the OS thinks the file is in
EBCDIC and won't do any conversions on it when reading or writing. So when
you read a binary file, you just get the actual bytes, with no text-aware
conversions.

> In APR_BINARY mode, this APR translates between EBCDIC on disk and
> UTF-8 on the application side when reading and writing, so we can't use
> this mode for binary files. No, I must have gone wrong already; that
> would be too silly. APR must behave differently on file create from how
> it behaves on other file operations.
>
> Please tell me more.

Yeah, it's just the creation of new files that is the problem. Hopefully
my comments above make this somewhat clearer.
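
To make the create-then-reopen idea concrete, here is a minimal sketch of
how the flags for the two steps could be derived. It is an illustration
under assumptions, not the patch itself: the SKETCH_* bits are stand-ins
for the real APR_* constants, and plan_open() and open_plan_t are
hypothetical names. It also assumes the exclusive flag has to be dropped
for the second open because the file exists by then.

```c
/* Stand-in flag bits, for illustration only.  Real code would use the
 * APR_CREATE, APR_EXCL and APR_BINARY constants from apr_file_io.h. */
enum {
  SKETCH_CREATE = 1 << 0,
  SKETCH_EXCL   = 1 << 1,
  SKETCH_BINARY = 1 << 2,
  SKETCH_WRITE  = 1 << 3
};

/* Hypothetical result type: which flags to use for each open. */
typedef struct {
  int two_steps;     /* nonzero if create and reopen must be separate */
  int create_flags;  /* flags for the first open (creates the file) */
  int reopen_flags;  /* flags for the open the caller actually uses */
} open_plan_t;

open_plan_t plan_open(int requested)
{
  open_plan_t plan;

  /* Only "create + binary" produces a wrongly tagged file. */
  plan.two_steps = (requested & SKETCH_CREATE) != 0
                   && (requested & SKETCH_BINARY) != 0;
  if (!plan.two_steps)
    {
      plan.create_flags = requested;
      plan.reopen_flags = requested;
      return plan;
    }

  /* Step 1: create without the binary flag so the new file is tagged
   * with CCSID 1208. */
  plan.create_flags = requested & ~SKETCH_BINARY;

  /* Step 2: reopen the now-existing file with the binary flag so no
   * conversion happens; creating exclusively would fail at this point. */
  plan.reopen_flags = requested & ~(SKETCH_CREATE | SKETCH_EXCL);
  return plan;
}
```

Between the two steps the file is closed, so the second open sees an
existing file whose CCSID is then ignored.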

> > Since subversion creates files with either binary or UTF-8 content and
> > all calls to apr_file_open() in subversion use APR_BINARY, these files
> > are incorrectly tagged.
>
> So the solution is to tag all new files as UTF-8 (achieved by creating
> them without APR_BINARY), and then reading/writing them with APR_BINARY.
> That will work for both UTF-8 and binary/unknown files because the
> encoding tag is correct for text files and irrelevant for binary/unknown
> files, and no translation will be done, yes? No, that doesn't make sense.
> APR_BINARY means EBCDIC on disk, and therefore translation during
> read/write, doesn't it?
> Aargh!

Ok, we agree on that.

> Is this all just a bug in APR?

> > Simply not using APR_BINARY on OS400 when opening a file isn't an
> > option, because in this case the OS attempts to convert the file's
> > contents from its CCSID to UTF-8 when reading the file and vice-versa
> > when writing to it. This has obvious problems if the file contains
> > binary data.
> >
> > This patch ensures files *created* via svn_io_file_open() and
> > svn_io_open_unique_file2() are tagged with a CCSID of 1208.
>
>
> > +/* Helper function for apr_file_open() on OS400.
> > + *
> > + * When calling apr_file_open() with APR_BINARY and APR_CREATE on OS400
> > + * the new file has an ebcdic CCSID (e.g. 37). But the files created by
>
> Did you mean "37" or "i.e. 37" instead of "e.g. 37"?

No, I meant "e.g.". 37 is the CCSID for COM EUROPE EBCDIC, which is what
the machine I work on uses. There are other EBCDIC variants, like 273
AUSTRIAN/GERMAN EBCDIC. On a machine using the latter, I'm assuming that
apr_file_open() with APR_BINARY and APR_CREATE would result in a file with
the CCSID of 273. Hence the "e.g.".

> > + /* Whether or not APR_EXCL is set or not, we want to unset it before the
>
> Too many "or not"s.

Now that I can explain! It's a case of poor proofreading. I can even fix
it really easily.

Paul B.

Received on Wed Feb 22 22:41:34 2006
