Porting Subversion to EBCDIC

From: Mark Phippard <MarkP_at_softlanding.com>
Date: 2005-02-10 20:38:15 CET

In writing a reply to Julian about our use of escape characters in our
EBCDIC port, it seemed like a good time to offer up a more detailed
explanation of what we are doing and where we are.

You can download our current patch against the 1.1.x branch here:

http://support.softlanding.com/ebcdic.diff

The patch is 362K.

We are working against the branch because we needed something stable to
work off and we also plan on releasing this pretty soon to OS/400 users.
Currently, we use svk to keep a mirror of the Subversion repository that
we can commit against and take advantage of Subversion for managing this
work. It would be great if we could get an "ebcdic" branch in the real
repository where we could start hosting this code. It would make it
easier for us when we need to catch up to trunk as I am currently only
mirroring the branch.

The patch is still a work in progress. Our immediate goal is only to
allow a Subversion server to be hosted on OS/400. With that in mind, we
have not touched ra_dav, or most of the client or working copy code. We
also have not ported BDB, and are just planning on using fsfs. The patch
is complete for svnadmin and svnserve. We are still in the process of
getting mod_dav_svn working, although we have made a lot of progress in
recent weeks. It is by far the most challenging part of the port as so
much of the code is out of our control.

With all of that out of the way, here is our attempt to explain what we
are doing, and the issues we have had to solve in porting to EBCDIC. One
final thing to keep in mind: on OS/400, Apache and APR are supplied by
IBM and are not completely open source. We have to live with what they
give us and how they have implemented things.

Our patch makes a few general assumptions in order to run in an ebcdic
environment:

A) We separate the code into four logical groupings:

MOD_DAV_SVN: \subversion\mod_dav_svn\*.c

SVN: All subversion code that's not in \subversion\mod_dav_svn\

APACHE: Mod_dav and Apache code.

APR: APR and C Standard Library functions.

B) MOD_DAV_SVN - Here we assume all strings/chars are ebcdic.

C) SVN - Here we assume all strings/chars are utf-8.

D) APACHE & APR - Here we assume all strings/chars are ebcdic. Strings
passed to these functions may need to be in ebcdic if the semantics of the
string matter. e.g. calling atoi(const char *str) needs an ebcdic encoded
str to work properly, but strchr(const char *str, int c) is just searching
for a byte pattern and doesn't care what encoding is used.

E) Strings passed between these groups may need conversion:

   ebcdic ------------> utf-8
   MOD_DAV_SVN --> calls --> SVN

   utf-8 ------------> ebcdic
   SVN --> calls --> APR

   "Conversion" may involve strings passed as arguments, strings returned
by the function, or char ** args. You may be asking, "How does one
convert non-ascii utf-8 to ebcdic without losing information?" IBM uses a
"utf-8esque" encoding scheme similar to unicode's utf-ebcdic
specification.

F) Strings passed between these groups should share the same encoding and
need no special handling:

   ebcdic ------------> ebcdic
   APACHE --> calls --> MOD_DAV_SVN
   MOD_DAV_SVN --> calls --> APACHE
   MOD_DAV_SVN --> calls --> APR
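
As a rough illustration of the distinction drawn in D, the sketch below contrasts a routine whose result depends on what the bytes mean (parsing digits) with one that only matches a byte pattern (strchr). The helper names demo_ascii_to_long and demo_find_slash, and the demo_path constant, are hypothetical and are not part of the patch. The parser compares against numeric code points instead of calling atoi, which is one way to stay independent of the host character set; the patch itself converts the string to ebcdic and calls the C library instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the patch): parse the leading decimal digits
   of a string known to be ASCII/UTF-8 encoded.  It compares against the
   numeric code points 0x30..0x39 rather than '0'..'9', so the result does
   not depend on the compiler's execution character set the way atoi()
   does. */
static long demo_ascii_to_long(const char *s)
{
  long value = 0;

  while ((unsigned char)*s >= 0x30 && (unsigned char)*s <= 0x39)
    {
      value = value * 10 + ((unsigned char)*s - 0x30);
      s++;
    }
  return value;
}

/* Hypothetical helper: offset of the first ASCII '/' (0x2F), or -1.
   strchr() only matches a byte value, so it behaves the same whatever
   encoding the host uses, as long as the needle is given as a code point. */
static long demo_find_slash(const char *s)
{
  const char *p = strchr(s, 0x2F);

  return p ? (long)(p - s) : -1;
}

/* "42/trunk" spelled as hex-escaped ASCII, so the bytes are identical on
   an ASCII host and on an EBCDIC host. */
static const char demo_path[] = "\x34\x32\x2F\x74\x72\x75\x6E\x6B";
```

With these definitions, demo_ascii_to_long(demo_path) yields 42 and demo_find_slash(demo_path) yields 2 on any host, whereas atoi(demo_path) would only give 42 where the execution character set is ASCII-compatible.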
__________________________________________________

To meet these assumptions we use five core approaches:

1) "Global" symbolic constants in svn_utf.h for commonly used char and
string literals, where the literal is a hex-escaped ascii value:

   e.g. #define SVN_UTF8_FSLASH '\x2F' /* '/' */
        #define SVN_UTF8_FSLASH_STR "\x2F" /* "/" */

   At the time this seemed the most logical place to put these, but
svn_ctype.h probably makes more sense. Where the code implicitly assumes
ascii values when using char or string literals, these would be used
instead.

   e.g. in path.c's const char *svn_path_internal_style (const char *path,
apr_pool_t *pool):
        - if ('/' != SVN_PATH_LOCAL_SEPARATOR)
        + if (SVN_UTF8_FSLASH != SVN_PATH_LOCAL_SEPARATOR)

2) Also in svn_utf.h, ascii aware macros to replace apr_isalpha,
apr_isdigit, apr_isspace, apr_isxdigit, and tolower if compiled on an
ebcdic system (determined by the value of APR_CHARSET_EBCDIC from apr.h).

   e.g. #if !APR_CHARSET_EBCDIC
          #define APR_IS_ASCII_DIGIT(x) apr_isdigit(x)
        #else
          #define APR_IS_ASCII_DIGIT(x) ( (unsigned char)(x) >= SVN_UTF8_0 && \
                                          (unsigned char)(x) <= SVN_UTF8_9 )
        #endif

   Where the code calls these functions, the apr_* call is replaced with
the macro.

3) "Private" symbolic constants in *.c files for commonly used string
literals in that file, where the literal is a hex-escaped ascii value:

   e.g. In fs_fs.c:
        /* Names of special files and file extensions for transactions */
        #define PATH_CHANGES \
        "\x63\x68\x61\x6e\x67\x65\x73"
        /* "changes" - Records changes made so far */

   We didn't put these in svn_utf.h per approach 1 as it seemed the list
would become absurdly large. Nor did we want to have multi-line hex
escapes cluttering the code.

4) Large blocks of string literals are converted to utf-8 with IBM's
convert pragma.

   e.g. #if APR_CHARSET_EBCDIC
        #pragma convert(1208)
        #endif
         static const char * const readme_contents =
         "This is a Subversion repository; use the 'svnadmin' tool to examine"
         APR_EOL_STR
         ...
         "Visit http://subversion.tigris.org/ for more information."
         APR_EOL_STR;
        #if APR_CHARSET_EBCDIC
        #pragma convert(37)
        #endif

5) APR_CHARSET_EBCDIC dependent code blocks in the subversion code convert
strings where assumption 'E' is relevant.

   e.g. In fs_fs.c's read_rep_offsets (representation_t **rep_p, char
*string, const char *txn_id, svn_boolean_t mutable_rep_truncated,
apr_pool_t *pool), SVN_STR_TO_REV (which is just atol) needs an ebcdic
string:
        ...
          str = apr_strtok (string, SVN_UTF8_SPACE_STR, &last_str);
          if (str == NULL)
            return svn_error_create (SVN_ERR_FS_CORRUPT, NULL,
                                     _("Malformed text rep offset line in node-rev"));
        #if APR_CHARSET_EBCDIC
          SVN_ERR (svn_utf_cstring_from_utf8 (&str, str, pool));
        #endif
          rep->revision = SVN_STR_TO_REV (str);
        ...
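
To make approaches 1 and 2 a little more concrete, here is a minimal sketch of ASCII-aware character tests built on hex-escaped constants. Every name in it (the DEMO_UTF8_* constants and the demo_ascii_* functions) is hypothetical and stands in for the SVN_UTF8_* constants and APR_IS_ASCII_* style macros in the patch; it is not the patch's actual code. It uses small functions rather than macros so that the argument is evaluated only once.

```c
#include <assert.h>

/* Hypothetical stand-ins for the SVN_UTF8_* constants: each literal is a
   hex-escaped ASCII value, so it is the same byte on any host. */
#define DEMO_UTF8_0 '\x30'  /* '0' */
#define DEMO_UTF8_9 '\x39'  /* '9' */
#define DEMO_UTF8_A '\x41'  /* 'A' */
#define DEMO_UTF8_F '\x46'  /* 'F' */
#define DEMO_UTF8_Z '\x5A'  /* 'Z' */
#define DEMO_UTF8_a '\x61'  /* 'a' */
#define DEMO_UTF8_f '\x66'  /* 'f' */
#define DEMO_UTF8_z '\x7A'  /* 'z' */

/* True if C is an ASCII decimal digit. */
static int demo_ascii_isdigit(int c)
{
  unsigned char u = (unsigned char)c;
  return u >= DEMO_UTF8_0 && u <= DEMO_UTF8_9;
}

/* True if C is an ASCII upper-case letter. */
static int demo_ascii_isupper(int c)
{
  unsigned char u = (unsigned char)c;
  return u >= DEMO_UTF8_A && u <= DEMO_UTF8_Z;
}

/* True if C is an ASCII letter of either case. */
static int demo_ascii_isalpha(int c)
{
  unsigned char u = (unsigned char)c;
  return demo_ascii_isupper(c) || (u >= DEMO_UTF8_a && u <= DEMO_UTF8_z);
}

/* True if C is an ASCII hexadecimal digit. */
static int demo_ascii_isxdigit(int c)
{
  unsigned char u = (unsigned char)c;
  return demo_ascii_isdigit(c)
         || (u >= DEMO_UTF8_A && u <= DEMO_UTF8_F)
         || (u >= DEMO_UTF8_a && u <= DEMO_UTF8_f);
}

/* Map an ASCII upper-case letter to lower case; other values pass through. */
static int demo_ascii_tolower(int c)
{
  if (demo_ascii_isupper(c))
    return (unsigned char)c + (DEMO_UTF8_a - DEMO_UTF8_A);
  return c;
}
```

Because the ranges are expressed as code points, these tests give the same answer for utf-8 data whether they are compiled on an ASCII system or an ebcdic one, which is the property the replacement macros are meant to provide.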

To answer some of Julian's specific questions:

> How extensive is this? Did you just need to do this for a few odd characters
> here and there, or does this involve replacing hundreds of literal strings and
> characters all over the code base?

1 through 4 are fairly extensive, but they are straightforward and less
intrusive in the sense that it is easy to understand what they are doing.
Number 5 is more intrusive, but maybe not as bad as imagined. Code that
sits "between" the groups described in A has a lot of APR_CHARSET_EBCDIC
dependent blocks; e.g. fs_fs.c has 36 blocks. Code that operates within a
group has few, if any; e.g. tree.c has none. These are off-the-cuff
examples; we haven't done an in-depth statistical analysis or anything.

> Bear in mind that I don't know whether
> EBCDIC has any overlap with ASCII, or what your other options are (such as
> controls to make certain parts be compiled with ASCII as the execution
> character set).

We have explored, and continue to explore, other approaches, but the
above, as intrusive as it may seem, has shown the most promise so far. Our
fundamental problem is that IBM Apache/MOD_DAV sends MOD_DAV_SVN a
request_rec with ebcdic strings and wants ebcdic strings sent back to it
(which are converted to utf-8 before being sent out on the wire). On the
other hand, we have repository files that contain utf-8 content. Barring
some way of making IBM Apache run in a utf-8 environment (which has been a
dead end thus far), somewhere between the two we need to convert strings.
We are very open to better ideas on where to do this and welcome any
feedback or suggestions, but this is where we stand today.
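
For readers who have not worked with EBCDIC, the toy sketch below shows why a conversion is needed at the boundary. It is only an illustration under stated assumptions: the functions demo_ascii_to_ebcdic37, demo_convert_string and demo_ebcdic_atol are hypothetical, the table covers only digits, lower-case letters, space and '/' in CCSID 37, and the real code calls svn_utf_cstring_from_utf8 and the C library instead. demo_ebcdic_atol imitates how a digit parser on an ebcdic host sees its input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy converter: map a small ASCII subset to EBCDIC CCSID 37.
   Anything outside the subset becomes the EBCDIC '?' (0x6F). */
static unsigned char demo_ascii_to_ebcdic37(unsigned char c)
{
  if (c >= 0x30 && c <= 0x39)                         /* '0'..'9' */
    return (unsigned char)(0xF0 + (c - 0x30));
  if (c >= 0x61 && c <= 0x69)                         /* 'a'..'i' */
    return (unsigned char)(0x81 + (c - 0x61));
  if (c >= 0x6A && c <= 0x72)                         /* 'j'..'r' */
    return (unsigned char)(0x91 + (c - 0x6A));
  if (c >= 0x73 && c <= 0x7A)                         /* 's'..'z' */
    return (unsigned char)(0xA2 + (c - 0x73));
  if (c == 0x20)                                      /* space */
    return 0x40;
  if (c == 0x2F)                                      /* '/' */
    return 0x61;
  return 0x6F;                                        /* '?' */
}

/* Convert a NUL-terminated ASCII string into DST (at most DST_SIZE bytes,
   always NUL-terminated when DST_SIZE > 0). */
static void demo_convert_string(const char *src, unsigned char *dst,
                                size_t dst_size)
{
  size_t i = 0;

  if (dst_size == 0)
    return;
  while (src[i] != '\0' && i + 1 < dst_size)
    {
      dst[i] = demo_ascii_to_ebcdic37((unsigned char)src[i]);
      i++;
    }
  dst[i] = 0x00;
}

/* Imitation of a digit parser on an ebcdic host: it only recognizes the
   EBCDIC digit code points 0xF0..0xF9. */
static long demo_ebcdic_atol(const unsigned char *s)
{
  long value = 0;

  while (*s >= 0xF0 && *s <= 0xF9)
    {
      value = value * 10 + (*s - 0xF0);
      s++;
    }
  return value;
}
```

Feeding the utf-8 bytes "\x31\x32\x33" straight to demo_ebcdic_atol gives 0, because 0x31 is not a digit in ebcdic; converting them first gives 123. That is the situation the APR_CHARSET_EBCDIC dependent blocks deal with.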

We have actually gone pretty far on several different approaches to this
port, including "building a wall" around Subversion that made it think it
was just working on a UTF-8 system. This approach was less intrusive on
the code, but added a lot of extra string conversion and also completely
fell apart when we got to mod_dav_svn. Brane suggested we just let
Subversion do the conversion and that inspired us to start over with the
above approach, which has yielded much better results.

We would certainly welcome any feedback on the approach; I realize it is a
lot to review. Also, I will just come back to my earlier question: would it
be possible to establish an ebcdic branch where we could work on this, and
does now seem like a good time to start that process?

Thanks

Mark

Received on Thu Feb 10 20:39:37 2005
