
investigations on issue 2464 (utf8proc library?)

From: B. Smith-Mannschott <benpsm_at_gmail.com>
Date: 2007-03-04 22:31:46 CET

In the matter of Issue 2464 (Canonicalize / stringprep UTF-8 filenames
to handle composed / decomposed differences shown by e.g. Mac OS X
HFS+)

I spent about 5 hours today poking around in the SVN sources. It
wasn't a very productive 5 hours because my knowledge of C doesn't go
much beyond K&R, and my knowledge of the SVN internals is nil.

The main reason I inflicted this pain on myself is because this bug
annoys the heck out of me. It prevents me from making productive use
of my Mac notebook at work. (We store human-readable documentation in
our archive. I don't think I'm going to get my German colleagues to
stop spelling file names the way they should be spelled in German,
sorry.)

To the uninitiated, the internals of Subversion are ... quite baffling,
but I finally arrived at path.c and utf.c. I could have gotten there
much faster if I'd started from the bottom, by looking at file names,
instead of starting at the top (with status-cmd.c).

path.c

         svn_path_cstring_from_utf8(...)
         svn_path_cstring_to_utf8(...)
         ...and friends

I think I understand that if we determine that APR is running on a
UTF-8 OS, then no string conversion is performed. e.g::

   svn_error_t *
   svn_path_cstring_to_utf8(const char **path_utf8,
                            const char *path_apr,
                            apr_pool_t *pool)
   {
     svn_boolean_t path_is_utf8;
     SVN_ERR(get_path_encoding(&path_is_utf8, pool));
     if (path_is_utf8)
       {
         *path_utf8 = apr_pstrdup(pool, path_apr);
         return SVN_NO_ERROR;
       }
     else
       return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
   }

In the case of Mac OS X, the conversion is not done since it's already
using UTF-8. We just copy the bytes as they are with apr_pstrdup() or
similar.

There is a cross-platform compatibility problem that arises from this,
however, since not all "UTF-8" systems use the same normalization.
Linux and Windows appear to use normalization form "C" (composed)
while Mac OS X uses normalization form "D" (decomposed). Strings
representing the same Unicode characters may not be byte-for-byte
identical if they are not normalized the same way.

The problem shows up when the Mac runs "status" on a working copy
containing files with accented characters such as "ü", since the file
names end up in a hash table, which treats them as byte-oriented
C strings.
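
To make the mismatch concrete, here is a minimal sketch of what a
byte-oriented lookup sees. The two byte sequences are the standard UTF-8
encodings of "ü" in the two normalization forms; lookup_bytes() is a
hypothetical stand-in for a hash table keyed on C strings, not anything
from the SVN sources.

```c
#include <stddef.h>
#include <string.h>

/* "ü" in form C: the single code point U+00FC, two bytes in UTF-8. */
static const char u_umlaut_nfc[] = "\xC3\xBC";

/* "ü" in form D: 'u' followed by U+0308 COMBINING DIAERESIS,
   three bytes in UTF-8. */
static const char u_umlaut_nfd[] = "u\xCC\x88";

/* Hypothetical stand-in for a table keyed on C strings: return the
   index of NAME in KEYS, or -1 if no key matches byte-for-byte. */
static int
lookup_bytes(const char *const *keys, size_t nkeys, const char *name)
{
  size_t i;

  for (i = 0; i < nkeys; i++)
    if (strcmp(keys[i], name) == 0)
      return (int) i;
  return -1;
}
```

With a table that holds only u_umlaut_nfc, looking up u_umlaut_nfd
misses, even though both names display as the same character.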

   I don't understand the code in utf.c well enough -- particularly
   get_xlate_handle_node, but I wonder what would happen if we just did
   the translation from platform character set to "utf8"
   unconditionally? Would this perhaps be a no-op on Linux and Windows
   and translate utf8.C <-> utf8.D on Mac?

The existing to_utf8 and from_utf8 routines in path.c should be
understood to map platform encoding not just to "any old" utf8, but to
utf8 norm C, specifically.

Couldn't we do something like this, if we had access to routines which
implement C and D normalization for *UTF-8*?

   svn_error_t *
   svn_path_cstring_to_utf8(const char **path_utf8,
                            const char *path_apr,
                            apr_pool_t *pool)
   {
     svn_boolean_t path_is_utf8;
     SVN_ERR(get_path_encoding(&path_is_utf8, pool));
     if (path_is_utf8)
       {
         const char *path_norm_c = path_apr;
   #ifdef PLATFORM_IS_UTF8_NORM_D
         /* norm_d2c() is hypothetical: it would return a pool-allocated
            copy of its argument converted from norm D to norm C. */
         path_norm_c = norm_d2c(pool, path_apr);
   #endif
         *path_utf8 = apr_pstrdup(pool, path_norm_c);
         return SVN_NO_ERROR;
       }
     else
       return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
   }
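
As a rough illustration of what a norm_d2c-style routine does at the byte
level, here is a toy sketch. It is not real Unicode normalization: it
only composes the six German umlauts (a, o, u, A, O, U followed by
U+0308), and it uses malloc() rather than an APR pool. The names
toy_norm_d2c() and composed_second_byte() are made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* For a base letter that has a precomposed umlaut form, return the
   second UTF-8 byte of that form (the first is always 0xC3).
   Return 0 for any other byte. */
static unsigned char
composed_second_byte(char base)
{
  switch (base)
    {
    case 'a': return 0xA4;  /* U+00E4 */
    case 'o': return 0xB6;  /* U+00F6 */
    case 'u': return 0xBC;  /* U+00FC */
    case 'A': return 0x84;  /* U+00C4 */
    case 'O': return 0x96;  /* U+00D6 */
    case 'U': return 0x9C;  /* U+00DC */
    default:  return 0;
    }
}

/* Toy composition: replace <letter, U+0308> with the precomposed
   umlaut and copy every other byte unchanged.  Returns a malloc'd
   string the caller must free, or NULL on allocation failure. */
static char *
toy_norm_d2c(const char *in)
{
  size_t len = strlen(in);
  char *out = malloc(len + 1);  /* composing never lengthens the string */
  size_t i = 0, o = 0;

  if (out == NULL)
    return NULL;

  while (i < len)
    {
      unsigned char second = composed_second_byte(in[i]);

      /* U+0308 is encoded as the bytes 0xCC 0x88.  The terminating
         NUL stops the checks before they can run past the end. */
      if (second != 0
          && (unsigned char) in[i + 1] == 0xCC
          && (unsigned char) in[i + 2] == 0x88)
        {
          out[o++] = (char) 0xC3;
          out[o++] = (char) second;
          i += 3;
        }
      else
        out[o++] = in[i++];
    }
  out[o] = '\0';
  return out;
}
```

A real implementation would need the full Unicode composition tables and
the canonical ordering rules, which is why a library seems like the
better route.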

Apple provides a call to do this kind of conversion, as documented at
http://developer.apple.com/qa/qa2001/qa1235.html . Unfortunately, it
uses either UCS-2 or UTF-16 (I can't tell from the header file). At
any rate, it's not UTF-8 so it would require additional converting and
byte shuffling.

Perhaps this would be a viable alternative:

   http://www.flexiguided.de/publications.utf8proc.en.html

The license appears to be "free" of one sort or another.

// Ben
Received on Sun Mar 4 22:32:10 2007

This is an archived mail posted to the Subversion Dev mailing list.
