Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: 2007-07-15 14:34:34 CEST

On 7/15/07, B. Smith-Mannschott <benpsm@gmail.com> wrote:
>
> On Jul 13, 2007, at 11:20, Thomas Singer wrote:
>
> > First there needs to be consensus *how* to fix it.
>
> This issue *really* annoys me, so I dug around the code some a while
> back despite lacking the C & APR skillz to actually fix it.
>
> http://svn.haxx.se/dev/archive-2007-03/0060.shtml
>
> It looks like SVN just blindly *assumes* that it's getting UTF-8
> (composed) when the underlying file system claims to be UTF-8:

I'm not sure the developers at the time knew about composed and
decomposed Unicode forms. Subversion assumes that all UTF-8 is just
that and that a name encoded in UTF-8 is uniquely identified by its
byte sequence. What we found out later (last year or so) is that this
is not true because of the composed/decomposed forms...
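
To make the difference concrete, here is a minimal sketch (not Subversion code; the names composed, decomposed and same_name_bytewise are made up for illustration) of the same visible name "ü" in both forms. A byte-wise comparison, which is what treating a UTF-8 name as an opaque byte sequence amounts to, sees two different names:

```c
#include <string.h>

/* "ü" precomposed: U+00FC, encoded in UTF-8 as the bytes C3 BC. */
static const char composed[] = "\xC3\xBC";

/* "ü" decomposed: 'u' (U+0075) followed by U+0308 COMBINING DIAERESIS,
   encoded in UTF-8 as the bytes 75 CC 88. */
static const char decomposed[] = "u\xCC\x88";

/* Hypothetical helper: compares two names purely by their bytes.
   Returns non-zero when the byte sequences are identical. */
static int
same_name_bytewise(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```

Calling same_name_bytewise(composed, decomposed) returns 0, even though both strings render as the same character.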

>
> svn_error_t *
> svn_path_cstring_to_utf8(const char **path_utf8,
>                          const char *path_apr,
>                          apr_pool_t *pool)
> {
>   svn_boolean_t path_is_utf8;
>   SVN_ERR(get_path_encoding(&path_is_utf8, pool));
>   if (path_is_utf8)
>     {
>       *path_utf8 = apr_pstrdup(pool, path_apr);
>       return SVN_NO_ERROR;
>     }
>   else
>     return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
> }
>
>
>
> Linux systems, as I understand them, just consider file names to be a
> sequence of bytes. They don't normalize the encoding either way. I
> think SVN only works there because the programs/libraries creating
> files on UTF-8 linux systems all just 'happen' to use UTF-8 (composed).
>
> The MacOS does standardize this. A file name is not just a 'bunch of
> bytes', it's always UTF-8 (decomposed).

Right. Many programs under Windows/Linux generate filenames in
composed form, and OS X is hurt by that convention because it
standardized on decomposed form. Maybe the problem doesn't occur if
you add the files on OS X and work with them on Linux/Windows, but
I'm not sure, because I don't know whether those systems actually
standardize on composed form.

> So, the proper thing to do is probably translate from UTF-8
> (decomposed) to UTF-8 (composed) at the interface between SVN and the
> underlying file system when running on a Mac, no?

Well, if the internal format gets an extra requirement
(composed/decomposed), we should make sure all input is actually in
that form - on all platforms, even if the input claims to be UTF-8.

> What would be wrong with solving the problem like this?:
>
> svn_error_t *
> svn_path_cstring_to_utf8(const char **path_utf8,
>                          const char *path_apr,
>                          apr_pool_t *pool)
> {
>   svn_boolean_t path_is_utf8;
>   SVN_ERR(get_path_encoding(&path_is_utf8, pool));
>   if (path_is_utf8)
>     {
>       *path_utf8 = apr_pstrdup(pool, path_apr);
>       if (PLATFORM_USES_DECOMPOSED_UTF8)
>         {
>           normalize_utf8_composed(path_utf8);
>         }
>       return SVN_NO_ERROR;
>     }
>   else
>     return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
> }
>
>
> void
> normalize_utf8_composed(const char **path_utf8)
> {
>   /* ... and then a miracle occurs ... */
> }
>
> Have I misunderstood the problem?

No, except that the problem is the part where the miracle occurs...
Someone needs to write it or to find a library which does it for us.
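
Just to show the shape of the missing piece, here is a toy sketch. It is an assumption-laden illustration, not a proposed implementation: toy_compose_umlauts and compose_with_diaeresis are hypothetical names, and the table covers only the six German umlauts (a/o/u and A/O/U followed by U+0308 COMBINING DIAERESIS). Real normalization to composed form needs the full Unicode composition data and has to handle reordering of multiple combining marks, which is exactly why a library would be preferable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: for an ASCII base letter followed by U+0308,
   return the second UTF-8 byte of the precomposed letter (the first
   byte is always 0xC3 for these six), or 0 if this toy table has no
   entry for the base letter. */
static unsigned char
compose_with_diaeresis(char base)
{
  switch (base)
    {
    case 'a': return 0xA4; /* U+00E4 */
    case 'o': return 0xB6; /* U+00F6 */
    case 'u': return 0xBC; /* U+00FC */
    case 'A': return 0x84; /* U+00C4 */
    case 'O': return 0x96; /* U+00D6 */
    case 'U': return 0x9C; /* U+00DC */
    default:  return 0;
    }
}

/* Copy SRC to DST, replacing each "base letter + U+0308" sequence
   (3 bytes: base, 0xCC, 0x88) that the table knows about with the
   2-byte precomposed letter.  All other bytes are copied unchanged.
   DST must have room for at least strlen(SRC) + 1 bytes; the result
   is never longer than the input. */
static void
toy_compose_umlauts(char *dst, const char *src)
{
  while (*src)
    {
      unsigned char second = 0;

      /* src[1] is safe to read because src[0] is not the terminator;
         src[2] is only read when src[1] is 0xCC. */
      if ((unsigned char) src[1] == 0xCC && (unsigned char) src[2] == 0x88)
        second = compose_with_diaeresis(src[0]);

      if (second)
        {
          *dst++ = (char) 0xC3;
          *dst++ = (char) second;
          src += 3;
        }
      else
        *dst++ = *src++;
    }
  *dst = '\0';
}
```

With this, a decomposed "Müller" (M, u, CC 88, l, l, e, r) comes out with the u and the combining mark replaced by the two bytes C3 BC. Anything outside the six-entry table passes through untouched, so this would not be a real fix.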

Bye,

Erik.

Received on Sun Jul 15 14:34:21 2007

This is an archived mail posted to the Subversion Dev mailing list.