
Re: question about subversion 1.9 unicode normalization status

From: Branko Čibej <brane_at_wandisco.com>
Date: Wed, 12 Aug 2015 11:00:54 +0200

On 12.08.2015 00:11, Dave Huang wrote:
> On Aug 11, 2015, at 15:35, Branko Čibej <brane_at_wandisco.com> wrote:
>> On 10.08.2015 18:46, Attila Soki wrote:
>>> hi,
>>>
>>> i saw the entry "reimplement UTF-8 fuzzy conversion using utf8proc (r1511676)"
>>> in the changelog and hoped this would be the fix for
>>> http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>>>
>>> but after a quick test it seems to be still broken.
>> In my not even a bit humble opinion, what's broken is Apple's HFS, not
>> Subversion.
> Exactly what is broken in Apple's HFS? MacOS uses one of the Unicode Normalization Forms. Perhaps it's not the same one that Windows uses, but there's nothing wrong with that.

Yay for misunderstandings. :)

The problem with HFS is that it normalizes paths: regardless of how your
file names are (de)normalised when you create them, they're stored in
HFS in NFD form.

For example, if someone on Linux or Windows creates a file named
"grölsch" and commits it, the Subversion client on the Mac will get a
broken working copy on the next update: you'll see "grölsch" on disk and
"grölsch" in the working copy database, but they'll be different strings.
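To make the "different strings" point concrete, here is a minimal Python
sketch. It assumes the committed name uses the precomposed form (NFC) and
that the name read back from disk is decomposed (NFD); the variable and
function names are purely illustrative and are not anything Subversion uses.

```python
import unicodedata

# Name as committed from Linux or Windows (assumed precomposed, NFC).
committed = unicodedata.normalize("NFC", "gr\u00f6lsch")

# Name as an NFD-normalizing filesystem would hand it back.
on_disk = unicodedata.normalize("NFD", committed)

# Both print as the same word, but the code points and bytes differ.
print(committed == on_disk)
print([hex(ord(c)) for c in committed])
print([hex(ord(c)) for c in on_disk])
print(committed.encode("utf-8"))
print(on_disk.encode("utf-8"))


def same_name(a, b):
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(committed, on_disk))
```

A byte-for-byte comparison treats the two names as unrelated, which is why
the working copy ends up inconsistent; only a comparison that normalizes
both sides first sees them as the same file.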

FWIW, HFS is the only filesystem I'm aware of that does this. Every
other filesystem, including all Windows filesystems, stores and returns
paths in the exact form they're given. This is true of mounted
filesystems on OSX, too; if you mount a remote ext4 filesystem via NFS,
it will behave differently in this respect than a native HFS volume. The
problem isn't even specific to Subversion; it's encountered by any
software on OSX that has to interact with other filesystems.
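One way to see which behaviour a given volume has is to create a file and
compare the name you asked for with the name the directory listing returns.
The following Python sketch does that; `fs_preserves_name` is a hypothetical
helper written for this illustration, and the result depends entirely on the
filesystem holding the directory you point it at.

```python
import os
import tempfile
import unicodedata


def fs_preserves_name(directory):
    """Return True if the directory listing returns the exact name we created."""
    requested = unicodedata.normalize("NFC", "gr\u00f6lsch")
    path = os.path.join(directory, requested)
    # Create an empty file under the precomposed (NFC) name.
    with open(path, "w"):
        pass
    try:
        # A filesystem that stores names verbatim lists the same string;
        # one that normalizes lists a different code point sequence.
        return requested in os.listdir(directory)
    finally:
        os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    preserved = fs_preserves_name(tmp)
    print("name preserved:", preserved)
```

On a filesystem that stores names as given this should report that the name
was preserved; on a volume that rewrites names to NFD the listing would
contain the decomposed form instead, so the check would fail.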

This is broken. The filesystem should not be in the business of changing
the (meta)data that it's supposed to store.

> While it's unfortunate that SVN didn't handle this correctly from the start, it doesn't make it Apple's fault.

See above. It's a fundamental design bug that ignores the common sense
of all other filesystem implementations.

> Unicode 2.0 talked about normalization/canonicalization in 1996, and TR 15 has been around since about the same time--both predating SVN's development by years. Of course, most people weren't thinking about Unicode back then, and a filename was considered to be some opaque string of bytes, so I don't particularly blame SVN either. If anything, Unicode should've just declared one canonical form instead of giving options. But while HFS(+) is old and is due for an overhaul, its use of Unicode NFD isn't broken.

So I'll skip commenting on all this because it's based on a fundamental
misunderstanding of what we're seeing here. Suffice it to say that
normalizing Unicode representations in databases is a very, very bad idea.

The bottom line is: to work around this bug, Subversion needs to make
changes on both the client side, which implies rather fundamental
changes in the working copy structure; and on the server side, to handle
requests made by older clients.

I'm working on this, but slowly because the changes are potentially very
destructive and there are other, far more important things to do.

-- Brane
Received on 2015-08-12 11:01:19 CEST

This is an archived mail posted to the Subversion Users mailing list.
