
Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Matthias Wächter <matthias.waechter_at_tttech.com>
Date: 2007-07-23 16:27:59 CEST

On 23.07.2007 12:20, Erik Huelsmann wrote:
> On 7/23/07, Matthias Wächter <matthias.waechter@tttech.com> wrote:
>> On 21.07.2007 01:42, Daniel A. Steffen wrote:
>> Right. Keeping a local 'matching table' between repository and
>> local file names could also be a solution for Windows users that are
>> stuck with repositories containing files with the same name, once
>> in lower case, once in upper case.
> This won't help: in the light of network mounts/drives, you can't be
> sure a drive on Windows is a Windows filesystem... You could be
> writing to an HFS+ drive.

One can imagine a lot of--old or new--file systems that don't adhere
to a 1:1 Unicode file name mapping for whatever reason. Any old
iso-8859-x 8-bit file system cannot support a (complete) 1:1 file
name mapping for Unicode, Windows does not allow two names that
differ only in upper/lower case, Mac OS normalizes in its own
one-way fashion, and what sits behind a network mount is hard to
determine automatically.

But there _are_ file systems that accept Unicode file names as they
are, normalized or not. I don't see a good reason for Subversion to
force normalization on those.

Pragmatically, one could see Subversion as a 'file system' on its
own. See it as a mounted 'network drive' you exchange data with
every time you do an update, commit etc. If you must consider
problems between this file system and your local representation, you
have to use some glue in between for proper file name exchange and
translation. And I think this glue must be an option for the client
that connects these two file systems. No one should be forced to
use normalization within Subversion without good cause.
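
To make the 'glue' idea concrete, here is a minimal sketch of a
client-side matching table. Everything in it is hypothetical and not
part of Subversion: the NameTable class, the fs_key callable (which
models how the local file system compares names) and the "~N" suffix
used to tell clashing names apart are all assumptions made for
illustration.

```python
from typing import Callable, Dict


class NameTable:
    """Hypothetical matching table between repository and local names."""

    def __init__(self, fs_key: Callable[[str], str]) -> None:
        # fs_key models how the local file system compares names,
        # e.g. str.lower for a case-insensitive file system.
        self._fs_key = fs_key
        self._repo_by_key: Dict[str, str] = {}
        self._local_by_repo: Dict[str, str] = {}

    def local_name(self, repo_name: str) -> str:
        """Return a local name that does not clash on this file system."""
        if repo_name in self._local_by_repo:
            return self._local_by_repo[repo_name]
        candidate = repo_name
        suffix = 0
        # Assumption: clashes are resolved by appending "~1", "~2", ...
        while self._fs_key(candidate) in self._repo_by_key:
            suffix += 1
            candidate = f"{repo_name}~{suffix}"
        self._repo_by_key[self._fs_key(candidate)] = repo_name
        self._local_by_repo[repo_name] = candidate
        return candidate

    def repo_name(self, local_name: str) -> str:
        """Map a local name back to the exact repository name."""
        return self._repo_by_key[self._fs_key(local_name)]


table = NameTable(str.lower)          # case-insensitive local file system
first = table.local_name("Alter")     # kept as is
second = table.local_name("alter")    # clashes, gets a different local name
```

The repository names stay untouched; only the client that needs the
translation pays for it, and the table lets it map the local name
back to the exact repository name on commit.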

> Treating this problem as a case-sensitivity issue is not really fair
> to the problem: there are 2 file names which mean exactly the same
> thing. While with case sensitivity users can actually *see* the
> difference between path names, here, it's not the case. It is not
> even *meant* to be the case: Unicode assigns the same meaning to
> "u" + "last letter with umlaut" and "u with umlaut", it's only the
> binary values that differ.

Whether NFC and NFD forms are *meant* to be the same is more a
matter of taste than something required by the standard, AFAICS.
They are simply different representations of the same output on
screen, and normalization (NFC or NFD) allows comparison between them.
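
As a small illustration of that comparison, the following sketch uses
Python's standard unicodedata module. The helper name same_after_nfc
is made up for this example.

```python
import unicodedata

composed = "\u00e4"      # "a with umlaut" as one precomposed code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed                        # code points differ
normalized_equal = same_after_nfc(composed, decomposed)   # equal after NFC

# The UTF-8 byte sequences differ as well.
utf8_composed = composed.encode("utf-8")       # b'\xc3\xa4'
utf8_decomposed = decomposed.encode("utf-8")   # b'a\xcc\x88'
```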

See one of my last posts about that. It's not only about
normalization (NFC, NFD), as the same on-screen text can be produced
by _any number_ of different Unicode strings using nonprintable
characters, direction-reversing characters and so on. Just because
most UI applications don't show mixed-direction strings in file
names correctly doesn't mean we are justified in ignoring that
issue. Btw, these control characters and strings modified by
them are _not_ covered by normalization at all.
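
A short sketch of that last point, again with Python's unicodedata
module. The file names are invented examples; ZERO WIDTH SPACE
(U+200B) and LEFT-TO-RIGHT MARK (U+200E) stand in for the wider class
of invisible characters.

```python
import unicodedata

plain = "readme.txt"
with_zwsp = "read\u200bme.txt"   # contains ZERO WIDTH SPACE
with_lrm = "readme\u200e.txt"    # contains LEFT-TO-RIGHT MARK


def distinct_after(form: str, a: str, b: str) -> bool:
    """True if a and b still differ after normalizing both to `form`."""
    return unicodedata.normalize(form, a) != unicodedata.normalize(form, b)


# Neither NFC nor NFD removes these invisible characters, so the
# names stay different although they typically render the same.
still_distinct = [
    distinct_after(form, plain, name)
    for form in ("NFC", "NFD")
    for name in (with_zwsp, with_lrm)
]
```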

Asking Subversion to treat as equivalent all Unicode file names
that appear the same on screen is nonsense. Just to be sure that I'm
correctly understood: I don't say that anyone has made this
pragmatic request, but it's the logical consequence of this
normalization debate.

> Subversion should compensate for that and treat
> the different values to mean the same thing.

Subversion could allow an interface to client-side filename
conversion (maybe plugins) that can be _manually_ selected if a user
wants to have a specific translation scheme. And with every
non-perfect translation scheme, there are cases that can lead to
different repository file names resulting in the same local file
name (and vice versa).
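
The clash mentioned above can be sketched as follows. No such plugin
interface exists in Subversion; find_collisions and the idea of
passing a translation scheme as a plain callable are assumptions for
illustration only.

```python
import unicodedata
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def to_nfd(name: str) -> str:
    """One possible scheme: decompose names, as Mac OS X does."""
    return unicodedata.normalize("NFD", name)


def find_collisions(
    repo_names: Iterable[str], translate: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group repository names that translate to the same local name."""
    by_local: Dict[str, List[str]] = defaultdict(list)
    for name in repo_names:
        by_local[translate(name)].append(name)
    return {local: names for local, names in by_local.items() if len(names) > 1}


umlaut_names = ["\u00c4lter", "A\u0308lter", "notes.txt"]
nfd_collisions = find_collisions(umlaut_names, to_nfd)

case_names = ["Alter", "alter", "notes.txt"]
case_collisions = find_collisions(case_names, str.lower)
identity_collisions = find_collisions(case_names, lambda name: name)
```

With the identity scheme nothing collides, which is the situation on
file systems that take names as they are.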

>> Then, one of these files could have a
>> slightly different local file name, and both could be checked out,
>> worked with etc.
>
> Why would you want 2 files, one of which is called "Älter", the
> other "A¨lter" and have them both be versioned?

I don't think that Subversion is in a position to argue about
that. I am pretty sure the same discussion was held in the Linux
file system team as well as the Windows file system team as well as
the Mac OS file system team. And only the Mac OS guys agreed to let
an automatic one-way file name normalization like NFD slip into
their library, and that's why we are--more precisely, the Mac OS X
users are--now in that mess.

> So, what happens if a Linux and a Mac client both add the same file,
> with an A-umlaut in it and then they both commit? Currently, they'll
> both be added to the repository. The Linux client will end up with 2
> files with the same name, the Mac client will not be able to update
> anymore.

I don't see the two files with the same-looking file name as the
issue. The issue is the Mac client (and its OS), just as the Windows
client is the issue after a Linux user has added "Alter" and "alter".

> As long as this is hardly visible on the outside, why shouldn't
> Subversion standardize on one or the other in its own little world
> (internally)?

Because (a) it is imperfect, as it covers only a very small fraction
of same-looking-but-different Unicode file name issues, and (b)
normalization is a one-way function that loses information. It is
not a VCS's job to enforce something like that, but the client may
offer some ways of conversion as an option if requested.
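
A brief sketch of point (b), using Python's unicodedata module: three
different code point sequences collapse into one normalized form, and
nothing in the result tells you which one was the original.

```python
import unicodedata

originals = [
    "\u212b",    # ANGSTROM SIGN
    "\u00c5",    # LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed)
    "A\u030a",   # "A" followed by COMBINING RING ABOVE
]

# All three inputs collapse to a single string under each form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in originals}
nfd_forms = {unicodedata.normalize("NFD", s) for s in originals}

# Going to NFD and back to NFC does not restore the ANGSTROM SIGN.
round_trip = unicodedata.normalize(
    "NFC", unicodedata.normalize("NFD", originals[0])
)
lost_original = round_trip != originals[0]
```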

- Matthias

Received on Mon Jul 23 16:27:05 2007

This is an archived mail posted to the Subversion Dev mailing list.