Re: Umlaut problem on Mac (composed vs. decomposed UTF-8)

From: Thomas Singer <subversion_at_smartcvs.com>
Date: 2007-07-23 19:55:47 CEST

From science back to the core problems. IMHO the following problems must be
solved (not necessarily more, but definitely not fewer):

1) it should be possible to handle files with umlauts in their names without
setting the "right" encoding

2) it should be possible to work on files with umlauts in their names on
Windows/Linux as well as on Mac OS X, no matter on what platform they were
added initially; on each platform the usual presentation must be used

Personally I don't care about some border cases that exist in Unicode, but I
care about characters of - at least - the western hemisphere (umlauts,
accents and so on). Let's find a way to solve that problem in SVN 1.5
without making a science out of it.

I'd suggest starting with a simple converter implementation like this:

http://72.9.228.230/svn/jsvn/trunk/svnkit/src/org/tmatesoft/svn/core/internal/wc/SVNFileListUtil.java

If it is not enough, one can add special character mappings later.
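
To make the idea concrete, here is a minimal Python sketch of what such a
simple converter might look like. It is not taken from the SVNKit file linked
above, which I have not reproduced here; the names compose_name and
list_directory_composed are made up for illustration, and the sketch assumes
that the composed form (NFC) is the presentation wanted on the repository side.

```python
import os
import unicodedata


def compose_name(name):
    """Return the file name in composed form (NFC)."""
    return unicodedata.normalize("NFC", name)


def list_directory_composed(path):
    """List a directory and report every entry in composed form.

    Hypothetical helper: a file system that hands back decomposed names
    (as described for Mac OS X in this thread) gets its listing converted
    before the names are compared with repository names.
    """
    return sorted(compose_name(entry) for entry in os.listdir(path))
```

Special character mappings, if they turn out to be needed, would go into
compose_name without touching the callers.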

--
Best regards,
Thomas Singer
_____________
SyntEvo GmbH
Brunnfeld 11
83404 Ainring
Germany
www.syntevo.com
Matthias Wächter wrote:
> On 23.07.2007 12:20, Erik Huelsmann wrote:
>> On 7/23/07, Matthias Wächter <matthias.waechter@tttech.com> wrote:
>>> On 21.07.2007 01:42, Daniel A. Steffen wrote:
>>> Right. Keeping a local 'matching table' between repository vs.
>>> local file names could also be a solution for Windows users that are
>>> busted with repositories containing files with the same name, once
>>> lower case, once upper case.
>> This won't help: in the light of network mounts/drives, you can't be
>> sure a drive on Windows is a Windows filesystem... You could be
>> writing to an HFS+ drive.
> 
> One can imagine a lot of--old or new--file systems that don't adhere
> to a 1:1 Unicode file name mapping for whatever reason. Any old
> iso-8859-x 8-bit file system doesn't support a (complete) 1:1 file
> name mapping for Unicode, Windows does not allow names that differ only
> in case, Mac OS X normalizes in its own one-way manner, and what's behind
> the network is hard to get a clear impression of, automatically.
> 
> But there _are_ file systems that are tolerant to the Unicode file
> names and whether normalization is used or not. I don't see a good
> cause for Subversion to enforce a normalization to those.
> 
> Pragmatically, one could see Subversion as a 'file system' on its
> own. See it as a mounted 'network drive' you exchange data with
> every time you do an update, commit etc. If you must consider
> problems between this file system and your local representation, you
> have to use some glue in between for proper file name exchange and
> translation. And I think, this glue must be an option for the client
> which connects these two file systems. No-one should be forced to
> use normalization within Subversion if he doesn't have good cause
> for it.
> 
>> Treating this problem as a case-sensitivity issue is not really fair
>> to the problem: there are 2 file names which mean exactly the same
>> thing. While with case sensitivity users can actually *see* the
>> difference between path names, here, it's not the case. It is not
>> even *meant* to be the case: Unicode assigns the same meaning to
>> "u" + "last letter with umlaut" and "u with umlaut", it's only the
>> binary values that differ.
> 
> Asking whether the NFC and NFD forms are *meant* to be the same is more a
> matter of taste than required by the standard, AFAICS. Those are
> simply different representations of the same output on screen, and
> normalization like NFC/NFD allows comparison between them.
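
A small Python sketch of that comparison, using only the standard unicodedata
module. The example name is the A-umlaut one used elsewhere in this thread;
the helper same_name is hypothetical and only illustrates the idea.

```python
import unicodedata

composed = "\u00c4lter"      # A-umlaut as a single code point (NFC form)
decomposed = "A\u0308lter"   # "A" followed by COMBINING DIAERESIS (NFD form)


def same_name(a, b):
    """Compare two names after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: the code points differ
print(same_name(composed, decomposed))  # True: equal after normalization
```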
> 
> See one of my last posts about that. It's not only about
> normalization (NFC, NFD), as every Unicode string can be represented
> by _any number_ of different Unicode strings using nonprintable
> characters, direction reversing characters and so on. Just because
> most UI applications don't show mixed-direction strings in file
> names correctly doesn't mean that we have a good position ignoring
> that issue. Btw, these control characters and strings modified by
> them are _not_ covered by normalization at all.
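
The point about control characters can be checked with a short Python sketch.
The file names are invented for the example, and differs_after is a
hypothetical helper; the only assumption is that U+200E LEFT-TO-RIGHT MARK is
a fair stand-in for the invisible characters meant above.

```python
import unicodedata

plain = "readme.txt"
marked = "readme\u200e.txt"  # same visible text plus a LEFT-TO-RIGHT MARK


def differs_after(form):
    """True if the two names still differ after normalizing both to `form`."""
    return unicodedata.normalize(form, plain) != unicodedata.normalize(form, marked)


for form in ("NFC", "NFD"):
    print(form, differs_after(form))
```

Both forms leave the mark in place, so the two names stay distinct although
they usually render identically.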
> 
> Asking for Subversion to give equivalence for all Unicode file names
> that appear the same on screen is nonsense. Just to be sure that I'm
> correctly understood: I don't say that anyone has made this
> pragmatic request, but it's the logical consequence of this
> normalization debate.
> 
>> Subversion should compensate for that and treat
>> the different values to mean the same thing.
> 
> Subversion could allow an interface to client-side filename
> conversion (maybe plugins) that can be _manually_ selected if a user
> wants to have a specific translation scheme. And with every
> non-perfect translation scheme, there are cases that can lead to
> different repository file names resulting in the same local file
> name (and vice versa).
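
As a rough sketch of how a client could at least detect such cases, assuming a
translation scheme is just a function from repository name to local name. The
names find_collisions and to_nfd are invented here and are not part of any
Subversion API; only the repository-to-local direction is covered.

```python
import unicodedata
from collections import defaultdict


def find_collisions(repository_names, translate):
    """Group repository names by translated local name; keep groups of 2+."""
    groups = defaultdict(list)
    for name in repository_names:
        groups[translate(name)].append(name)
    return {local: names for local, names in groups.items() if len(names) > 1}


def to_nfd(name):
    """One possible (assumed) translation scheme: decompose the name."""
    return unicodedata.normalize("NFD", name)
```

The same function covers the case-insensitive situation if str.lower is
passed as the translation, which is why a pluggable scheme looks attractive.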
> 
>>> Then, one of these files could have a
>>> slightly different local file name, and both could be checked out,
>>> worked with etc.
>> Why would you want 2 files, one of which is called "&Auml;lter", the
>> other "A&uml;lter" and have them both be versioned?
> 
> I don't think that Subversion is in the position to argue about
> that. I am pretty sure the same discussion was held in the Linux
> file system team as well as the Windows file system team as well as
> the MAC OS file system team. And only the MAC OS guys agreed to let
> an automatic one-way file name normalization like NFD slip into
> their library, and that's why we are--more precisely, the MAC OS X
> users are--now in that mess.
> 
>> So, what happens if a Linux and a Mac client both add the same file,
>> with an A-umlaut in it and then they both commit? Currently, they'll
>> both be added to the repository. The Linux client will end up with 2
>> files with the same name, the Mac client will not be able to update
>> anymore.
> 
> I don't see these two files with the same looking file name be the
> issue. The issue is the Mac client (and its OS). Like the Windows
> client is the issue after a Linux user added "Alter" and "alter".
> 
>> As long as this is hardly visible on the outside, why shouldn't
>> Subversion standardize on one or the other in its own little world
>> (internally)?
> 
> Because it is (a) imperfect as it only covers a very small fraction
> of same-looking-but-different Unicode file name issues and (b)
> normalization is a one-way function losing information. It is not a
> VCS's job to enforce something like that, but it may be an option
> for the client to allow some ways of conversion if requested.
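
To illustrate point (b), a minimal Python sketch with a character whose
original code point cannot be recovered once either normalization has been
applied. The choice of U+212B ANGSTROM SIGN is mine, not from the thread.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

nfc = unicodedata.normalize("NFC", angstrom_sign)
nfd = unicodedata.normalize("NFD", angstrom_sign)

print(nfc == a_with_ring)    # NFC replaces the sign with A WITH RING ABOVE
print(nfd == "A\u030a")      # NFD yields "A" + COMBINING RING ABOVE
print(nfc == angstrom_sign)  # the original code point is gone
```

Recomposing the NFD result gives U+00C5 again, never U+212B, so a name stored
in normalized form no longer says which of the two the user typed.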
> 
> - Matthias
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org
> 
> 
Received on Mon Jul 23 19:55:02 2007

This is an archived mail posted to the Subversion Dev mailing list.