
AW: Let's discuss about unicode compositions for filenames!

From: Markus Schaber <m.schaber_at_3s-software.com>
Date: Mon, 30 Jan 2012 16:05:09 +0000

Hi, Peter,

From: Peter Samuelson [mailto:peter_at_p12n.org]
>> [Stefan Sperling]
>> > We could also open the parent directory, read all the filenames
>> > within it, normalise them all, and then search the resulting list.
>> > This works, except if a name exists twice, once in NFC form and once
>> > in NFD form. We'd somehow have to solve the name collision in the
>> > filesystem.

> [Markus Schaber]
> > This sounds astonishingly similar to the lower/upper case problem of
> > UN*X vs. Mac/Win.

> There are similarities, but there are some important differences:

>- We have to support Mac OS X, which stores all files in NFD. In the
> upper/lowercase analogy, think of OS X as MS-DOS, which does not
> preserve mixed case at all but always represents files in uppercase.
> Subversion doesn't support MS-DOS and I hope we never need to. MS
> Windows, OTOH, at least preserves the upper/lowercase distinction
> presented to it when you create a file. Big difference.

Preserving case does not help that much. A simple "map everything to lower case when accessing the working copy, and search case-insensitively in the database" rule could solve that problem. But the repository can contain files whose names differ only in case, and then preserving the original case does not help either.
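To illustrate what I mean, here is a minimal sketch in Python (not Subversion code; `build_case_index` and `lookup` are names I made up for this mail, and a plain list of names stands in for the database). The lower-case mapping works until two repository names fold to the same key:

```python
from collections import defaultdict


def build_case_index(repo_names):
    """Map each lower-cased name to every repository name that folds to it."""
    index = defaultdict(list)
    for name in repo_names:
        index[name.lower()].append(name)
    return index


def lookup(index, wc_name):
    """Return the single repository name matching wc_name, ignoring case.

    Raises LookupError when there is no match or more than one match.
    """
    candidates = index.get(wc_name.lower(), [])
    if len(candidates) != 1:
        raise LookupError(
            f"{wc_name!r} matches {len(candidates)} names: {candidates}")
    return candidates[0]
```

With `["README", "Makefile"]` in the repository, `lookup(index, "readme")` finds `README`. With `["README", "Readme"]` the same call raises, and knowing the original case of either name does not tell the client which one was meant.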

>- Also, the Subversion platform has chosen to support files like README
> and Readme that conflict on Windows. Our reasoning is "if you have
> users on Windows, don't do that." Most solutions to the NFC/NFD
> problem will affect all platforms, not just one, and we probably
> can't just say "well, don't do that" - we'll need to actually prevent
> it - and somehow deal with existing clients, WCs, and repositories.

> Because of those differences, my gut feeling is that we can't treat the two issues in the same way.

There seem to be clients that allow files whose names differ only by encoding. So the position on "Unicode encoding collisions" could be the same as on "case insensitivity collisions": allow in the repository what the most capable clients allow. My guess is that the fixes for that scenario are rather similar: mainly client-based, specific to the capabilities of the platform, and "if you have users on Mac, don't do that". Of course, Mac clients need to map internally to their normalized encoding, much as is done for case sensitivity now, and when an encoding collision occurs they have lost (just as with case collisions on Mac and Windows).

If the position is to disallow files whose names differ only by encoding in the repository, things are a little different.

But I think that even this can be solved purely on the client, by only sending normalized names to the server for all new objects (imports, additions, copy targets, ...), and using the existing encodings for all existing objects.
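Roughly, the client-side rule could look like the following Python sketch. It is only an illustration under my own assumptions: NFC is picked as the normalized form (that choice is not decided anywhere in this thread), `name_to_send` is a hypothetical helper, and a plain list stands in for the names already present in the repository directory.

```python
import unicodedata


def name_to_send(name, existing_names):
    """Pick the spelling of `name` that the client sends to the server.

    existing_names: names already in the repository directory, in
    whatever normalization form they were committed with.
    """
    key = unicodedata.normalize("NFC", name)
    for existing in existing_names:
        if unicodedata.normalize("NFC", existing) == key:
            # Existing object: keep its stored form untouched.
            return existing
    # New object (import, addition, copy target): send the normalized form.
    return key
```

For example, a new file that the local filesystem reports as `"re\u0301sume\u0301.txt"` (NFD) would be sent as `"r\u00e9sum\u00e9.txt"` (NFC), while a file that already exists in the repository in NFD form keeps that form even if the user typed it in NFC.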

For existing collisions, which hamper work on Mac OS, the usual workarounds apply: rename the colliding files via the repo-browser or in a more capable client. Additionally, we could develop a dump filter tool for name normalization, maybe with a switch that selects whether to error out or silently rename on collisions.
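The core of such a filter might be sketched like this in Python. No such tool exists yet, so everything here is hypothetical: the function name, the `rename_on_collision` switch, the choice of NFC, and the numeric-suffix renaming scheme are all just assumptions for the sake of the example. It also ignores the actual dump file format and only works on a list of paths.

```python
import unicodedata


def normalize_paths(paths, rename_on_collision=False):
    """Return a {original: normalized} mapping for the given paths.

    With rename_on_collision=False a collision raises ValueError;
    with True the later path gets a numeric suffix instead.
    """
    taken = set()
    result = {}
    for path in paths:
        target = unicodedata.normalize("NFC", path)
        if target in taken:
            if not rename_on_collision:
                raise ValueError(f"normalization collision on {target!r}")
            # Find the first free "<name>.<n>" spelling.
            n = 1
            while f"{target}.{n}" in taken:
                n += 1
            target = f"{target}.{n}"
        taken.add(target)
        result[path] = target
    return result
```

Given `"caf\u00e9"` and `"cafe\u0301"` in the same directory, the default mode stops with an error, and the rename mode keeps the first path and maps the second to `"caf\u00e9.1"`.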

With proper documentation, this will cause the problem to fade out over time, and - in theory - it can be implemented on top of the first approach at a later time. I don't see any need to change anything on the server (both implicit conversion and rejection of invalid encodings would break existing clients and working copies). My personal guess is that actual encoding collisions are rather rare, and workarounds exist, so servers can start to reject invalid encodings with version 2.0, or whatever future version is allowed to break compatibility with old clients.

Best regards

Markus Schaber

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Received on 2012-01-30 17:05:49 CET

This is an archived mail posted to the Subversion Dev mailing list.