
RE: Unversioned files with invalid UTF-8 sequence in name confuse svn

From: Markus Schaber <m.schaber_at_codesys.com>
Date: Tue, 1 Mar 2016 14:07:19 +0000

Hi, Brane and Vincent,

From: Branko Čibej [mailto:brane_at_apache.org]
> >> A fairly plausible cause for getting the wrong representation is
> >> changing the locale for the duration of a script invocation. Another
> >> plausible way is to create files based on the contents of some
> >> script, which are not encoded as expected by the current locale.
> > However Subversion doesn't handle that (BTW it would be much better to
> > remember the expected locale by storing it in the .svn directory
> > rather than giving obscure error messages: if it did, Subversion would
> > know that the user was using an incorrect locale without any
> > ambiguity).
>
> And if the user changes the locale for valid reasons, the Subversion
> working copy would break in a different way.

I guess we would need some "change locale" operation, which would at least update the saved locale in the .svn directory.

(Updating the actual on-disk filenames could be left to the tools the user uses to also update his other filenames...)
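
To make the idea concrete, here is a minimal sketch of remembering the encoding and of a "change locale" operation. Everything in it is an assumption for illustration: the marker file name `locale-encoding` and the helpers `save_encoding` and `encoding_matches` are hypothetical and do not exist in Subversion.

```python
import codecs
import os

MARKER = "locale-encoding"  # hypothetical file name, not part of Subversion


def normalize(encoding: str) -> str:
    """Map aliases such as 'UTF8' or 'latin1' to one canonical codec name."""
    return codecs.lookup(encoding).name


def save_encoding(admin_dir: str, encoding: str) -> None:
    """Record the encoding; calling it again acts as the 'change locale' step."""
    with open(os.path.join(admin_dir, MARKER), "w", encoding="ascii") as f:
        f.write(normalize(encoding))


def encoding_matches(admin_dir: str, current: str) -> bool:
    """Return True if the current encoding equals the recorded one."""
    with open(os.path.join(admin_dir, MARKER), encoding="ascii") as f:
        saved = f.read().strip()
    return saved == normalize(current)
```

A caller could pass `locale.getpreferredencoding(False)` as the current encoding and refuse to touch the working copy on a mismatch, instead of failing later with an obscure error.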

> > Currently you can't avoid the problem: if the user has used UTF-8 then
> > runs Subversion under ISO-8859-1 locales, the "misconfiguration"
> > is not detected, and "svn up" can corrupt a working copy as
> > shown in the past. Subversion should remember the locale that was used
> > initially to avoid such a problem.
>
> Well? This issue isn't limited to Subversion; most applications will fail
> at some point once you start playing games with the locale and/or filename
> encoding. That's why both Windows and OS X mandate one of the Unicode
> representations for filenames.

Python actually adopted a workaround to this problem called "surrogate escaping".
https://www.python.org/dev/peps/pep-0383/

Python applies this mechanism to filenames and similar "byte strings" that cross the boundary to the outer world. Its only purpose is to carry the contents of the 8-bit string from one OS interface to another, with little interpretation or processing in between.

Basically, each invalid byte (one that cannot be decoded to the internal Unicode representation) is mapped to a lone surrogate code point, which is converted back to the original byte on the output side.
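
A short Python sketch of that round trip, using the standard `surrogateescape` error handler from PEP 383. The filename bytes are made up for illustration.

```python
# Filename bytes as they might come from the OS: ISO-8859-1, not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding rejects the stray 0xE9 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
```

The decoded string is not valid Unicode text, so it is only good for handing back to another OS interface, which is exactly the limitation described above.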

A solution like this could help SVN deal with miscoded filenames, and would allow e.g. "svn rm" or "svn mv" to operate on them.

When adopting such a solution, it should be strictly restricted to local filenames (the RA layers should refuse them), and I guess we could get away with not even allowing them to enter the local working copy database.
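
As a rough illustration of that restriction, a check like the following could sit at the boundary. It is only a sketch: `has_escaped_bytes` and `check_for_repository` are hypothetical names, and Subversion itself is written in C, so this just shows the logic.

```python
def has_escaped_bytes(name: str) -> bool:
    """Return True if name contains a PEP 383 escape (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def check_for_repository(name: str) -> str:
    """Hypothetical gate: refuse names that carry escaped bytes."""
    if has_escaped_bytes(name):
        raise ValueError("filename is not valid in the current encoding")
    return name
```

Purely local operations would skip the gate, while anything that sends a path to the repository would call it first.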

For screen output, we could translate them to escape sequences like \x1A, so "svn status" could work...
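
A possible shape for that translation, again as a Python sketch with a hypothetical helper name (`display_name`). It assumes the name was decoded with `surrogateescape`, so each escaped byte sits in U+DC80..U+DCFF.

```python
def display_name(name: str) -> str:
    """Render escaped bytes as \\xNN so the name is printable."""
    out = []
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte value from the lone surrogate.
            out.append("\\x%02X" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)
```

Note that the result is for display only: a name that already contains a literal backslash followed by "x" would look the same, so the output cannot be parsed back reliably.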

I'm not sure whether it's worth the work to support what are basically broken environments; on the other hand, the Python guys did go that way.

> You might as well say that Unix (Linux) is broken and should be fixed (with
> which I'd heartily agree, but that's water under the bridge).

All recent Linux installations I have seen use UTF-8 as their encoding (independent of the language / country settings actually in use). I don't see any valid reason to use anything else nowadays, except to stay compatible with existing installations...

Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: m.schaber@codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Received on 2016-03-01 15:09:15 CET

This is an archived mail posted to the Subversion Dev mailing list.
