Re: Unversioned files with invalid UTF-8 sequence in name confuse svn

From: Vincent Lefevre <vincent-svn_at_vinc17.net>
Date: Mon, 29 Feb 2016 20:45:14 +0100

On 2016-02-29 19:57:04 +0100, Branko Čibej wrote:
> On 29.02.2016 19:30, Vincent Lefevre wrote:
> > On 2016-02-29 17:00:01 +0100, Bert Huijben wrote:
> >> The problem is most likely not that they have an invalid utf-8 sequence in
> >> their name, but that your settings report that filenames are encoded in one
> >> way, while there is a file whose name can't be expressed by that format.
> >>
> >> You get this error when Subversion isn't able to convert the filename to its
> >> internal utf-8 format, which should be capable of expressing any valid
> >> filename. (If you declare that all filenames are utf-8, there wouldn't be a
> >> conversion, so in most cases not an error)
> >>
> >> To just handle it as unversioned as you suggest we need to at least be able
> >> to express its name.
> > There are two ways to express a filename:
> > 1. The one from the OS (e.g., in POSIX, this is just a sequence
> > of bytes).
>
> This isn't entirely correct. It's true as far as most (but certainly not
> all) filesystem implementations are concerned; but applications expect
> to interpret those bytes in the context of the active locale.

Not all applications. Most command-line utilities run fine without
having to interpret those bytes (which is very useful, in particular
for "rm"). The point is that they do not need to interpret them for
what they are required to do.
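
To illustrate what I mean, here is a minimal Python sketch (not code
from any of the tools mentioned; remove_all_raw is a name I made up)
that lists and deletes files while treating their names as opaque
bytes. It assumes a POSIX-like system where os.listdir() returns bytes
when given a bytes path, so no decoding ever takes place:

```python
import os

def remove_all_raw(directory: bytes) -> list:
    """Delete every regular file in `directory`, never decoding the names.

    `remove_all_raw` is a hypothetical helper for illustration only.
    """
    removed = []
    # A bytes argument makes os.listdir() return bytes names as stored
    # by the filesystem, whatever the current locale says.
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Since the names are never converted to text, a name that is not valid
in the current locale is handled like any other.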

> > 2. The one used by Subversion internally.
> >
> > (2) is necessary for versioned files, but for unversioned files,
> > you do not need to do the (1) -> (2) conversion.
>
> Sure you do. How else are you going to know that the file is
> unversioned? (The working copy database stores paths encoded as UTF-8.)

Well, you need to do the (1) -> (2) conversion only to test whether
the file is versioned or not. But if the (1) -> (2) conversion fails,
this means that the file is unversioned.
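
A rough sketch of that logic in Python (this is not Subversion's
actual code; classify and versioned_paths are hypothetical names, the
latter standing in for the working copy database, and the locale
encoding is assumed to be UTF-8):

```python
def classify(raw_name: bytes, versioned_paths: set) -> str:
    """Return "versioned" or "unversioned" for a name read from the OS.

    Hypothetical illustration: `versioned_paths` stands in for the
    working copy database, whose paths are stored as UTF-8 text.
    """
    try:
        # The (1) -> (2) conversion, assuming a UTF-8 locale.
        internal = raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # The name cannot be expressed internally, so it cannot match
        # any path in the database: report it as unversioned.
        return "unversioned"
    return "versioned" if internal in versioned_paths else "unversioned"
```

The failure branch only covers names that cannot be converted at all;
names that do convert are still looked up as usual.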

> > The problem is that it is too easy to create files with a name using
> > invalid UTF-8 sequences
>
> File names on disk DO NOT have to be represented in UTF-8. They do have
> to be represented consistently with the current locale settings.

which must in practice be UTF-8. Otherwise one gets failures sooner
or later.

> A fairly plausible cause for getting the wrong representation is
> changing the locale for the duration of a script invocation. Another
> plausible way is to create files based on the contents of some script,
> which are not encoded as expected by the current locale.

However Subversion doesn't handle that (BTW it would be much better
to remember the expected locale by storing it in the .svn directory
rather than giving obscure error messages: if it did, Subversion
would know that the user was using an incorrect locale without any
ambiguity).

> > (in my case, it seems just to be due to a bug in Automake or Libtool).
>
> Or the way you're using them, perhaps?

I eventually found that this is a bug in dash, which reexecutes
a command for a foreign architecture as a shell script instead of
giving an exec format error like the other shells.

> > But the user should not be required to find them and delete manually.
>
> It's also too easy to ignore (or delete) files because someone managed
> to misconfigure their locale.

Currently you can't avoid the problem: if the user has used UTF-8
then runs Subversion under ISO-8859-1 locales, the "misconfiguration"
is not detected, and "svn up" can corrupt a working copy as
shown in the past. Subversion should remember the locale that was
used initially to avoid such a problem.
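
As a sketch of what remembering the locale could look like (Python,
purely illustrative: Subversion has no such feature as far as this
thread shows, and the marker file name and check_encoding function are
my own inventions), the encoding is recorded on first use and compared
on every later run:

```python
import codecs
import os

# Hypothetical file name, not something Subversion actually stores.
MARKER = "expected-encoding"

def check_encoding(admin_dir: str, current: str) -> bool:
    """Record `current` on first use; later return True only if it matches."""
    # Normalise aliases so that "UTF-8" and "utf8" compare equal.
    current = codecs.lookup(current).name
    marker = os.path.join(admin_dir, MARKER)
    if not os.path.exists(marker):
        with open(marker, "w", encoding="ascii") as f:
            f.write(current)
        return True
    with open(marker, encoding="ascii") as f:
        return f.read().strip() == current
```

A client could call it with locale.getpreferredencoding(False) as the
second argument and refuse to touch the working copy (with a clear
message) when it returns False.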

> I'd really, really strongly suggest not to make such a thing the
> default in Subversion.

Then fix Subversion.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Received on 2016-02-29 20:45:27 CET
