
Re: Evil UTF-8 Character in filename in repo causing issues on my wc

From: Stefan Sperling <stsp_at_elego.de>
Date: Wed, 15 Jun 2011 12:29:37 +0200

On Wed, Jun 15, 2011 at 01:39:30AM -0500, Ryan Schmidt wrote:
> I would clarify this by saying the problem is that Subversion assumes
> that a filename submitted in one version of UTF-8 encoding will always
> stay in that version of UTF-8 encoding, and on the HFS+ filesystem,
> used by Mac OS X, that assumption is not necessarily true. (It
> normalizes all UTF-8 filenames to decomposed form.) Subversion would
> happily allow you to create two filenames that humans would consider
> identical (one with UTF-8 entities composed, one with UTF-8 entities
> decomposed). So clearly that's a bug in Subversion (or possibly apr or
> apr-util); it should normalize UTF-8 strings before running
> comparisons. It also seems like a bug in Windows and Linux
> filesystems; I assume they also let you create multiple files whose
> names look identical (but differ only in the composition of their
> UTF-8 characters). Mac OS X's is the only filesystem I know of that
> has fixed this bug -- which therefore exposes the problem when
> collaborating between Mac OS X systems (which have the fix) and other
> systems (which do not).

Traditionally there was no encoding information associated with filenames
on UNIX systems. The OS was supposed to store the filename under whatever
name the application passes in. This of course stems from the fact that
the only encoding on original UNIX was ASCII, so there was no problem with
this approach back then.

Unicode, and its quirk of allowing the *same* character to be encoded
in *different* ways, came much later.
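
To make that quirk concrete, here is a minimal Python sketch using the
standard unicodedata module. The sample character is chosen purely for
illustration and does not come from the original report.

```python
import unicodedata

# One user-visible character, two different code point sequences.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The UTF-8 byte strings differ, so a byte-wise comparison sees two names.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
same_bytes = composed_bytes == decomposed_bytes

# Normalising converts one representation into the other.
nfd_matches = unicodedata.normalize("NFD", composed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == composed

print(composed_bytes, decomposed_bytes, same_bytes, nfd_matches, nfc_matches)
```

Both strings render as the same glyph, yet anything that treats a
filename as a plain byte string will consider them different.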

I think it is unfortunate that Apple broke with the concept that a
filename is just a string of bytes.
When they made this decision they probably considered that it might break
applications and decided that the applications would have to adjust.
But that is very, very hard for applications like Subversion which
need to guarantee backwards compatibility to a point where individual
bytes matter.

So what if two filenames look identical to the user?
As long as nobody was changing the underlying byte string, things were
working just fine.

However, I also agree that we would be in a much better spot now if
Subversion had been normalising UTF-8 strings from the start.
This was an oversight made when the project started out.
But I doubt Subversion is the only project that missed these subtle
details of the Unicode standard. From a software engineer's perspective,
it is *very* unnatural for an encoding standard to contain ambiguous
representations of the same data. So I would not outright blame folks
for this oversight and call it a "bug" in the application or the OS.
There are many ways to point fingers here, including the standards committee.
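
For illustration only, comparing names after normalisation could look
roughly like the Python sketch below. The helper same_name and the sample
filenames are hypothetical; this is not how Subversion is implemented,
and the choice of NFC as the default form is an assumption.

```python
import unicodedata

def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two names after Unicode normalisation."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The same visible filename in composed and decomposed form.
name_composed = "r\u00e9sum\u00e9.txt"
name_decomposed = "re\u0301sume\u0301.txt"

raw_equal = name_composed == name_decomposed
normalised_equal = same_name(name_composed, name_decomposed)

print(raw_equal, normalised_equal)
```

The raw comparison treats the two names as different, while the
normalised comparison treats them as equal. Retrofitting this onto
existing repositories is the hard part, because names already stored
would have to keep matching byte for byte.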
Received on 2011-06-15 12:30:19 CEST

This is an archived mail posted to the Subversion Users mailing list.
