
Re: Let's discuss about unicode compositions for filenames!

From: Peter Samuelson <peter_at_p12n.org>
Date: Tue, 31 Jan 2012 10:36:11 -0600

[reordering the conversation flow slightly]

  [Peter Samuelson]
> > That's the implementation I would like to see, to be honest. Start
> > with the observation that we can treat Mac OS X NFD paths as a
> > client character encoding. Now observe that it is lossy. But
> > ... almost all non-Unicode client charsets are equally lossy, for
> > exactly the same reason!

[Branko Cibej]
> I don't see what you mean by "lossy" though. NFD and NFC can
> represent exactly the same set of characters, it's just that the
> representations of some of them are different.

By "lossy" I just mean that if you convert to UTF-8 NFD, you can't
reliably convert _back_ to the original bytes. I'm assuming here that
we continue to do _no_ n11n (normalization) on the server side -
pathnames from libsvn_(ra|repos|fs) are just UTF-8 with unspecified
n11n. Thus, if the "client encoding" is UTF-8 NFD, you can't reliably
convert that to the "server encoding".

And this is also true of most legacy (non-Unicode) encodings: they know
nothing about Unicode's n11n forms, so they are "lossy" in the same
way: you can't reliably take a pathname in, e.g., ISO-8859-1, and
convert to the encoding found in the repository, because you don't know
the n11n form used by the original committer.
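A minimal sketch of that lossiness, using Python's standard unicodedata module. The helper names (to_client_nfd, to_client_latin1) are made up for illustration, and composing to NFC before encoding to ISO-8859-1 is an assumption about how such a conversion would behave, not a description of anything in libsvn:

```python
import unicodedata

# Two repository pathnames that render identically but differ in bytes.
composed = "caf\u00e9"      # NFC: single code point U+00E9
decomposed = "cafe\u0301"   # NFD: 'e' followed by combining acute U+0301


def to_client_nfd(server_name: str) -> bytes:
    """Hypothetical conversion to a UTF-8 NFD client encoding."""
    return unicodedata.normalize("NFD", server_name).encode("utf-8")


def to_client_latin1(server_name: str) -> bytes:
    """Hypothetical conversion to ISO-8859-1 (assumes composing first)."""
    return unicodedata.normalize("NFC", server_name).encode("iso-8859-1")


# The server-side byte strings are distinct...
server_bytes = {composed.encode("utf-8"), decomposed.encode("utf-8")}

# ...but each client encoding collapses them to a single byte string,
# so the client bytes alone cannot say which server form was committed.
nfd_bytes = {to_client_nfd(composed), to_client_nfd(decomposed)}
latin1_bytes = {to_client_latin1(composed), to_client_latin1(decomposed)}
```

Here server_bytes holds two entries while nfd_bytes and latin1_bytes hold one each, which is the sense of "lossy" used above.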

This is why I suggested the mapping table in wc.db.
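Roughly what such a table might look like, as a sketch only. wc.db is an SQLite database, but the table name (name_map), its columns, and both helper functions are hypothetical and do not reflect the actual wc.db schema:

```python
import sqlite3
import unicodedata

# In-memory stand-in for wc.db; the schema below is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_map ("
    " local_name TEXT PRIMARY KEY,"   # name as stored on the client
    " repos_name TEXT NOT NULL)"      # exact name received from the server
)


def record_checkout(repos_name: str) -> str:
    """Compute the local (NFD) name and remember the original."""
    local_name = unicodedata.normalize("NFD", repos_name)
    conn.execute(
        "INSERT INTO name_map (local_name, repos_name) VALUES (?, ?)",
        (local_name, repos_name),
    )
    return local_name


def repos_name_for(local_name: str) -> str:
    """Recover the server-side name; unmapped names pass through as-is."""
    row = conn.execute(
        "SELECT repos_name FROM name_map WHERE local_name = ?",
        (local_name,),
    ).fetchone()
    return row[0] if row else local_name
```

The point of the design is that the conversion only ever has to go one way (server to client); the reverse direction is a lookup, so it does not matter that the conversion itself is not invertible.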

Actually, the fact that the mapping table works around the inherent
lossiness of character encoding conversion suggests that it _could_, in
the future, also account for lossiness for other reasons. If we
wished, we could have libsvn_wc mangle checked-out filenames on
platforms with arbitrary limitations (e.g., escaping "<" and ":"
characters on Windows) using this same mechanism. Even if the
conversion is lossy, the mapping table in wc.db knows the original
filename. Of course you couldn't _create_ filenames that violate a
platform's limitations on that same platform, but being able to check
out the file at all is an improvement over today. Probably 'svn status'
would show some indication that a name has been mangled in a way users
would actually care about (i.e., not just NFC/NFD).
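A sketch of what the mangling and the "would users care" check could look like. The percent-style escape, the reserved-character set (an illustrative subset of what Windows rejects), and both function names are assumptions for the example; no such scheme exists in libsvn_wc today:

```python
import unicodedata

# Illustrative subset of characters Windows rejects in filenames.
WINDOWS_RESERVED = set('<>:"|?*')


def mangle(repos_name: str) -> tuple[str, bool]:
    """Escape reserved characters; report whether anything was escaped."""
    pieces = []
    escaped = False
    for ch in repos_name:
        if ch in WINDOWS_RESERVED:
            pieces.append("%{:02X}".format(ord(ch)))
            escaped = True
        else:
            pieces.append(ch)
    return "".join(pieces), escaped


def is_user_visible_mangling(repos_name: str, local_name: str) -> bool:
    """True if the names differ by more than Unicode normalization."""
    return (unicodedata.normalize("NFC", repos_name)
            != unicodedata.normalize("NFC", local_name))
```

The escape does not need to be reversible on its own, since the mapping table holds the original name. A status display would presumably consult something like is_user_visible_mangling so that pure NFC/NFD differences stay silent.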

> > The implementation on OS X might be a bit hairy, if there isn't an
> > easy way to retrieve the real pathname of the file you just
> > created. Anywhere else, we just store the pathname we just
> > calculated.

> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.

Right. The question is, if libsvn_wc tells OS X to store a given path,
with unknown n11n, is there an easy way to retrieve the pathname that
was _actually_ stored on disk? That's what I mean by "might be a bit
hairy". It sounds like the thing to do on OS X is for libsvn_wc to
pre-normalize to NFD before writing the file, and just assume the OS
will (re-)normalize to the same byte array.
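As a sketch of the pre-normalize approach, with an optional sanity check that reads the directory back. Both function names are hypothetical, and the check only confirms the assumption on whatever filesystem it happens to run on:

```python
import os
import tempfile
import unicodedata


def create_prenormalized(directory: str, repos_name: str) -> str:
    """Create an empty file under the NFD form of repos_name.

    Returns the name we predict the OS will store.
    """
    predicted = unicodedata.normalize("NFD", repos_name)
    with open(os.path.join(directory, predicted), "w"):
        pass
    return predicted


def stored_name_matches(directory: str, predicted: str) -> bool:
    """Check whether the directory listing contains exactly `predicted`."""
    return predicted in os.listdir(directory)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        name = create_prenormalized(tmp, "caf\u00e9.txt")
        print(stored_name_matches(tmp, name))
```

If the assumption holds, the predicted name can go straight into the mapping table without asking the OS what it actually stored.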

-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/
Received on 2012-01-31 17:36:53 CET

This is an archived mail posted to the Subversion Dev mailing list.