
Re: UTF-8 NFC/NFD paths issue

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: Mon, 20 Sep 2010 16:38:33 +0200

Sorry to have left the discussion running so long without contributing to it
myself. The reason I brought up changing the repository / fs is that it is
where we store the dataset that we'll need to support forever: working
copies get destroyed and checked out over and over every hour, every day.
Repositories get created once and only accumulate data.

> > That doesn't solve the historical revisions containing "bad" paths. My
> > understanding of the problem was that we'd go into the past and
> > rewrite the paths into a single, canonical form.
> >
> Agreed: an out-of-band solution fixes things historically too.

As pointed out on IRC, I think it's important to stop adding semantically
identical paths to a repository. From the perspective of efficiency, it might
be handy to have a normalized version stored somewhere for all paths living
in the repository, but to prevent the addition of differently encoded paths,
such a thing isn't really required: the encoding needed for the comparison
can be calculated when the check happens.
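
To illustrate what "semantically the same" means here, a minimal sketch in
Python rather than our C code. The helper name same_path() and the example
paths are made up for illustration; unicodedata.normalize() is the standard
library routine. Nothing normalized is stored: both sides are recoded at the
moment of comparison.

```python
import unicodedata

def same_path(a: str, b: str) -> bool:
    """Return True if two paths are canonically equivalent in Unicode terms."""
    # Normalize both sides at comparison time. NFC is an arbitrary choice
    # here; normalizing both sides to NFD would give the same answer.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfc_path = "trunk/caf\u00e9.txt"   # 'e with acute' as one precomposed code point
nfd_path = "trunk/cafe\u0301.txt"  # 'e' followed by a combining acute accent

print(nfc_path == nfd_path)           # False: the code point sequences differ
print(same_path(nfc_path, nfd_path))  # True: canonically equivalent
```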

> Having backend enforce NFC can wait for 2.0 I suppose :)

True, but the value of that might be limited: if we required all
communications to be NFC encoded, we would need to take additional measures -
as pointed out by Branko - to make things work on Mac OS X: currently, we have
Mac OS X shops happily working with non-ASCII characters in their paths, all
NFD encoded. That would change.

By the way, Julian Foad, Philip Martin, Bert Huijben and I talked through a
possible fix for the client-side issue, which becomes an option once we
switch to wc-ng. The full impact of that change still needs to be determined,
though, and it probably does not fit in the 1.7 timeline. If it turns out
that it does, we'll bring it up.

To recap, the change I'm proposing is that we check pathnames with
NFC/NFD-aware comparison routines upon add_file() / add_directory() inside
libsvn_repos or libsvn_fs_* - I suspect it's easier to handle inside the
latter. In my proposal, we don't specify a "repository normal" encoding. If
performance degrades too much, we can enhance the filesystem with a stored
normalized version of each path, which doesn't need to be recoded in order
to do the comparison with the incoming path.
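
A rough sketch of the check described above, again in Python and not meant
as the actual libsvn_fs_* code. ToyDirectory, DuplicatePathError and the
file names are hypothetical; only the idea is taken from the proposal: names
are stored exactly as received, and an add is refused when an equivalent
name already exists.

```python
import unicodedata

class DuplicatePathError(Exception):
    """Raised when an equivalent name already exists in the directory."""

class ToyDirectory:
    def __init__(self):
        # Names are kept exactly as received (NFC or NFD); there is no
        # "repository normal" encoding in this sketch.
        self.names = []

    def add_file(self, name):
        incoming = unicodedata.normalize("NFC", name)
        for existing in self.names:
            # The stored name is recoded on every check; no normalized
            # copy is kept anywhere.
            if unicodedata.normalize("NFC", existing) == incoming:
                raise DuplicatePathError(f"{name!r} collides with {existing!r}")
        self.names.append(name)

d = ToyDirectory()
d.add_file("cafe\u0301.txt")       # NFD spelling, as a Mac OS X client might send
try:
    d.add_file("caf\u00e9.txt")    # NFC spelling of the same name
except DuplicatePathError as err:
    print("rejected:", err)
print(d.names == ["cafe\u0301.txt"])  # True: the original NFD name is untouched
```

The loop recodes every stored name on each add, which is where the
performance concern comes from. Keeping a normalized key next to each stored
name would avoid that recoding without changing what clients see.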

Other than that, I don't think there's anything *required* to make us
Unicode-aware on the server. It's also the change I'm proposing that cmpilato
implement in libsvn_fs_base as a proof of concept.

This proposal says nothing about the client side. The client side can be
fixed independently from the server side, given that we can't switch to
normalized paths in the protocol until 2.0: whatever paths a server sends,
the client will need to use those to communicate back to the server.


Received on 2010-09-20 16:39:14 CEST

This is an archived mail posted to the Subversion Dev mailing list.