
Re: UTF-8 NFC/NFD paths issue

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Fri, 17 Sep 2010 01:26:14 +0200

Greg Stein wrote on Thu, Sep 16, 2010 at 00:59:59 -0400:
> On Wed, Sep 15, 2010 at 23:35, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> > Erik Huelsmann wrote on Wed, Sep 15, 2010 at 23:20:06 +0200:
> >> Yesterday, I was talking to CMike about our long-standing issue with UTF-8
> >> strings designating a certain path not necessarily being equal to other
> >> strings designating the same path. The issue has to do with NFC (composed)
> >> and NFD (decomposed) representation of Unicode characters. CMike nicely
> >> called the issue the "Erik Huelsmann issue" yesterday :-)
> >>
> >> The issue consists of two parts:
> >> 1. The repository, which should determine that paths being added by a commit
> >> are unique, regardless of their normalization form (NFC/NFD)
> >
> > Will you assume that all paths in the repository have been
> > Unicode-canonicalized prior to entering the repository?
> >
> > If yes, then we infer that no two in-repository paths (which are
> > bytewise different) canonicalize to the same byte sequence. That is a
> > pretty useful precondition to have; i.e., what /can/ svn do on a legacy
> > repository where some two paths are bytewise-different yet Unicode-equal?
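
To make the legacy-repository question concrete, here is a minimal sketch of how one might detect paths that are bytewise-different yet Unicode-equal. It is not Subversion code: the function name find_unicode_collisions and the sample paths are hypothetical, and it assumes NFC is the form chosen for comparison.

```python
import unicodedata
from collections import defaultdict


def find_unicode_collisions(paths):
    """Group paths that differ as strings but are equal after NFC normalization.

    Hypothetical helper for illustration; returns {nfc_form: [variants...]}
    only for groups with more than one distinct spelling.
    """
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].add(path)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# The same visible name in composed (NFC) and decomposed (NFD) form.
composed = "caf\u00e9.txt"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301.txt"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The UTF-8 byte sequences differ even though the names are canonically equal.
assert composed.encode("utf-8") != decomposed.encode("utf-8")

collisions = find_unicode_collisions([composed, decomposed, "README"])
```

If such a scan over an existing repository came back empty, the "all paths are already canonicalized" precondition would at least not be contradicted by collisions; a non-empty result marks exactly the paths the question above is about.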

(I assume you're replying to my second paragraph)

> This will be *very* difficult to manage. Even if a given repository
> somehow manages to rewrite history to "fix" the paths, then you may
> have an unknown number of downstream synchronized repositories to
> similarly fix.
>
> I think an answer might be to rely on the upcoming obliterate
> feature's "out of band" change descriptions. For example, a repository
> might tell a working copy, "hey: file XYZ was obliterated since you
> last talked to me. if you happen to have it, then get rid of it. I
> won't recognize it henceforth." You can see a similar descriptor sent
> to working copies or repositories that says "I recoded XYZ. update to
> the new encoding."

I don't see why this needs to be special-cased. The server can simply
send "rename(NFD(), NFC())" and the wc library can figure out for itself
that it's inoperative for her in the same place she determines that
"rename('foo','FOO')" is inoperative for her (when the filesystem is

> These change descriptors are effectively annotations that occur
> outside the standard revision history of a repository. We could use
> them to transmit path-encoding changes, along with obliteration
> notices.
>
> Cheers,
> -g
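
The "rename(NFD(), NFC())" idea in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: rename_is_inoperative and its two flags are hypothetical names, not part of the Subversion wc library, and the flags stand in for whatever the client knows about its local filesystem.

```python
import unicodedata


def rename_is_inoperative(old, new, *, case_insensitive=False,
                          normalizes_unicode=False):
    """Return True if the local filesystem would treat old and new as one name.

    Hypothetical helper: both keyword flags are assumptions about the
    filesystem, supplied by the caller.
    """
    def fold(name):
        # Apply the same folding the filesystem is assumed to apply.
        if normalizes_unicode:
            name = unicodedata.normalize("NFC", name)
        if case_insensitive:
            name = name.casefold()
        return name

    return fold(old) == fold(new)


nfc_name = unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt")
nfd_name = unicodedata.normalize("NFD", nfc_name)

# On a filesystem that keeps NFC and NFD names distinct, the rename is real work.
needs_rename = not rename_is_inoperative(nfd_name, nfc_name)
# On one that normalizes names, the same rename has nothing to do.
is_noop = rename_is_inoperative(nfd_name, nfc_name, normalizes_unicode=True)
```

The point of the design is that the case-only rename and the normalization-only rename go through one check, so the server needs no special message for either.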
Received on 2010-09-17 00:27:46 CEST

This is an archived mail posted to the Subversion Dev mailing list.