
Re: UTF-8 NFC/NFD paths issue

From: Greg Stein <gstein_at_gmail.com>
Date: Thu, 16 Sep 2010 00:59:59 -0400

On Wed, Sep 15, 2010 at 23:35, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> Erik Huelsmann wrote on Wed, Sep 15, 2010 at 23:20:06 +0200:
>> Yesterday, I was talking to CMike about our long-standing issue with UTF-8
>> strings designating a certain path not necessarily being equal to other
>> strings designating the same path. The issue has to do with NFC (composed)
>> and NFD (decomposed) representation of Unicode characters. CMike nicely
>> called the issue the "Erik Huelsmann issue" yesterday :-)
>> The issue consists of two parts:
>>  1. The repository which should determine that paths being added by a commit
>> are unique, regardless of their encoding (NFC/NFD)
> Will you assume that all paths in the repository have been
> Unicode-canonicalized prior to entering the repository?
> If yes, then we infer that no two in-repository paths (which are
> bytewise different) canonicalize to the same byte sequence.  Which is
> a pretty useful precondition to have, i.e., what /can/ svn do on a legacy
> repository where some two paths are bytewise-different yet Unicode-equal?
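
To make the quoted scenario concrete, here is a minimal sketch of spotting
paths that are bytewise different yet Unicode-equal. It uses Python's standard
unicodedata module purely for illustration; the helper name
find_normalization_collisions and the sample paths are made up for this
example and are not Subversion code.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(paths):
    """Group paths that differ bytewise but are equal after NFC normalization.

    Hypothetical helper for illustration; not part of Subversion.
    Returns a dict mapping each NFC form to the distinct original spellings,
    keeping only the forms that have more than one spelling.
    """
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].add(path)
    return {
        nfc_form: sorted(spellings)
        for nfc_form, spellings in groups.items()
        if len(spellings) > 1
    }


# Two spellings of the same visible name (sample data, assumed):
composed = "caf\u00e9.txt"     # "é" as a single code point (NFC)
decomposed = "cafe\u0301.txt"  # "e" followed by a combining acute accent (NFD)

collisions = find_normalization_collisions([composed, decomposed, "README"])
```

On a legacy repository, a non-empty result from a scan like this is exactly
the case asked about above: neither spelling can be silently preferred without
touching history.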

This will be *very* difficult to manage. Even if a given repository
somehow manages to rewrite history to "fix" the paths, then you may
have an unknown number of downstream synchronized repositories to
similarly fix.

I think an answer might be to rely on the upcoming obliterate
feature's "out of band" change descriptions. For example, a repository
might tell a working copy, "hey: file XYZ was obliterated since you
last talked to me. if you happen to have it, then get rid of it. I
won't recognize it henceforth." You can imagine a similar descriptor sent
to working copies or repositories that says "I recoded XYZ. update to
the new encoding."
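
As a very rough sketch of what a receiver might do with such descriptors
(everything here is hypothetical: the ChangeDescriptor type, the "obliterated"
and "recoded" kinds, and the dict standing in for a working copy are
assumptions for illustration, not an existing or planned Subversion API):

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeDescriptor:
    """Hypothetical out-of-band notice; field names are assumptions."""
    kind: str           # "obliterated" or "recoded"
    path: str           # the path as the receiver currently knows it
    new_path: str = ""  # only used when kind == "recoded"


def apply_descriptors(working_copy, descriptors):
    """Return a new path -> content mapping with the descriptors applied."""
    result = dict(working_copy)
    for d in descriptors:
        if d.kind not in ("obliterated", "recoded"):
            raise ValueError(f"unknown descriptor kind: {d.kind!r}")
        if d.path not in result:
            continue  # "if you happen to have it": otherwise nothing to do
        if d.kind == "obliterated":
            del result[d.path]
        else:
            # Same content, re-keyed under the new path encoding.
            result[d.new_path] = result.pop(d.path)
    return result


# Sample data (assumed): one NFD-encoded path and one file to be dropped.
nfd_path = "cafe\u0301.txt"
working_copy = {nfd_path: b"menu", "XYZ": b"old"}
descriptors = [
    ChangeDescriptor("obliterated", "XYZ"),
    ChangeDescriptor("recoded", nfd_path, unicodedata.normalize("NFC", nfd_path)),
]
updated = apply_descriptors(working_copy, descriptors)
```

The point of the sketch is only that both kinds of notice are applied outside
the revision stream: the receiver's content stays the same, and only which
paths it recognizes changes.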

These change descriptors are effectively annotations that occur
outside the standard revision history of a repository. We could use
them to transmit path-encoding changes, along with obliteration

Received on 2010-09-16 07:00:36 CEST

This is an archived mail posted to the Subversion Dev mailing list.