Re: [RFC] Non-normalizing Unicode Composition Awareness
From: Thomas Åkesson <thomas_at_akesson.cc>
Date: Fri, 9 Nov 2012 12:28:58 +0100
Revisiting this thread after a few months. Last spring, I did some work in the Wiki designing a proposal for resolving the Mac Unicode issues in a non-normalizing manner. I ran out of time, but the thought process has been ongoing.
A couple of weeks ago at Subversion Live in London, I had the opportunity to discuss this with a number of people. Although opinions on the matter differed somewhat, I think we concluded that we are actually relatively well aligned on the core idea.
The proposal I drafted this spring (in the Wiki) suggested adding a couple of columns to the WC in order to store normalized paths. Over the past couple of months, the concept of using an SQLite collation has seemed more appealing. Last week, I did a test with the SQLite ICU extension (available in the SQLite source repository), which turned out to be quite encouraging. With such a collation, it is possible to perform equality comparisons in SQL statements that match paths in a Unicode composition-aware manner and therefore return rows regardless of which composition the paths have.
This would be very useful, for instance, when we are given a filesystem path and need to locate the corresponding node in wc.db. That is basically half the issue with Mac working copies.
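To make the idea concrete, here is a minimal sketch of a composition-aware collation, written in Python with the standard sqlite3 and unicodedata modules rather than the ICU extension or utf8proc. The collation name unicode_nfc, the helper functions, and the one-column nodes table are stand-ins I made up for illustration; this is not the real wc.db schema or the code in the branch. It only shows that the stored path is left untouched while an equality test still matches a differently composed form of the same name.

```python
import sqlite3
import unicodedata


def compose_aware(a: str, b: str) -> int:
    """Collation callback: compare two strings by their NFC forms.

    Returns -1, 0 or 1, as SQLite expects. Only the comparison is
    normalized; the stored values themselves are never rewritten.
    """
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


def find_node(conn: sqlite3.Connection, path: str) -> list:
    """Return stored paths equal to `path` under the hypothetical collation."""
    rows = conn.execute(
        "SELECT local_relpath FROM nodes "
        "WHERE local_relpath = ? COLLATE unicode_nfc",
        (path,),
    ).fetchall()
    return [row[0] for row in rows]


def find_node_bytewise(conn: sqlite3.Connection, path: str) -> list:
    """Same lookup with SQLite's default binary comparison, for contrast."""
    rows = conn.execute(
        "SELECT local_relpath FROM nodes WHERE local_relpath = ?",
        (path,),
    ).fetchall()
    return [row[0] for row in rows]


# Toy stand-in for the working copy database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.create_collation("unicode_nfc", compose_aware)
conn.execute("CREATE TABLE nodes (local_relpath TEXT)")

# The repository path is stored precomposed (U+00C5).
stored = "\u00c5kesson.txt"
conn.execute("INSERT INTO nodes VALUES (?)", (stored,))

# A Mac filesystem would typically hand back the decomposed form (A + U+030A).
from_disk = "A\u030akesson.txt"
```

With this setup, find_node(conn, from_disk) returns the row in its original stored form, whereas find_node_bytewise(conn, from_disk) finds nothing. A real implementation would register the collation from C and would presumably also declare it on the column so that indexes can be used.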
Today, I noticed that Branko started some implementation in a branch. Looks like a collation based on utf8proc is in the making? I think that would make a lot of sense, because the ICU extension poses some challenges in the build process and we might not need all the functionality it provides.
I started a wiki page about Unicode collation. I will append more info:
Also note the tiny test repo attached to:
This is an archived mail posted to the Subversion Dev mailing list.