Re: [RFC] Non-normalizing Unicode Composition Awareness

From: Thomas Åkesson <thomas_at_akesson.cc>
Date: Fri, 9 Nov 2012 12:28:58 +0100

Revisiting this thread after a few months. Last spring, I did some work in the Wiki on a proposal for resolving the Mac Unicode issues in a non-normalizing manner. I ran out of time, but the thought process has been ongoing.

A couple of weeks ago at Subversion Live in London, I had the opportunity to discuss this with a number of people. Although opinions on the matter differed somewhat, I think we concluded that we are actually relatively well aligned on the core idea.

The proposal I drafted this spring (in the Wiki) suggested adding a couple of columns to the WC database in order to store normalized paths. Over the past couple of months, the concept of using a Sqlite collation has come to seem more appealing. Last week, I did a test with the Sqlite ICU extension (available in the sqlite source repository), which turned out to be quite encouraging. With such a collation, it is possible to perform equality comparisons in SQL statements that match paths in a Unicode composition aware manner and therefore return rows regardless of which composition the paths have.
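To illustrate the idea, here is a minimal sketch in Python rather than C. It does not use the ICU extension or utf8proc; it registers a custom collation that compares NFC-normalized copies of its operands, which is only one possible way to get composition-aware equality. The collation name COMPOSITION_AWARE and the one-column table are assumptions made for the example and do not reflect the real wc.db schema.

```python
import sqlite3
import unicodedata


def composition_aware(a: str, b: str) -> int:
    """Compare two strings on their NFC forms; return -1, 0 or 1."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


conn = sqlite3.connect(":memory:")
# Hypothetical collation name, chosen for this example only.
conn.create_collation("COMPOSITION_AWARE", composition_aware)

# Simplified stand-in table; the real wc.db schema is much larger.
conn.execute(
    "CREATE TABLE nodes (local_relpath TEXT COLLATE COMPOSITION_AWARE)"
)

# The path is stored exactly as given (precomposed here), not normalized.
nfc_path = "dir/\u00c5.txt"
conn.execute("INSERT INTO nodes VALUES (?)", (nfc_path,))

# Look it up with the decomposed spelling: 'A' + COMBINING RING ABOVE.
nfd_path = "dir/A\u030a.txt"
rows = conn.execute(
    "SELECT local_relpath FROM nodes WHERE local_relpath = ?",
    (nfd_path,),
).fetchall()
```

The stored value is returned unchanged, in whatever composition it was inserted with, which is the non-normalizing property: only the comparison is composition aware, not the data.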

This would be very useful, for instance, when attempting to locate the node in wc.db that corresponds to a given filesystem path. That is basically half the issue with Mac working copies.
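For readers who have not hit the problem, the mismatch can be shown in a few lines of Python. This is only an illustration, on the assumption that the repository holds the precomposed spelling while the Mac filesystem hands back the decomposed one; the helper same_path is hypothetical and not part of any Subversion API.

```python
import unicodedata

# Precomposed form: a single code point, U+00C5.
composed = "\u00c5"
# Decomposed form: 'A' followed by U+030A COMBINING RING ABOVE.
decomposed = unicodedata.normalize("NFD", composed)

# The two spellings render the same but differ as strings and as bytes,
# so a plain byte-wise lookup with one will not find the other.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")


def same_path(a: str, b: str) -> bool:
    """Hypothetical helper: equality that ignores composition differences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


plain_equal = composed == decomposed
aware_equal = same_path(composed, decomposed)
```

Here plain_equal is False while aware_equal is True, which is the gap a composition-aware lookup would close.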

Today, I noticed that Branko started some implementation work in a branch. It looks like a collation based on utf8proc is in the making? I think that would make a lot of sense, because the ICU extension poses some challenges in the build process and we might not need all the functionality it provides.

I started a wiki page about Unicode collation and will add more info there:
http://wiki.apache.org/subversion/UnicodeCollation

Also note the tiny test repo attached to:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

Cheers,
Thomas Å.

Received on 2012-11-09 12:29:36 CET
