Re: [RFC] Non-normalizing Unicode Composition Awareness

From: Thomas Åkesson <thomas_at_akesson.cc>
Date: Mon, 23 Apr 2012 16:11:49 +0200

Hi Philip,

Thanks for your comments in the wiki article. They raised some important points and potentially an idea that might simplify the solution.

> All three paths are in UTF-8 but NFC/NFD is not currently specified. local_relpath/parent_relpath get converted from UTF-8 to whatever locale encoding is in use whenever they are used to access the filesystem.

This is not unlike what we need to do for HFS+. We could consider UTF8-MAC to be a distinct encoding. There is the major caveat that this conversion is irreversible (since the normalization is not specified in the repo/wc.db).
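
To illustrate the caveat, here is a minimal Python sketch. It is not Subversion code, the file names are made up, and `unicodedata.normalize("NFD", ...)` only approximates HFS+, which uses its own variant of decomposition. Two names that differ only in composition collapse to the same normalized string, so the original form cannot be recovered from the normalized one alone.

```python
import unicodedata

# Hypothetical file names: the same visible text in two compositions.
composed = "\u00c5kesson.txt"      # "Å" as one precomposed code point (NFC)
decomposed = "A\u030akesson.txt"   # "A" followed by COMBINING RING ABOVE (NFD)


def to_nfd(name: str) -> str:
    """Approximate the normalization a filesystem such as HFS+ applies."""
    return unicodedata.normalize("NFD", name)


# The two names differ as code point sequences (and as UTF-8 bytes) ...
distinct_before = composed != decomposed
# ... but they are indistinguishable once normalized.
same_after = to_nfd(composed) == to_nfd(decomposed)

# Re-composing does not help: NFC of the normalized name always yields the
# composed form, even when the repository stored the decomposed one.
roundtrip = unicodedata.normalize("NFC", to_nfd(decomposed))
recovered_original = roundtrip == decomposed

print(distinct_before, same_after, recovered_original)
```

Without a record of which form the repository used, nothing in the normalized name says which of the two inputs it came from.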

If you, or someone else with WC insight, could provide some details on when/how conversions in the opposite direction are performed (e.g. svn stat and most commands taking path arguments), that would be incredibly useful to me. I would like to explore the option to somehow work around the "irreversible problem".
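
As a thought experiment for such a workaround, here is a rough Python sketch. It is an illustration only: `PathMap`, `record` and `repo_path_for` are hypothetical names, not existing Subversion APIs, and NFD again stands in for what the filesystem reports. If the repository form is recorded next to the on-disk form when a node is created, the reverse direction becomes a lookup rather than a conversion.

```python
import unicodedata
from typing import Dict, Optional


class PathMap:
    """Hypothetical table pairing on-disk names with repository names."""

    def __init__(self) -> None:
        self._by_disk: Dict[str, str] = {}

    def record(self, repo_path: str) -> str:
        """Store repo_path and return the name an NFD filesystem would report."""
        disk_path = unicodedata.normalize("NFD", repo_path)
        existing = self._by_disk.get(disk_path)
        if existing is not None and existing != repo_path:
            # Two repository paths differing only by composition collide on disk.
            raise ValueError(
                "composition collision: %r vs %r" % (existing, repo_path)
            )
        self._by_disk[disk_path] = repo_path
        return disk_path

    def repo_path_for(self, disk_path: str) -> Optional[str]:
        """Return the exact repository form, or None for an unknown name."""
        return self._by_disk.get(unicodedata.normalize("NFD", disk_path))


# Usage: the repository stores an NFC name, the filesystem hands back NFD.
paths = PathMap()
repo_name = "r\u00e4ksm\u00f6rg\u00e5s.txt"
disk_name = paths.record(repo_name)
looked_up = paths.repo_path_for(disk_name)
```

The collision check shows where two repository paths that differ only by composition would have to be handled explicitly.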

It would also be useful if someone could point me to where in the WC code the conversion from UTF-8 to locale encoding is performed.

Thanks!

/Thomas Å.

On 17 apr 2012, at 05:24, Thomas Åkesson wrote:

> Hi,
> A bit of a status update on the wiki article:
> http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness
>
> Received some comments from Daniel, which I have tried to address. Thanks.
>
> I have written a bash script which demonstrates the concept of "Alternative 1" with regard to how the local_relpath column is handled by checkout/update.
>
> From the wiki:
> ---
> This alternative can be simulated using the attached script localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout should produce if this alternative was implemented in Subversion itself:
>
> svn co ...
> svn stat #Shows any problematic items as missing and unversioned
> localrelpath2nfd.sh
> svn stat #Should be clean apart from misperception that some items are switched
> ---
>
> This script can be used to investigate how other subcommands are affected and determine what needs to be done. It is possible to make commits but updates to normalisation-dependent nodes will fail since this script is not inside the update code.
>
> I intend to use this script to take the design to the next level of detail. First, I would like some feedback from people with in-depth knowledge of the WC and preferably get some idea on what the community thinks about the approach.
>
> /Thomas Å.
>
>
> On 26 mar 2012, at 04:14, Thomas Åkesson <thomas_at_akesson.cc> wrote:
>
>> Hi,
>> Sorry about the delay, had a release to sort out...
>>
>> I have moved the proposal into the wiki:
>> http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness
>>
>> The comments from Julian and Markus have been implemented and I have added more information to the "Client Changes" section as well as more structure and TODO-notes.
>>
>> I would really appreciate it if someone with more insight into WC-NG could provide input on some of the TODO items (or things that have been completely overlooked).
>>
>> Thanks,
>> Thomas Å.
>>
>>
>> On 21 feb 2012, at 09:55, Daniel Shahaf wrote:
>>
>>> I've granted you write access to the wiki.
>>>
>>> Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
>>>> Thanks Julian and Markus for providing feedback.
>>>>
>>>> I am not commenting below because all the feedback is very good and I will try to address it as best I can in the next iteration. Describing the behaviour changes to the WC is the most challenging since I lack that kind of detailed knowledge. I will instead try to draft the structure of that section to make it easier for someone with that level of detail to assist.
>>>>
>>>> Regarding use cases, what can I say... it was towards the end of a long stretch.
>>>>
>>>> I think it would help with the upcoming iterations if I could move this "document" into the wiki. If you find that this first draft shows promise, please consider granting edit access in the wiki. My user name is "Thomas Åkesson", which exercises the Unicode awareness of MoinMoin...
>>>>
>>>> /Thomas Å.
>>>>
>>>>
>>>> On 14 feb 2012, at 11:25, Julian Foad wrote:
>>>>
>>>>> Hi Thomas. It's fantastic that you're taking the trouble to write up this proposal. That's just what we need. Just a few initial comments below...
>>>>>
>>>>> Thomas Åkesson wrote:
>>>>>
>>>>>> Context
>>>>>> ===
>>>>>>
>>>>>> [...] A unicode string (e.g. a file name) can be represented
>>>>>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>>>>>> characters where some are composed and others decomposed (rare).
>>>>>
>>>>>
>>>>> What's "rare"? We have to assume that input is in mixed composition in any system that doesn't explicitly normalize it, which (I think) includes most operating systems. While it may be rare for any single string to contain characters in both compositions, it is very common to be processing a string that *might* have characters in both compositions -- in other words, that is not guaranteed to be normalized. I think it would be clearer to drop the "(rare)" and just say "... normalized forms (NFC/NFD) or mixed (not normalized).".
>>>>>
>>>>>
>>>>>> A minority of file systems (currently Mac OS X HFS+ only) will
>>>>>> normalize the paths. In the case of HFS+, the path will be
>>>>>> normalized into NFD and it will even be given back that way when
>>>>>> listing the filesystem.
>>>>>
>>>>>
>>>>> Drop the word "even"? The statement is not surprising.
>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> Similarities to case-sensitivity
>>>>>> ===
>>>>>>
>>>>>> - If two Unicode strings differ only by letter case/composition,
>>>>>
>>>>> Drop "/composition" -- it's the subject of the following sentence.
>>>>>
>>>>>> on some computer systems they refer to the same file, while on
>>>>>> other systems they refer to different files. The same applies
>>>>>> if two Unicode strings differ only by composition.
>>>>>
>>>>>
>>>>>> [...]
>>>>>
>>>>>> Client Changes
>>>>>> ===
>>>>>>
>>>>>> [...] An abstraction between the repository path and the file
>>>>>> system path can be achieved by ensuring that there is a column
>>>>>> in wc.db that contains the file system path in exactly the same
>>>>>> form that the file system gives back. APIs in wc need to be
>>>>>> extended to ensure that all interaction with the file system is
>>>>>> performed with the file system path.
>>>>>
>>>>> [...]
>>>>>
>>>>> This part seems to be the heart of the whole proposal. You describe the data that we need, but the behaviour will also need to be described in detail. Presumably much of the behaviour is boring and obvious (when we check out a new path and create it on disk, we store the disk path), but I'm sure there will be some less obvious parts (do we need to find out what the disk path of an 'excluded' node would be, even though we're not actually creating it on disk, for example).
>>>>>
>>>>>
>>>>>> Use Cases
>>>>>> ===
>>>>>>
>>>>>> This change will only affect use cases which rely on creating
>>>>>> paths that look like duplicates but use different unicode
>>>>>> composition. It is highly unlikely anyone is relying on this.
>>>>>
>>>>>
>>>>> Uh... it sounds like you are saying there are no interesting use cases for this proposal! No, on the contrary, this proposal also affects checking out and using a WC on Mac HFS+ where the repository paths were created on another system and are not in NFD, and it allows that case to work. That's the more interesting use case, is it not? It's definitely worth writing out the interesting case in full, including steps like checkout (or update) that brings in a non-NFD path, create a new file on the Mac, and commit.
>>>>>
>>>>> - Julian
>>>>>
>>>>
>>
>
Received on 2012-04-23 16:12:28 CEST
