On 1/23/2012 10:38 AM, Philip Martin wrote:
> Garret Wilson<garret_at_globalmentor.com> writes:
>
>> On 1/23/2012 9:55 AM, Philip Martin wrote:
>>> I thought you were proposing to write the code?
>> I'm fine with that as well. Looks like I would have to add a few lines
>> to decode UTF-8 (surely such code already exists in the Subversion
>> codebase somewhere) and change a few if(...){} statements. I should be
>> able to handle that. I would imagine it will take more effort on my
>> part to get permission to change the code than actually writing the
>> code itself.
> The function receives a string of bytes, I think it's already in UTF-8.
> The problem is that while Subversion has functions to validate UTF-8 it
> doesn't have a system for extracting individual UTF-8 code points. At
> present it only ever needs to extract the ASCII subset which is trivial.
Ah. Well, like I said---I would be happy to write the UTF-8 extraction
code. It would be worth it to me to get this functionality in; it would
be a fun exercise for me; it would be a good introduction to the
codebase for me; it's a small (very small), low-risk task; and the
Subversion codebase would be better off in the end. (I'm sure it can be
used elsewhere.) It's a win-win for everyone! :D
This is really a small thing. Here's an example in just a few lines:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Or see DecodeUTF8BytesToChar at
tidy.sourceforge.net/cgi-bin/lxr/source/src/utf8.c .
I would even be happy to exclude code points from the supplementary
planes (i.e. those above U+FFFF), if anyone is worried about the code
being too complicated.
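
To give a rough idea of the size of the task, here is a minimal sketch
of extracting one code point from a UTF-8 byte string. The function name
utf8_next_code_point and its signature are made up for illustration;
this is not existing Subversion code and it does not follow any
particular Subversion API convention. It assumes the caller passes the
number of bytes remaining, and it rejects truncated, overlong, and
surrogate sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an existing Subversion function.
 * Decodes one code point from the UTF-8 bytes at SRC, of which LEN bytes
 * are available.  On success stores the code point in *CP and returns the
 * number of bytes consumed (1 to 4).  Returns 0 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF. */
static size_t
utf8_next_code_point(const unsigned char *src, size_t len, uint32_t *cp)
{
  /* Smallest code point that legitimately needs N bytes, indexed by N. */
  static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t need, i;
  uint32_t c;

  assert(src != NULL && cp != NULL);

  if (len == 0)
    return 0;

  if (src[0] < 0x80)
    {
      /* The ASCII subset: one byte, the value is the code point. */
      *cp = src[0];
      return 1;
    }
  else if ((src[0] & 0xE0) == 0xC0)
    { need = 2; c = src[0] & 0x1F; }
  else if ((src[0] & 0xF0) == 0xE0)
    { need = 3; c = src[0] & 0x0F; }
  else if ((src[0] & 0xF8) == 0xF0)
    { need = 4; c = src[0] & 0x07; }   /* supplementary planes */
  else
    return 0;   /* stray continuation byte or invalid lead byte */

  if (len < need)
    return 0;   /* sequence is cut off */

  for (i = 1; i < need; i++)
    {
      if ((src[i] & 0xC0) != 0x80)
        return 0;   /* not a continuation byte */
      c = (c << 6) | (src[i] & 0x3F);
    }

  /* Reject overlong encodings, values past U+10FFFF, and surrogates. */
  if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return need;
}
```

Removing the four-byte branch would limit this sketch to code points up
to U+FFFF, which is the simplification mentioned above.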
G
Received on 2012-01-23 19:58:00 CET