Re: UTF-8 problem: non-UTF-8 in a UTF-8 locale

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2004-02-06 16:11:38 CET

Florian Weimer <fw@deneb.enyo.de> writes:

> Philip Martin wrote:
>
>> UTF-8 is defined by RFC2279, but it appears the GNU iconv uses the
>> more restrictive rules defined by Unicode, such as found in section
>> 3.9 of http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
>
> RFC 2279 has been superseded by RFC 3629, which contains basically the
> same rules.
>
> I agree that such checks are necessary to prevent repository corruption.
> I'm not sure if your checks are sufficient, though; do you handle
> surrogate pairs and other invalid UTF-8 sequences, too (apart from
> overlong UTF-8 sequences)?

My code implements the "well-formed UTF-8 byte sequences" rules defined
by Unicode. These are the same as the rules in section 4 of RFC 3629,
and that section of the RFC states "The authoritative definition of
UTF-8 is in [UNICODE]". My code will detect anything that doesn't
conform to these rules; that includes overlong (non-minimal) sequences
and incomplete sequences. Since the same rules exclude the surrogate
range U+D800..U+DFFF, encoded surrogates are rejected as well.
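
As a rough illustration only (this is not the Subversion code; the
names _RULES and is_well_formed_utf8 are made up for this sketch), the
byte ranges from section 4 of RFC 3629 can be checked with a small
table. The sketch is in Python for brevity and assumes the whole input
is available as a single bytes object:

```python
# Hypothetical sketch, not the Subversion implementation.
# Each rule: (lead byte range, allowed range for the second byte,
# number of trailing bytes). Ranges follow RFC 3629 section 4.
_RULES = [
    ((0x00, 0x7F), None, 0),
    ((0xC2, 0xDF), (0x80, 0xBF), 1),   # C0 and C1 would be overlong
    ((0xE0, 0xE0), (0xA0, 0xBF), 2),   # excludes overlong 3-byte forms
    ((0xE1, 0xEC), (0x80, 0xBF), 2),
    ((0xED, 0xED), (0x80, 0x9F), 2),   # excludes surrogates D800..DFFF
    ((0xEE, 0xEF), (0x80, 0xBF), 2),
    ((0xF0, 0xF0), (0x90, 0xBF), 3),   # excludes overlong 4-byte forms
    ((0xF1, 0xF3), (0x80, 0xBF), 3),
    ((0xF4, 0xF4), (0x80, 0x8F), 3),   # excludes values above U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for (lo, hi), second, trail in _RULES:
            if lo <= lead <= hi:
                break
        else:
            # Stray continuation byte, C0, C1, or F5..FF.
            return False
        if i + trail >= n:
            # Incomplete sequence at the end of the input.
            return False
        if trail >= 1 and not (second[0] <= data[i + 1] <= second[1]):
            return False
        # Any remaining trailing bytes must be plain continuation bytes.
        for j in range(i + 2, i + trail + 1):
            if not (0x80 <= data[j] <= 0xBF):
                return False
        i += trail + 1
    return True
```

With this sketch, b"\xc0\x80" (overlong), b"\xed\xa0\x80" (an encoded
surrogate) and b"\xe2\x82" (incomplete) are all rejected, while
ordinary UTF-8 text is accepted.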

-- 
Philip Martin

Received on Fri Feb 6 17:12:40 2004
