
Re: UTF-8 problem: non-UTF-8 in a UTF-8 locale

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2004-02-06 16:11:38 CET

Florian Weimer <fw@deneb.enyo.de> writes:

> Philip Martin wrote:
>
>> UTF-8 is defined by RFC2279, but it appears the GNU iconv uses the
>> more restrictive rules defined by Unicode, such as found in section
>> 3.9 of http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html
>
> RFC 2279 has been superseded by RFC 3629, which contains basically the
> same rules.
>
> I agree that such checks are necessary to prevent repository corruption.
> I'm not sure if your checks are sufficient, though; do you handle
> surrogate pairs and other invalid UTF-8 sequences, too (apart from
> overlong UTF-8 sequences)?

My code implements the "well-formed UTF-8 byte sequence" rules defined
by Unicode. These are the same as the rules in section 4 of RFC 3629,
and that section of the RFC states "The authoritative definition of
UTF-8 is in [UNICODE]". My code will detect anything that doesn't
conform to these rules; that includes overlong sequences, incomplete
sequences and non-minimal sequences.
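
For illustration only, here is a minimal Python sketch of a checker
that follows the UTF-8 ABNF in section 4 of RFC 3629. It is not the
code discussed in this thread; the function name is_well_formed_utf8
is hypothetical. It assumes the whole input is available as a single
bytes object, and it shows how restricting the first trailing byte
rejects overlong forms, encoded surrogates and values above U+10FFFF.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data matches the UTF-8 ABNF of RFC 3629, section 4.

    Hypothetical helper for illustration, not Subversion's implementation.
    """
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:            # UTF8-1: plain ASCII
            i += 1
            continue
        # For each lead byte: number of trailing bytes, and the range
        # allowed for the FIRST trailing byte (the rest are 0x80..0xBF).
        if 0xC2 <= lead <= 0xDF:
            tail, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            tail, lo, hi = 2, 0xA0, 0xBF    # excludes overlong 3-byte forms
        elif lead == 0xED:
            tail, lo, hi = 2, 0x80, 0x9F    # excludes surrogates U+D800..U+DFFF
        elif 0xE1 <= lead <= 0xEF:
            tail, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            tail, lo, hi = 3, 0x90, 0xBF    # excludes overlong 4-byte forms
        elif 0xF1 <= lead <= 0xF3:
            tail, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            tail, lo, hi = 3, 0x80, 0x8F    # excludes values above U+10FFFF
        else:
            return False            # 0x80..0xC1 and 0xF5..0xFF never lead
        if i + tail >= n:
            return False            # incomplete sequence at end of input
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + tail + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += tail + 1
    return True
```

With this sketch, b"\xc0\xaf" (an overlong "/"), b"\xed\xa0\x80" (an
encoded surrogate) and b"\xe2\x82" (a truncated sequence) are all
rejected, while ordinary UTF-8 text is accepted.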

-- 
Philip Martin
Received on Fri Feb 6 17:12:40 2004

This is an archived mail posted to the Subversion Dev mailing list.
