
Re: Unicode byte-order mark (BOM)

From: Philip Martin <philip_at_codematters.co.uk>
Date: 2004-03-06 23:52:14 CET

Adal Chiriliuc <adal@myrealbox.com> writes:

> EF BB BF - UTF-8

Subversion could treat UTF-8 as text, but I'm not so sure about those
below.

> FE FF - UTF-16/UCS-2, big endian
> FF FE - UTF-16/UCS-2, little endian
> FF FE 00 00 - UTF-32/UCS-4, little endian
> 00 00 FE FF - UTF-32/UCS-4, big endian
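
For reference, a minimal sketch of sniffing these signatures, using the
BOM constants from Python's standard codecs module. The function name
sniff_bom and the returned labels are hypothetical choices made for this
illustration; this is not how Subversion detects file types.

```python
import codecs

# Check the longest signatures first: the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file starting FF FE 00 00 could in principle also be UTF-16 LE text
whose first character is U+0000, so the result is a guess, not a proof.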

The problem is that Subversion's internal 3-way merge treats files as
byte streams and splits lines on a \n (0x0A) byte. If such a byte occurs
anywhere other than the last byte of a multi-byte character, the split
lands in the middle of that character and the result could be an invalid
file. Unless the internal diff library is made multi-byte aware, these
encodings need to be treated as binary. Note: UTF-8 doesn't have this
problem. A 0x0A byte never appears inside a UTF-8 multi-byte sequence,
so it is safe for Subversion to treat it as text.
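
A small sketch of the failure mode, assuming a splitter that cuts after
every 0x0A byte. The helpers split_on_lf and decodes are hypothetical
names for this illustration and are not Subversion's diff code. The
sample text contains U+010A, whose UTF-16 encoding includes a 0x0A byte.

```python
def split_on_lf(data: bytes):
    """Cut data after every 0x0A byte, ignoring the encoding."""
    parts = data.split(b"\n")
    lines = [p + b"\n" for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # trailing bytes with no 0x0A
    return lines


def decodes(chunk: bytes, encoding: str) -> bool:
    """True if chunk is valid on its own in the given encoding."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


text = "a\u010Ab\n"  # one line of text; U+010A is not a newline

# UTF-16 LE bytes: 61 00 0A 01 62 00 0A 00
utf16_chunks = split_on_lf(text.encode("utf-16-le"))
# Three chunks: 61 00 0A / 01 62 00 0A / 00. The first and last
# have odd lengths, so they are not valid UTF-16 by themselves.
utf16_ok = [decodes(c, "utf-16-le") for c in utf16_chunks]

# UTF-8 bytes: 61 C4 8A 62 0A. The only 0x0A is the real newline.
utf8_chunks = split_on_lf(text.encode("utf-8"))
utf8_ok = [decodes(c, "utf-8") for c in utf8_chunks]
```

With UTF-16 the single line becomes three chunks, two of which cannot be
decoded on their own; with UTF-8 it stays one valid line.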

Is there an external diff3 program that handles these multi-byte
encodings?

-- 
Philip Martin
Received on Sat Mar 6 23:52:33 2004

This is an archived mail posted to the Subversion Dev mailing list.