
Re: Unicode byte-order mark (BOM)

From: Adal Chiriliuc <adal_at_myrealbox.com>
Date: 2004-03-07 00:04:20 CET

On Sunday, March 7, 2004 Philip Martin wrote:
> Adal Chiriliuc <adal@myrealbox.com> writes:

>> EF BB BF - UTF-8

> Subversion could treat UTF-8 as text, but I'm not so sure about those
> below.

>> FE FF - UTF-16/UCS-2, big endian
>> FF FE - UTF-16/UCS-2, little endian
>> FF FE 00 00 - UTF-32/UCS-4, little endian
>> 00 00 FE FF - UTF-32/UCS-4, big-endian
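
For reference, here is a minimal sketch of how these signatures could be
checked. It is not Subversion code; sniff_bom and BOMS are made-up names,
and it relies only on the BOM constants in Python's standard codecs
module. Per the Unicode standard, FE FF marks big-endian and FF FE marks
little-endian UTF-16. The longer UTF-32 signatures have to be tested
first, because FF FE 00 00 also begins with the UTF-16 little-endian
signature FF FE.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the 4-byte signatures are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

This is only a guess based on the first bytes: a UTF-16 little-endian
file whose first character after the BOM is U+0000 would be reported as
UTF-32 here, and a file without a BOM gives no answer at all.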

> The problem is that Subversion's internal 3-way merge treats files as
> byte streams and splits lines on a \n byte. If such a byte occurs
> anywhere other than the last byte of a multi-byte character the result
> could be an invalid file. Unless the internal diff library is made
> multi-byte aware then these encodings need to be treated as binary.
> Note: UTF-8 doesn't have this problem, it is safe for Subversion to
> treat it as text.
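
The effect described above can be illustrated with a small sketch. It is
not Subversion's diff code, only a stand-in that splits a byte stream on
the \n byte; split_bytewise and decodes_cleanly are hypothetical helper
names. The sample text contains U+010A, which UTF-16 little-endian
encodes as the bytes 0A 01, so a 0A byte appears in the middle of a
character.

```python
def split_bytewise(data: bytes):
    """Split a byte stream on the 0x0A byte, ignoring the encoding."""
    return data.split(b"\n")


def decodes_cleanly(chunk: bytes, encoding: str) -> bool:
    """Report whether a chunk is valid on its own in the given encoding."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


text = "a\u010ab\n"  # U+010A is 0A 01 in UTF-16 LE and C4 8A in UTF-8

# UTF-16 LE bytes: 61 00 | 0A 01 | 62 00 | 0A 00
utf16_chunks = split_bytewise(text.encode("utf-16-le"))
# UTF-8 bytes: 61 | C4 8A | 62 | 0A
utf8_chunks = split_bytewise(text.encode("utf-8"))

utf16_ok = [decodes_cleanly(c, "utf-16-le") for c in utf16_chunks]
utf8_ok = [decodes_cleanly(c, "utf-8") for c in utf8_chunks]
```

In the UTF-16 case the split lands inside U+010A and again between the
two bytes of the real newline, leaving odd-length chunks that are not
valid UTF-16 by themselves. In the UTF-8 case the 0A byte only ever
stands for a newline, so every chunk remains valid.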

I can't reproduce the problem right now, but I know what I did. I've
looked in the test repository for the original files I tested with,
and the file without the byte-order mark, which I said was not marked
as binary, is corrupted. It has line endings (CRLF) after every
character and is no longer Unicode!

I'm now trying to remember exactly what I did.

> Is there an external diff3 program that handles these multi-byte
> encodings?

I have no idea :) I started to use version control a week ago!

PS: I'm subscribed, so please don't Reply to all.

Adal Chiriliuc

Received on Sun Mar 7 00:05:49 2004

This is an archived mail posted to the Subversion Dev mailing list.