Unicode byte-order mark (BOM)

From: Adal Chiriliuc <adal_at_myrealbox.com>
Date: 2004-03-06 23:04:40 CET

There is a problem in the binary/text detector in Subversion 1.0.0 (Win32).
The Unicode standard defines a so-called byte-order mark, which is usually
placed at the beginning of a Unicode plain text file. This mark can
have these representations:

EF BB BF - UTF-8
FE FF - UTF-16/UCS-2, big endian
FF FE - UTF-16/UCS-2, little endian
FF FE 00 00 - UTF-32/UCS-4, little endian
00 00 FE FF - UTF-32/UCS-4, big endian
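
As a rough illustration of how these signatures could be recognized,
here is a minimal Python sketch. The names BOMS and detect_bom are
made up for this example and are not part of Subversion. The table is
ordered longest first because the UTF-32 little-endian mark begins with
the same two bytes as the UTF-16 little-endian mark.

```python
# BOM signatures, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for signature, encoding in BOMS:
        if data.startswith(signature):
            return encoding, len(signature)
    return None, 0
```

One caveat: a UTF-16 little-endian file whose first character after the
mark is U+0000 starts with the same four bytes as a UTF-32
little-endian file, so this check is a heuristic rather than a proof.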

When you save a plain text file as Unicode from Notepad (Windows XP),
Notepad adds this mark at the beginning of the file. If you then add
that file to a Subversion repository, it is marked as
application/octet-stream. If you remove the byte-order mark and add the
file again (under a different name, of course), it is not marked as
application/octet-stream.
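
One possible way for a detector to handle such files is to look for a
byte-order mark first and, if one is present, check that the rest of
the sample decodes in the indicated encoding. The Python sketch below
is only an illustration of that idea. It is not Subversion's actual
detection code; looks_like_text is a hypothetical name, and the
fallback for files without a mark (treat any NUL byte as binary) is a
deliberately naive assumption.

```python
import codecs

# Longest marks first, so UTF-32 LE is tried before UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def looks_like_text(data: bytes) -> bool:
    """Guess whether a byte sample is text, honoring a leading BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                # Decode everything after the mark in the encoding it names.
                data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                return False
            return True
    # No BOM: naive placeholder heuristic (assumption for this sketch).
    return b"\x00" not in data
```

If only a prefix of the file is sampled, the sample may end in the
middle of a character and fail to decode, so a real implementation
would need to tolerate a truncated final code unit.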

More info and some ideas on how to determine if a file is Unicode:
http://msdn.microsoft.com/library/en-us/intl/unicode_42jv.asp
http://msdn.microsoft.com/library/en-us/intl/unicode_81np.asp

Adal Chiriliuc

Received on Sat Mar 6 23:06:21 2004
