Re: Unicode byte-order mark (BOM)

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2004-03-06 23:14:21 CET

On Sat, 2004-03-06 at 17:04, Adal Chiriliuc wrote:
> EF BB BF - UTF-8
> FE FF - UTF-16/UCS-2, big endian
> FF FE - UTF-16/UCS-2, little endian
> FF FE 00 00 - UTF-32/UCS-4, little endian
> 00 00 FE FF - UTF-32/UCS-4, big-endian
>
> When you save a plain text file as Unicode from Notepad (Windows XP)
> it adds this mark at the beginning of the file. But then if you add
> that file to a Subversion repository, it's marked as
> application/octet-stream. If you remove the byte-order mark and add it
> again (under a different name, of course), it doesn't mark it as
> application/octet-stream.

That's perplexing. Here's how we determine whether a file is binary
right now:

  /* Right now, this function is going to be really stupid. It's
     going to examine the first block of data, and make sure that 85%
     of the bytes are such that their value is in the ranges 0x07-0x0D
     or 0x20-0x7F, and that 100% of those bytes is not 0x00.

     If those criteria are not met, we're calling it binary. */
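
A minimal Python sketch of the check that comment describes, not the actual Subversion C code. The 1024-byte block size, the names `BLOCK_SIZE`, `is_text_byte` and `looks_binary`, and the choice to treat an empty file as text are all assumptions made for illustration.

```python
BLOCK_SIZE = 1024  # assumption: size of the "first block" that gets examined


def is_text_byte(b: int) -> bool:
    """True if the byte falls in 0x07-0x0D or 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes) -> bool:
    """Apply the 85% / no-zero-byte heuristic to the first block of data."""
    block = data[:BLOCK_SIZE]
    if not block:
        return False  # assumption: an empty file is treated as text
    if 0 in block:
        return True  # any 0x00 byte means binary
    text_bytes = sum(1 for b in block if is_text_byte(b))
    # Binary unless at least 85% of the block is in the text ranges.
    return text_bytes * 100 < len(block) * 85
```

Under this sketch a very short file is the sensitive case: a UTF-8 BOM followed by the 12 bytes of "hello world\n" leaves only 12 of 15 bytes (80%) in range, so `looks_binary` returns True, while the same text repeated ten times passes comfortably.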

For UTF-8 text, the three-byte byte-order mark might nudge the count of
non-ASCII bytes just enough to make the first 1024 bytes less than 85%
ASCII, but most of the time it shouldn't matter. For UTF-16 or UTF-32
text, there are going to be a pile of zero bytes in there anyway, so it
will look binary regardless.
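
To make the difference between the encodings concrete, here is a rough Python sketch that counts zero bytes and in-range bytes for the same ASCII text saved three ways. The sample text and the helper name `profile` are hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs


def profile(block: bytes):
    """Return (count of 0x00 bytes, fraction of bytes in the text ranges)."""
    zeros = block.count(0)
    in_range = sum(1 for b in block if 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F)
    return zeros, in_range / len(block)


sample = "plain ASCII text\n" * 20  # hypothetical file contents, 340 characters

encodings = {
    "UTF-8 + BOM": codecs.BOM_UTF8 + sample.encode("utf-8"),
    "UTF-16 LE + BOM": codecs.BOM_UTF16_LE + sample.encode("utf-16-le"),
    "UTF-32 LE + BOM": codecs.BOM_UTF32_LE + sample.encode("utf-32-le"),
}

for name, data in encodings.items():
    zeros, fraction = profile(data[:1024])  # assumption: 1024-byte first block
    print(f"{name}: {zeros} zero bytes, {fraction:.0%} of bytes in range")
```

For ASCII-only text, the UTF-8 version contains no zero bytes at all, whereas every character in the UTF-16 version carries one zero byte and every character in the UTF-32 version carries three, so the zero-byte rule alone is enough to flag them.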

Received on Sat Mar 6 23:24:15 2004

This is an archived mail posted to the Subversion Dev mailing list.
