Re: Classifying files as binary or text

From: Stefan Sperling <stsp_at_elego.de>
Date: Thu, 12 Nov 2009 23:34:19 +0100

On Thu, Nov 12, 2009 at 02:02:09PM -0800, Mike Samuel wrote:
> Proposal:
> Classify a file as text if any of the following are true
> (1) the existing classification algorithm classifies it as text
> (2) the svn:mime-type property includes an attribute ";charset=..."
> with a non-blank charset name.

I like the idea.
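
For illustration, the two criteria could be sketched roughly as follows
in Python. This is only a sketch: the names charset_from_mime_type and
is_text are hypothetical, and legacy_is_text merely stands in for the
result of the existing classification algorithm. None of these are
Subversion APIs.

```python
from typing import Optional


def charset_from_mime_type(mime_type: str) -> Optional[str]:
    """Return the charset parameter of a MIME type string, or None.

    Hypothetical helper: a blank charset name counts as "no charset".
    """
    # Parameters follow the media type, separated by ';'.
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            value = value.strip().strip('"').strip()
            return value or None
    return None


def is_text(legacy_is_text: bool, mime_type: Optional[str]) -> bool:
    """Text if criterion (1) or criterion (2) of the proposal holds."""
    if legacy_is_text:
        return True
    return mime_type is not None and charset_from_mime_type(mime_type) is not None
```

With this sketch, a file with svn:mime-type set to
"application/xml; charset=UTF-8" would be treated as text even if the
existing check alone would have called it binary.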

> Variant 1:
> Append the criteria above with
> (3) Use the charset from svn:charset if there is none from (2)
> See http://svn.haxx.se/dev/archive-2008-06/0941.shtml

I'd say a separate svn:charset property is a different can of worms,
and can be dealt with later if necessary.

> Variant 1:
> If the charset from (2/3) is not recognized, ignore it.

I guess by "ignore it" you mean "use whatever result check (1) produced"?
Problem with this: Now we need to maintain a list of charsets in svn
rather than a list of mime-types.
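
To make the maintenance concern concrete, here is a rough Python sketch
of the "ignore unrecognized charsets" variant. It assumes Python's codec
registry as a stand-in for whatever charset list svn would have to
maintain. That registry also contains a few non-charset codecs, so it is
not an exact match. The names is_recognized_charset and classify are
hypothetical.

```python
import codecs
from typing import Optional


def is_recognized_charset(name: str) -> bool:
    """True if the name is known to Python's codec registry (a stand-in list)."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def classify(legacy_is_text: bool, charset: Optional[str]) -> bool:
    """A recognized charset means text; otherwise fall back to check (1)."""
    if charset and is_recognized_charset(charset):
        return True
    # Unrecognized or missing charset: use whatever check (1) produced.
    return legacy_is_text
```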

> Variant 2:
> If the charset from (2/3) is recognized but the file is not correctly
> encoded with it.
> If the charset is "UTF-8" and the byte sequence contains a sequence
> not allowed in UTF-8 (e.g. byte 0xFF) then ignore it.

I don't think we should be making special rules for UTF-8.
To be consistent we'd have to verify other charsets also, such as
ASCII (no 0x80 bit set), or various other multi-byte charsets.
And the only way to properly detect a valid UTF-8 sequence is to try
to decode it completely. So I don't think we should be looking at
file content. We should simply trust users to specify the charset
correctly.
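
As a rough illustration of what content verification would involve, the
Python sketch below validates a byte string by decoding all of it with
the declared charset. The helper name is_valid_encoding is hypothetical,
and Python's built-in decoders are used only as an example, not as what
svn would do.

```python
def is_valid_encoding(data: bytes, charset: str) -> bool:
    """True if every byte of data decodes cleanly under charset."""
    try:
        # Strict decoding has to walk the whole input before it can succeed.
        data.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence, or a charset name Python does not know.
        return False
    return True
```

For example, is_valid_encoding(b"abc\xff", "utf-8") is False because
0xFF never occurs in UTF-8, and is_valid_encoding(b"\x80", "ascii") is
False because the 0x80 bit is set. The cost is a full pass over the
file contents for every charset to be checked.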

Stefan

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417292
Received on 2009-11-12 23:34:41 CET
