Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 15:24:58 -0800

2009/11/12 Stefan Sperling <stsp_at_elego.de>:
> On Thu, Nov 12, 2009 at 02:02:09PM -0800, Mike Samuel wrote:
>> Proposal:
>>   Classify a file as text if any of the following are true:
>>   (1) the existing classification algorithm classifies it as text
>>   (2) the svn:mime-type property includes an attribute ";charset=..."
>> with a non-blank charset name.
>
> I like the idea.
>
>> Variant 1:
>> Extend the criteria above with
>>   (3) Use the charset from svn:charset if there is none from (2)
>> See http://svn.haxx.se/dev/archive-2008-06/0941.shtml
>
> I'd say a separate svn:charset property is a different can of worms,
> and can be dealt with later if necessary.

Ok. Unless someone wants to argue for it, let's consider it tabled.
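
To make rules (1) and (2) concrete, here is a rough Python sketch. It is
only an illustration, not svn code: the function names are made up, and
the "existing algorithm" is approximated by a bare "text/" prefix test,
which is an assumption rather than the real check.

```python
def charset_param(mime_type):
    """Return the charset parameter of a MIME type string,
    or None if it is absent or blank."""
    for part in mime_type.split(";")[1:]:
        name, sep, value = part.partition("=")
        if sep and name.strip().lower() == "charset":
            value = value.strip().strip('"').strip()
            return value or None
    return None


def existing_check_says_text(mime_type):
    # Hypothetical stand-in for rule (1); the real algorithm may differ.
    return mime_type.strip().lower().startswith("text/")


def is_text(mime_type):
    # Rule (1) OR rule (2): a non-blank charset parameter is enough.
    return existing_check_says_text(mime_type) or charset_param(mime_type) is not None
```

With this, "application/xhtml+xml; charset=UTF-8" counts as text, while
"application/octet-stream" and "application/xml; charset=" do not.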

>> Variant 1:
>> If the charset from (2/3) is not recognized, ignore it.
>
> I guess by "ignore it" you mean "use whatever result check (1) produced"?
> Problem with this: Now we need to maintain a list of charsets in svn
> rather than a list of mime-types.

Good point. Differences in the charsets available on different
platforms would complicate the "should not be overly sensitive to
version of svn" clause in the goal.
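
The platform dependence is easy to see in any runtime that ships its own
codec table. A small Python sketch (is_known_charset is a made-up name;
codecs.lookup is the standard library call) shows that "recognized" just
means "whatever this particular install knows about":

```python
import codecs


def is_known_charset(name):
    """True if this Python build has a codec registered under `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True
```

The answer for a given name can differ between builds and versions, which
is exactly the sensitivity we want to avoid.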

>> Variant 2:
>> If the charset from (2/3) is recognized but the file is not correctly
>> encoded with it, ignore it. For example, if the charset is "UTF-8" and
>> the byte sequence contains a sequence not allowed in UTF-8 (e.g. byte
>> 0xFF), then ignore it.
>
> I don't think we should be making special rules for UTF-8.
> To be consistent we'd have to verify other charsets also, such as
> ASCII (no 0x80 bit set), or various other multi-byte charsets.
> And the only way to properly detect a valid UTF-8 sequence is to try
> to decode it completely. So I don't think we should be looking at
> file content, and simply trust users to specify the charset correctly.

I was not arguing for special-casing UTF-8. UTF-8 is just one
encoding with enough built-in redundancy that a decoder can reject
certain byte sequences. UTF-16 has the same property for orphaned
surrogates. But your argument that charset support varies widely by
platform is equally apt here.
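
For reference, the kind of content check I had in mind amounts to a strict
full decode, which is what Stefan describes. A Python sketch
(decodes_cleanly is a hypothetical helper; bytes.decode with
errors="strict" is standard):

```python
def decodes_cleanly(data, charset):
    """True if `data` decodes under `charset` with no errors.

    An unknown charset name raises LookupError, which is left to the caller.
    """
    try:
        data.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Byte 0xFF fails under UTF-8, a lone surrogate fails under UTF-16, and any
byte with the 0x80 bit set fails under ASCII. Note the cost: the whole file
has to be read and decoded.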

> Stefan
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417310
Received on 2009-11-13 00:25:11 CET
